Add fetch phase to search profile

andrross commented 2 years ago

Is your feature request related to a problem? Please describe.

A significant performance regression exists for some use cases in the fetch phase between ES 7.9 and OS 1.0. The root cause was a change to the Lucene codec, and the regression has been mitigated by a change within Lucene that is present in OS 1.2. See issue #1647 for more details. I was able to profile the JVM using Java Flight Recorder, and the decompression during the fetch phase stood out as an obvious change, but this would have been much easier if the search profile results had contained timing metrics for the fetch phase.

Describe the solution you'd like

The "profile" section of the query response should contain information about the fetch phase.
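To illustrate the gap, here is a minimal Python sketch that walks a hand-written sample shaped like the "profile" section of a search response (as returned when the request body sets `"profile": true`). The sample values are illustrative, not measurements, and the field set is trimmed. The `"fetch"` key checked by `has_fetch_timings` is purely an assumption about what a fetch-phase section might be called; it is not an existing field confirmed by this issue.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hand-written sample shaped like the "profile" section of a search response.
# Values are made up for illustration; real responses carry more fields.
SAMPLE_PROFILE: Dict[str, Any] = {
    "shards": [
        {
            "id": "[node-1][my-index][0]",
            "searches": [
                {
                    "query": [
                        {
                            "type": "BooleanQuery",
                            "description": "+title:foo +body:bar",
                            "time_in_nanos": 900_000,
                            "children": [
                                {"type": "TermQuery", "description": "title:foo",
                                 "time_in_nanos": 400_000},
                                {"type": "TermQuery", "description": "body:bar",
                                 "time_in_nanos": 300_000},
                            ],
                        }
                    ],
                    "rewrite_time": 50_000,
                    "collector": [
                        {"name": "SimpleTopScoreDocCollector",
                         "reason": "search_top_hits",
                         "time_in_nanos": 120_000}
                    ],
                }
            ],
        }
    ]
}


def walk_queries(nodes: List[Dict[str, Any]], depth: int = 0) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (depth, node) for every query node in the profile tree, parents first."""
    for node in nodes:
        yield depth, node
        yield from walk_queries(node.get("children", []), depth + 1)


def profiled_nanos_per_shard(profile: Dict[str, Any]) -> Dict[str, int]:
    """Sum the top-level query, rewrite, and collector timings reported for each shard."""
    totals: Dict[str, int] = {}
    for shard in profile["shards"]:
        total = 0
        for search in shard["searches"]:
            total += sum(q["time_in_nanos"] for q in search["query"])
            total += search.get("rewrite_time", 0)
            total += sum(c["time_in_nanos"] for c in search["collector"])
        totals[shard["id"]] = total
    return totals


def has_fetch_timings(profile: Dict[str, Any]) -> bool:
    """Check for a per-shard "fetch" section (hypothetical key name, an assumption)."""
    return any("fetch" in shard for shard in profile["shards"])


if __name__ == "__main__":
    for depth, node in walk_queries(SAMPLE_PROFILE["shards"][0]["searches"][0]["query"]):
        print("  " * depth + f'{node["type"]}: {node["time_in_nanos"]} ns')
    print(profiled_nanos_per_shard(SAMPLE_PROFILE))
    print("fetch timings present:", has_fetch_timings(SAMPLE_PROFILE))
```

With a sample like this, every timing that can be summed comes from query, rewrite, or collector entries, so time spent loading and decompressing stored fields during fetch would not show up anywhere in the totals.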

Describe alternatives you've considered

Profiling the JVM can give a lot of insight into where time is being spent, but it is a rather complicated process and requires a lot of knowledge of the Java development ecosystem.

rishabhmaurya commented 4 months ago

Some other ways to improve the profile output in general:

Add support for individual functions in FunctionScoreQuery. Currently we have no breakdown of what happens in individual functions. Since a function comprises a regular query plus score-manipulation logic, it can sometimes take significantly longer, and when there are multiple such functions we have no way to identify the bottleneck.
It would be nice to also get a segment-level breakdown in the profile output, if that is feasible.
Integrate resource tracking into the profile output when enabled: https://github.com/opensearch-project/OpenSearch/issues/12399. We need to think more about how it can be integrated. The profile output breaks timing down by individual query clause, to the lowest-level Lucene query. Since query resource tracking does not give us information that granular, it might not be possible to integrate it down the tree to Lucene queries, but we can definitely add information at the shard level. cc @ansjcy @getsaurabh02

ansjcy commented 4 months ago

I agree with @rishabhmaurya's suggestions! Currently, as part of the efforts to capture query-level resource usage metrics and the coordinator-node-level took time, we are able to get a phase-level breakdown of latency and resource usage, but we can go even deeper and get insights into the resource usage and time consumed by some critical functions.

opensearch-project / OpenSearch

Add fetch phase to search profile #1764