Open mgodwan opened 1 year ago
@mgodwan -- The title of the issue says "evaluate", but then it proposes making the improvement.
Have you done the evaluation? If not, I think it would be relatively easy -- on a single-node cluster, you can start generating indexing traffic, enable Java Flight Recorder, record for a few minutes, turn off JFR (or launch it with a specified recording time in the first place), then stop generating index traffic. From the recording, you can get a CPU flame graph to see how much time is spent in IndexShard#prepareIndex versus everything else.
My intuition is that a lot more work is probably done during the addDocs/updateDocs steps (e.g. running Analyzers), but if prepareIndex is expensive, then we could avoid doing it redundantly.
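As a rough complement to eyeballing a flame graph, the recording described above can also be summarized programmatically. The sketch below assumes a recording file already exists (for example, one produced by starting the JVM with `-XX:StartFlightRecording=duration=5m,filename=indexing.jfr`, where the duration and file name are placeholders). It counts the fraction of `jdk.ExecutionSample` events whose stack contains a frame for `prepareIndex` on a class whose name ends in `.IndexShard`. The class name `PrepareIndexShare` and its helper methods are hypothetical, and the matching rule is an assumption that may need adjusting for a given build.

```java
import java.io.IOException;
import java.nio.file.Path;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordedFrame;
import jdk.jfr.consumer.RecordedStackTrace;
import jdk.jfr.consumer.RecordingFile;

public class PrepareIndexShare {
    // Assumed names for the frame of interest; adjust if they differ in your build.
    static final String TARGET_CLASS_SUFFIX = ".IndexShard";
    static final String TARGET_METHOD = "prepareIndex";

    /** True if the frame belongs to the method we want to attribute CPU samples to. */
    static boolean isTargetFrame(String className, String methodName) {
        return className.endsWith(TARGET_CLASS_SUFFIX) && methodName.equals(TARGET_METHOD);
    }

    /** Fraction (0.0 to 1.0) of samples that matched; 0.0 when there are no samples. */
    static double share(long matching, long total) {
        return total == 0 ? 0.0 : (double) matching / total;
    }

    /** Scans a JFR file and returns the share of CPU samples with the target frame on the stack. */
    static double shareInRecording(Path jfrFile) throws IOException {
        long total = 0;
        long matching = 0;
        try (RecordingFile recording = new RecordingFile(jfrFile)) {
            while (recording.hasMoreEvents()) {
                RecordedEvent event = recording.readEvent();
                // Only periodic CPU samples of Java threads are relevant here.
                if (!event.getEventType().getName().equals("jdk.ExecutionSample")) {
                    continue;
                }
                RecordedStackTrace stack = event.getStackTrace();
                if (stack == null) {
                    continue;
                }
                total++;
                for (RecordedFrame frame : stack.getFrames()) {
                    String className = frame.getMethod().getType().getName();
                    String methodName = frame.getMethod().getName();
                    if (isTargetFrame(className, methodName)) {
                        matching++;
                        break; // count each sample at most once
                    }
                }
            }
        }
        return share(matching, total);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PrepareIndexShare <recording.jfr>");
            return;
        }
        double fraction = shareInRecording(Path.of(args[0]));
        System.out.printf("prepareIndex share of CPU samples: %.1f%%%n", 100 * fraction);
    }
}
```

This only approximates what a flame graph shows: JFR samples are statistical, and deep stacks may be truncated before they reach the `prepareIndex` frame, in which case a larger stack depth setting for the recording may be needed.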
@msfroh I've already done some analysis. I will share the details.
In the past, we have seen some workloads (e.g. nyc taxis, http logs) spending up to 15-20% of CPU on parsing/prepareIndex.
> In the past, we have seen some workloads (e.g. nyc taxis, http logs) taking up to 15-20% of CPU for parsing/prepareIndex
Wow! That sounds like this could be a nice, comparatively-easy win. Please do share those details -- we should probably prioritize this work.
Is your feature request related to a problem? Please describe.
Today, when a document is replicated across multiple shards during indexing, each shard is responsible for parsing the source document to create fields. On replicas, this parsing can be avoided by letting the primary pass the parsed document instead of the source document. This would save the CPU spent in the IndexShard#prepareIndex step during indexing on replicas, and may provide better throughput with the current document replication.

Describe the solution you'd like
Explore whether we can pass the parsed document to replica shards in document replication mode, to reduce the work that replicas need to perform.