Performance Improvement: Document Parsing in OpenSearch

mgodwan commented 1 year ago

Is your feature request related to a problem? Please describe.

During indexing, OpenSearch spends time in two major phases: document parsing (i.e. converting the input JSON document to Lucene data types by parsing the entire JSON based on the mappings) and the actual indexing. Document parsing can account for 20-50% of total CPU usage during indexing, depending on the workload. The parsing process can be improved with more efficient techniques that exploit the capabilities of modern hardware, such as SIMD/SWAR. A few libraries already implement these techniques and could be used within OpenSearch to reduce the compute cost of document parsing and/or JSON serde in general. Some of these implementations are:

https://github.com/plokhotnyuk/jsoniter-scala (Scala implementation)
https://github.com/simdjson/simdjson (C++ implementation)
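
For readers unfamiliar with SWAR (SIMD within a register), here is a minimal sketch of the idea in Java: scan for a structural byte such as a double quote eight bytes at a time using plain 64-bit arithmetic. The class name `SwarScan` and the method `indexOf` are hypothetical and exist only for this illustration. This is not how jsoniter-scala or simdjson implement their parsers; it only shows the general flavor of the technique.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of SWAR byte scanning; not taken from any library.
public class SwarScan {
    private static final long ONES = 0x0101010101010101L; // 0x01 in every byte lane
    private static final long HIGH = 0x8080808080808080L; // 0x80 in every byte lane

    /** Returns the index of the first occurrence of target at or after from, or -1. */
    static int indexOf(byte[] data, int from, byte target) {
        // Little-endian reads put data[i] in the lowest byte lane of the long.
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        long pattern = ONES * (target & 0xFFL); // target replicated into all 8 lanes
        int i = from;
        for (; i + 8 <= data.length; i += 8) {
            long x = buf.getLong(i) ^ pattern;   // lanes equal to target become 0x00
            long hit = (x - ONES) & ~x & HIGH;   // lowest set bit marks the first zero lane
            if (hit != 0) {
                return i + (Long.numberOfTrailingZeros(hit) >>> 3);
            }
        }
        // Scalar fallback for the final partial word.
        for (; i < data.length; i++) {
            if (data[i] == target) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] json = "{\"fare_amount\": 12.5}".getBytes(StandardCharsets.US_ASCII);
        System.out.println(indexOf(json, 0, (byte) '"')); // prints 1
        System.out.println(indexOf(json, 2, (byte) '"')); // prints 13
    }
}
```

The loop tests eight bytes per iteration with a handful of integer operations and no per-byte branch. SIMD-based parsers apply the same idea with wider vector registers.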

A basic POC by replacing the usage of Jackson in document parsing path with a custom wrapper over jsoniter-scala with SAX style parser saw better latencies and throughput ranging from 4.5-6% with a few initial runs on the nyc_taxis workload.

Describe the solution you'd like

Introduce JSON XContent parsing/generation backed by the jsoniter-scala library, or by alternatives that use SIMD/SWAR techniques, for better throughput and latency. JSON parsing is critical to overall indexing speed, and JSON is also handled in multiple places within OpenSearch, including the search path, so improving this execution flow may yield significant gains.

anasalkouz commented 1 year ago

Hi @mgodwan, thanks for the proposal. I have a few comments.

> A basic POC by replacing the usage of Jackson in document parsing path with a custom wrapper over jsoniter-scala with SAX style parser saw better latencies and throughput ranging from 4.5-6% with a few initial runs on the nyc_taxis workload.

Could you share the details of the benchmarks?

I believe this is going to be less beneficial for SegRep, since replica shards do not do any document parsing. Could you run some benchmarks with SegRep enabled?

piotrrzysko commented 5 months ago

Hi, I'd like to let you know that there is a port of simdjson to Java: https://github.com/simdjson/simdjson-java. The project is still in a relatively early stage of development, so it might lack some features. However, I'm open to exploring and potentially adding what is missing. Would you be interested in verifying whether the library is a good fit for your use case?

opensearch-project / OpenSearch

Performance Improvement: Document Parsing in OpenSearch #7574