User Pain Point
Dataset is not in an open format: currently, all data is indexed inside the OpenSearch service, so users must go through the OpenSearch service to access and retrieve it.
No separation of read and write operations: a single cluster serves both indexing traffic and query traffic.
Complex data management workflow: users usually need to set up a pipeline to (1) ingest data into S3, (2) ingest it into OpenSearch, and (3) manage the index lifecycle with custom rules.
Proposed Solution
Each shard's index is encoded in the Lucene format.
The metadata is kept in the object store. It also includes skipping indexes, such as min/max values for each shard.
No server needs to be running to maintain an OpenSearch index. Transactions are achieved using an optimistic concurrency protocol.
Users only need to launch servers when they run queries, and they benefit from scaling compute and storage separately.
Technical Challenge
OpenSearch Data Format on Object Store
OpenSearch Data Format structure on object store
Metadata specification: a transaction log records actions. The actions include (1) Add/Remove and (2) Metadata Change.
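To illustrate how a reader could derive the current state of the dataset from such a transaction log, here is a minimal Python sketch. The JSON entry layout (`action`, `path`, `stats`, `changes`) and the function name `replay` are assumptions made for illustration only; the proposal does not define the log schema.

```python
import json


def replay(log_entries):
    """Fold an ordered list of JSON log entries into the current state.

    Hypothetical entry shapes (not defined by the proposal):
      {"action": "add", "path": ..., "stats": {...}}
      {"action": "remove", "path": ...}
      {"action": "metadata", "changes": {...}}
    """
    live_files = {}  # shard path -> per-shard stats (e.g. min/max)
    metadata = {}    # dataset-level metadata
    for raw in log_entries:
        entry = json.loads(raw)
        kind = entry["action"]
        if kind == "add":
            live_files[entry["path"]] = entry.get("stats", {})
        elif kind == "remove":
            live_files.pop(entry["path"], None)
        elif kind == "metadata":
            metadata.update(entry["changes"])
        else:
            raise ValueError(f"unknown action: {kind}")
    return live_files, metadata
```

Because the state is a pure function of the ordered log, any reader holding the same log prefix reconstructs the same set of live shards without contacting a server.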
Access Protocols
Optimistic concurrency protocols
Serializable isolation
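A minimal sketch of how an optimistic commit could work is shown below, assuming the object store offers an atomic put-if-absent primitive. `InMemoryObjectStore`, `log_key`, and `commit` are hypothetical names, and the in-memory store only stands in for a real object store. A full protocol would also need to check, before retrying, that the actions committed concurrently do not conflict with what this writer read; that check is omitted here.

```python
class InMemoryObjectStore:
    """Hypothetical stand-in for an object store with atomic put-if-absent."""

    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key, value):
        # Returns False instead of overwriting when the key already exists.
        if key in self._objects:
            return False
        self._objects[key] = value
        return True

    def latest_version(self):
        # Assumes the store holds only log entries; -1 means an empty log.
        return len(self._objects) - 1


def log_key(version):
    # Zero-padded so that keys sort in commit order.
    return f"_log/{version:020d}.json"


def commit(store, actions, max_retries=5):
    """Try to claim the next log version; retry if another writer won."""
    for _ in range(max_retries):
        version = store.latest_version() + 1
        if store.put_if_absent(log_key(version), actions):
            return version
        # Lost the race: re-read the latest version and try again.
    raise RuntimeError("commit failed after retries")
```

Since each log version can be written exactly once, successful commits form a single total order, which is the property serializable isolation builds on.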
Implement the OpenSearch Data Format Writer/Reader as a library.
Implement a Virtual Index in OpenSearch that is attached to the OpenSearch Data Format.
Implement DSL query rewrite logic that uses the skipping index.
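As a rough illustration of the query rewrite step, the sketch below uses per-shard min/max values to decide which shards a DSL `range` query needs to touch. The function name `prune_shards` and the layout of `shard_stats` are assumptions, and only the `gte`/`lte` bounds are handled. Shards without statistics are always kept, since they cannot be skipped safely.

```python
def prune_shards(range_query, shard_stats):
    """Return the shard ids whose [min, max] interval may overlap the query.

    range_query: a DSL range clause, e.g. {"range": {"ts": {"gte": 1, "lte": 9}}}
    shard_stats: hypothetical layout {shard_id: {field: {"min": x, "max": y}}}
    """
    field, bounds = next(iter(range_query["range"].items()))
    lo = bounds.get("gte", float("-inf"))
    hi = bounds.get("lte", float("inf"))

    keep = []
    for shard_id, stats in shard_stats.items():
        field_stats = stats.get(field)
        if field_stats is None:
            # No skipping index for this field: the shard must be searched.
            keep.append(shard_id)
        elif field_stats["max"] >= lo and field_stats["min"] <= hi:
            # The intervals overlap, so the shard may contain matches.
            keep.append(shard_id)
    return keep


stats = {
    "shard-0": {"ts": {"min": 0, "max": 99}},
    "shard-1": {"ts": {"min": 100, "max": 199}},
    "shard-2": {},
}
query = {"range": {"ts": {"gte": 150, "lte": 300}}}
print(prune_shards(query, stats))  # ['shard-1', 'shard-2']
```

The rewritten query would then be sent only to the surviving shards, so compute is launched only for data that can contain matches.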