[RFC] OpenSearch Data Format

penghuo commented 11 months ago

User pain points

Not an open-format dataset: Currently, all data is indexed inside the OpenSearch service, so users must rely on the OpenSearch service to access and retrieve the dataset.
No separation of read and write operations: a single cluster serves both indexing traffic and query traffic.
Complex data management workflow: users usually need to set up a pipeline to (1) ingest data into S3, (2) ingest it into OpenSearch, and (3) manage the index lifecycle with customized rules.

The indexes themselves are encoded in Lucene format, one per shard.
The metadata is stored in the object store. It also includes a skipping index, such as min/max values, for each shard.
No server needs to be running to maintain an OpenSearch index. Transactions are achieved using an optimistic concurrency protocol.
Users only need to launch servers when running queries, and benefit from scaling compute and storage separately.
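
The proposal does not spell out the optimistic concurrency protocol, so the following is only a minimal sketch of one common approach: a versioned commit log in which each writer tries to atomically create the next numbered log entry and retries if another writer got there first. The log layout, the function names (`latest_version`, `try_commit`, `commit`), and the entry fields are all assumptions for illustration, and a local temporary directory stands in for the object store.

```python
import json
import os
import tempfile


def latest_version(log_dir):
    """Return the highest committed version number, or -1 if the log is empty."""
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    ]
    return max(versions, default=-1)


def try_commit(log_dir, version, entry):
    """Atomically create the log file for `version`.

    Returns False if the file already exists, i.e. another writer
    committed this version first.
    """
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # O_EXCL makes creation fail if the file exists (hypothetical
        # stand-in for a conditional "put if absent" on an object store).
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        json.dump(entry, f)
    return True


def commit(log_dir, entry, max_retries=5):
    """Optimistically commit `entry`, retrying on conflict."""
    for _ in range(max_retries):
        version = latest_version(log_dir) + 1
        if try_commit(log_dir, version, entry):
            return version
    raise RuntimeError("too many concurrent commits")


# Illustrative usage: two shard commits recorded in an empty log.
log_dir = tempfile.mkdtemp()
v0 = commit(log_dir, {"add_shard": "shard-0", "min": 0, "max": 99})
v1 = commit(log_dir, {"add_shard": "shard-1", "min": 100, "max": 199})
```

Because a writer only needs an atomic "create if absent" operation, no long-running server has to coordinate commits, which matches the goal stated above.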

penghuo commented 11 months ago

Demo Streaming application
- Write 1 shard to fs every 5s
- Write skipping index for each shard.
Reuse searchable snapshot interface
- Restore unassigned shard from fs every 10s.
Rewrite DSL query with skipping index
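
As a rough illustration of how a min/max skipping index could let a range query avoid loading shards, here is a minimal sketch. The `ShardStats` structure, the `shards_to_load` helper, and the sample values are hypothetical and are not taken from the demo code; the demo rewrites the DSL query itself, whereas this sketch only shows the pruning decision.

```python
from dataclasses import dataclass


@dataclass
class ShardStats:
    """Per-shard min/max for one field (hypothetical metadata layout)."""
    shard_id: str
    min_value: int
    max_value: int


def shards_to_load(stats, gte, lte):
    """Return ids of shards whose [min, max] range overlaps [gte, lte].

    Shards that cannot contain a matching value are skipped entirely.
    """
    return [
        s.shard_id
        for s in stats
        if s.max_value >= gte and s.min_value <= lte
    ]


stats = [
    ShardStats("shard-0", 0, 99),
    ShardStats("shard-1", 100, 199),
    ShardStats("shard-2", 200, 299),
]

# A range query for values between 150 and 220 only needs two shards.
selected = shards_to_load(stats, gte=150, lte=220)
```

Here `shard-0` is never restored or scanned, since its maximum value lies below the lower bound of the query.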

[Screenshot: 2023-06-20, 7:21:23 AM]

dai-chen commented 11 months ago

Here is the demo video that covers the following topics:

OpenSearch Data Format, proposed in this issue, which removes the hard dependency on an OpenSearch cluster and separates the read and write paths
Virtual / External Index, which makes datasets on the object store accessible to OpenSearch. Please find more details in https://github.com/opensearch-project/sql/issues/1080
Skipping Index, which avoids unnecessary shard loads and scans. Please find more details in https://github.com/opensearch-project/opensearch-spark/issues/2

schenksj commented 11 months ago

This is very cool! Has any progress been made on the Spark SQL execution data sources side?