[RFC] Separation of compute & storage

bryanlb commented 5 days ago

Introduction

Modern distributed search engines like OpenSearch, ElasticSearch and Apache Solr were not designed from the ground up to truly take advantage of the cloud’s elasticity, neither were they built to leverage building blocks that have become foundational pieces in public cloud provider offerings like object storage.

Our proposal is to modify OpenSearch to adopt a cloud native architecture, separating compute from durability and storage. The durability of unindexed data would be provided by a persistent queue like Apache Kafka, and the storage for indexed data would be provided by object storage like Amazon S3.

This also enables alternate architectures, such as a deployment that does not keep a hot tier of data nodes, but use cold storage that streams results directly from object storage.

Goals

Cluster elasticity - by using object storage and removing the need for replicating data between OpenSearch nodes it provides the ability to scale up and scale down easily.
Selective resource allocation - compute resources can selectively allocated depending on the nature of the workload, enabling additional query throughout, ingest capacity, or both as needed.
Increase scale - enable easily scaling OpenSearch deployments into the 10s or 100s of Petabytes or more.
Increase cluster resilience - nodes can be quickly replaced, without impacting the availability of the cluster.

Non-Goals

Reduced operating cost - while beneficial, this should not be seen as a primary motivation for adopting this architecture.

Proposed architecture

We propose moving from the existing cluster model architecture to a stateless node architecture, using an event stream / write ahead log for unindexed data, and using object storage as the durable storage.

Ingest nodes - accept bulk ingest requests, submit to an event stream Event stream - Apache Kafka, Apache Pulsar, etc used as a write head log Indexing nodes - consume from the event stream to create indexes and upload to object storage Data nodes - fetch indexes from object storage and make available to query. Optionally can stream indexes directly from object storage. Coordinating nodes - perform scatter / gather of from data nodes Metadata store - Apache Zookeeper, etc used as a centralized store for node, index discovery Manager node - manages operation of the cluster

flowchart LR
  ingest(ingest node) -- event stream --> index 
  index(index node) --> objectstore
  objectstore{{object store}} --> datanode

  coord(coordinating node) <--> datanode
  datanode(data node)

Indexers and data nodes all communicate via a cluster manager and do not replicate any data between themselves.

Discussion

Should ingest nodes be queryable? If they are not queryable this may necessitate the introduction of a real-time node that can make results available quicker, potentially also bypassing the event stream.
Should data nodes perform both hot and frozen searches? Should this be responsibility be split into separate node types dedicated to their respective functions?
In this approach some mapping issues may not be discovered at ingest, and only caught during indexing. Would this be a problem for most users, and how should this be handled?

Summary

We believe moving towards a stateless node architecture will enable operators of OpenSearch deployments to more quickly adapt to changing workload requirements, improve cluster resource utilization, and enable scaling to larger deployments.

References

Slack Astra Search Engine - https://slackhq.github.io/astra/architecture.html#system-overview The Snowflake Elastic Data Warehouse - https://dl.acm.org/doi/10.1145/2882903.2903741

Proposal co-authored by @vthacker and @bryanlb for @slackhq

Pallavi-AWS commented 5 days ago

Thanks @bryanlb for creating this RFC on stateless node architecture. We will join hands with the work going on for reader/writer separation under https://github.com/opensearch-project/OpenSearch/issues/14596 (cc: @sohami @andrross @mch2 @getsaurabh02 @msfroh)

getsaurabh02 commented 4 days ago

Thanks @bryanlb for starting this RFC. Its well-structured proposal, highlighting the significant benefits of separating compute and storage. It aligns with the Reader and Writer Separation RFC, which also advocates for dedicated node roles, moving us in the same direction.

The high-level goals, such as traffic segregation, separation of concerns for resilience, and independent scalability, are substantial. Ability to scale independently adds significant value from an infrastructure perspective, allowing the use of heterogeneous instance types for different node roles. Additionally, this architecture enables us to tackle more complex problems going forward, such as implementing independent sharding schemes for readers and writers based on traffic patterns (or shard heat). Also, performing post-processing tasks like creating rollups or high-level pre-compute caches/indices for improved read performance can be achieved in better isolation.

The use of object storage for indexed data and a persistent queue like Apache Kafka for unindexed data ensures durability and scalability. It also addresses the indexing scale problem in today's world. With Pull based indexing approach, we can dynamically allocate resources based on workload characteristics, which will help handling varying query loads and ingest rates.

Furthermore, revamping the metadata store should be broadly considered in both proposals. It's also an opportunity to segregate the cluster state with more concise and relevant information based on node roles.

One thing to consider is the potential increase in read after write latency, especially when fetching indexes from object storage? It might be worth to think what strategies can we employ to optimize the performance of real-time queries in this new architecture?

kogent commented 4 days ago

One thing to consider is the potential increase in read after write latency, especially when fetching indexes from object storage? It might be worth to think what strategies can we employ to optimize the performance of real-time queries in this new architecture?

i think that is called out in the question from the proposal:

Should ingest nodes be queryable? If they are not queryable this may necessitate the introduction of a real-time node that can make results available quicker, potentially also bypassing the event stream.

getsaurabh02 commented 3 days ago

Adding @msfroh @andrross @sohami @mch2 for their feedback/comments.

opensearch-project / OpenSearch