[RFC] High Level Vision for Storage in OpenSearch

andrross commented 2 years ago

Introduction

This issue proposes how remote storage can be used in order to improve the performance, cost, and scaling properties of OpenSearch. It intends to build upon the warm storage features that have been added to Amazon OpenSearch Service to solve tiered storage use cases as well as expand the capabilities to be applicable to more use cases.

State of the world today

OpenSearch follows a cluster model where multiple hosts are provisioned into a single logical cluster and are used to provide data durability, compute capacity, and storage capacity. Indexes are sharded in order to horizontally scale data across the cluster. Durability is provided by a primary/replica model where shards are replicated across multiple hosts in the cluster. The same primary/replica model enables scaling compute capacity because both primaries and replicas are able to serve search queries. Storage capacity can be increased by adding hosts to the cluster and therefore increasing the number hosts on which to place shards.

Off-cluster durability can be provided via the snapshot mechanism, though snapshots are periodic meaning they cannot be relied upon for durability without accepting the possibility of losing some recent writes. Snapshots cannot be queried directly by OpenSearch and must be restored to an index (i.e. copied from remote storage into cluster-based storage) in order to be queried.

Ultrawarm is a feature of Amazon OpenSearch Service that allows users to query index data that is stored in Amazon S3. It introduces the concept of the hot storage tier (i.e. cluster-based) and the warm storage tier (i.e. S3 based). Users are able to move an index from the hot tier to the warm tier, at which point it will no longer consume storage resources from the OpenSearch clusters. Users are required to provision dedicated compute capacity that is used to query the warm tier and can be scaled independently from the hot tier.

This cluster model works well for workloads with balanced compute and storage requirements, but is difficult for other workloads to scale in a cost efficient manner. UltraWarm is a first step in allowing storage to scale independently from the compute tier, but some of the architectural tradeoffs make it difficult to use for use cases other than log analytics. Our goal is to open source the applicable parts of UltraWarm while ultimately improving it based on what we’ve learned from it.

Vision for the future

Going forward we will continue to improve the performance and reduce the cost of OpenSearch where possible. Segment replication (https://github.com/opensearch-project/OpenSearch/issues/2229) is a project to replace document-based replication in order to better utilize host resources within a cluster. This will improve both indexing and search throughput for OpenSearch clusters. To enable further optimizations for a given use case, we will introduce mechanisms to specialize the functions of data nodes within a cluster, relying on the unique capabilities of remote cloud storage, in order to allow fine tuning the cluster topology to achieve the optimal balance of cost and performance.

When fully implemented, these features will allow users to create nodes specialized to perform indexing operations and nodes specialized to serve queries. The “indexer” nodes will function much the same as OpenSearch nodes do today, with the added capability to continually replicate their index data to remote storage. Part of this work is already underway in Add Remote Storage for Improved Durability (https://github.com/opensearch-project/OpenSearch/issues/1968). The nodes specialized for serving queries can then be configured to pull the index data from remote storage. This is work that will borrow from and build on top of the existing UltraWarm implementation. The end state allows for creating a system that can be highly optimized for specific use cases. Note that these features aim to add more flexibility; OpenSearch will remain easy to get started with and will continue to support running in the traditional zero-external-dependency cluster mode.

For example, a search application that needs very low latency query capability can configure the minimal number of writer nodes to handle the indexing requests, while scaling up a large number of read-only nodes tailored to serve the high rate of search requests. On the other hand, a log analytics type system can leverage these features to build a warm tier-like solution. A more balanced fleet of writers nodes can be configured as the “hot” tier for the frequently mutated indexes, while queries to all indexes will be be served by the read-only nodes. No explicit migration of data has to happen in this scenario because the writer nodes have been continually uploading index data to the remote store.

What are the things that need to change to get to the this end state

The end-to-end user experience and APIs need to be defined for these features, i.e.

How do I configure an index to be sync’d to remote storage?
How do I configure read-only nodes that will read from data in remote storage?
How do I direct queries between the write nodes and read-only nodes? (writes will go to the node with the primary copy of the shard, but if I have read-only nodes configured then a read could plausibly go to a “normal” writer node or a read-only node)
How do I remove an index from the writeable fleet but keep it queryable from the read only nodes?
How does cold storage (i.e. non-queryable indexes that are able to be promoted into a queryable tier) fit in?

For the write path, the Add Remote Storage for Improved Durability (https://github.com/opensearch-project/OpenSearch/issues/1968) project is building most of the pieces that will be required. This project is implementing the low-level pieces, but how it will integrate into the end-to-end user experience has not yet been defined.

For the read path, we will take much of the work that has been done in UltraWarm, but will need to expand on it to work as a more general feature. Namely, the solution cannot assume that index data is immutable and will need to periodically refresh to pull the latest data from the remote store. Additionally, UltraWarm currently merges an index to a single segment to optimize performance during reads. This will not be possible in all cases for this design, so we will likely need to implement concurrent searching of segments to get acceptable performance.

How Can You Help?

Any general comments about the overall direction are welcome. Some specific questions:

What are your use cases where this architecture would be useful?
What specific remote storage systems would you like to use with OpenSearch?
Have you implemented any work-arounds outside of OpenSearch that could better be solved by this architecture?

Next Steps

We will incorporate the feedback from this RFC into a more detailed proposal/high level design that will integrate the storage-related efforts in OpenSearch. We will then create meta issues to go in more depth on the components involved to continue the detailed design.

jkowall commented 2 years ago

This is really wonderful work, and makes sense. I like the new pattern for implementing read-heavy workloads. How will ISM factor into the design of this for moving data to S3 for example? Will that be required? I didn't see it mentioned here.

andrross commented 2 years ago

How will ISM factor into the design of this for moving data to S3 for example? Will that be required?

There's a version of this where the index data is being continually sync'd to remote storage and therefore no explicit migration of data has to happen. But we're working on further refining all the use cases and definitely welcome feedback in this area.

reta commented 2 years ago

Thanks for summarizing the whole bunch of discussions @andrross , I have one question / note though:

For the read path, we will take much of the work that has been done in UltraWarm, but will need to expand on it to work as a more general feature. Namely, the solution cannot assume that index data is immutable and will need to periodically refresh to pull the latest data from the remote store. Additionally, UltraWarm currently merges an index to a single segment to optimize performance during reads.

Wasn't the goal of UltraWarm to manage read-only / frozen indices in the first place? I believe the case when clients have to manage large datasets (fe for compliance reasons) but at the same time only keep the small subset of the indices hot is the most common scenario, the storage cost reduction could be tremendous (think of S3 / Glacier). Otherwise, it looks to me as a regular pluggable storage (discussed in https://github.com/opensearch-project/OpenSearch/issues/1968), wdyt?

andrross commented 2 years ago

Wasn't the goal of UltraWarm to manage read-only / frozen indices in the first place?

The idea here is to build an architecture that can solve the use cases for which UltraWarm currently works well, while also handling cases where the read-only / frozen nature is problematic (i.e. read heavy cases where infrequent updates are necessary).

andrross commented 2 years ago

Thanks everyone for the feedback. We will keep this issue open for one more week (until 5/13/2022) and then close this discussion. We'll continue tracking the various design and implementation tasks in linked issues.

opensearch-project / OpenSearch