[RFC] Support for writable warm indices on Opensearch #12809

Open ankitkala opened 7 months ago

ankitkala commented 7 months ago

Is your feature request related to a problem? Please describe

OpenSearch supports remote-backed storage, which allows users to back up the cluster's data to a remote store, thus providing a stronger durability guarantee. The data is always hot (present locally on disk) and the remote store acts as a backup.

We can also leverage the remote store to support shard-level data tiering, where not all data is guaranteed to be present locally but can instead be fetched at runtime on demand (let's call this a warm shard). This will allow storage to scale separately from compute, and users would be able to store more data with the same number of nodes.

Benefits:

Describe the solution you'd like

What is Writable warm: A warm index is an OpenSearch index where the entire index data is not guaranteed to be present locally on disk. The shards are always open and assigned to one of the eligible nodes, the index is active, and its metadata is part of the cluster state.

Possible configurations:

Scope:

Includes:

Excludes(will be covered as a separate task):

Core Components:

[Figure: TieredCache node view diagram]

Composite Directory: The Directory abstraction provides a set of common APIs for creating, deleting, and modifying files within the directory, regardless of the underlying storage mechanism. The Composite Directory will abstract out all the data tiering logic and make the Lucene index (aka shard) data-locality agnostic.
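
A minimal sketch of how that routing could look, assuming a hypothetical LocalFileTracker and, for brevity, delegating the warm path straight to a remote-backed Directory (a real implementation would return a block-based, cache-backed IndexInput instead):

```java
import java.io.IOException;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

/**
 * Sketch only: routes reads to the local directory when the file is fully
 * present on disk, otherwise falls back to the remote-backed directory.
 */
public class CompositeDirectorySketch extends FilterDirectory {
    private final Directory remoteDirectory;     // remote store backed directory (assumed)
    private final LocalFileTracker fileTracker;  // hypothetical locality tracker

    public CompositeDirectorySketch(Directory local, Directory remote, LocalFileTracker tracker) {
        super(local);
        this.remoteDirectory = remote;
        this.fileTracker = tracker;
    }

    @Override
    public IndexInput openInput(String name, IOContext context) throws IOException {
        if (fileTracker.isFullyLocal(name)) {
            return super.openInput(name, context);        // hot path: serve from local disk
        }
        return remoteDirectory.openInput(name, context);  // warm path: fetch on demand
    }

    /** Hypothetical component that knows which files are fully present locally. */
    public interface LocalFileTracker {
        boolean isFullyLocal(String fileName);
    }
}
```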

File Cache: The File Cache will track the Lucene files for all shards (using the composite directory) at the node level. An implementation of FileCache already exists (from the searchable snapshot integration). We'll need to build support for additional features like cache eviction, file pinning, tracking hot indices, etc.
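
For illustration only, a node-level sketch (hypothetical, not the existing searchable-snapshot FileCache implementation) of byte-capacity LRU eviction with simple usage stats:

```java
import java.nio.file.Path;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Sketch only: a node-level, byte-capacity LRU over local block files, with
 * simple usage stats that could feed disk-pressure monitors or a hot/warm
 * migration decider.
 */
public class FileCacheSketch {
    private final long capacityInBytes;
    private long usedBytes;
    private long evictionCount;
    // accessOrder=true makes iteration order least-recently-used first
    private final LinkedHashMap<Path, Long> fileSizes = new LinkedHashMap<>(16, 0.75f, true);

    public FileCacheSketch(long capacityInBytes) {
        this.capacityInBytes = capacityInBytes;
    }

    public synchronized void onFileAdded(Path file, long sizeInBytes) {
        Long previous = fileSizes.put(file, sizeInBytes);
        usedBytes += sizeInBytes - (previous == null ? 0 : previous);
        evictIfNeeded();
    }

    public synchronized long usedBytes()     { return usedBytes; }
    public synchronized long evictionCount() { return evictionCount; }

    private void evictIfNeeded() {
        Iterator<Map.Entry<Path, Long>> it = fileSizes.entrySet().iterator();
        while (usedBytes > capacityInBytes && it.hasNext()) {
            Map.Entry<Path, Long> leastRecentlyUsed = it.next();
            usedBytes -= leastRecentlyUsed.getValue();
            evictionCount++;
            it.remove();   // a real implementation would also delete the local file
        }
    }
}
```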

Index Inputs: For all read operations, the composite directory should be able to create/return different index inputs, e.g.:

Prefetcher: To maintain optimal read & write performance, we want to cache some of the data locally to improve various shard-level operations (read, write, flush, merges, etc.). This component should be able to trigger prefetching of certain files/blocks when the shard opens or the replica refreshes.
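
A rough sketch of the shape such a component could take (names are hypothetical), prefetching asynchronously so that the shard lifecycle is not blocked on downloads:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

/**
 * Sketch only: asynchronously warms the cache with a configured set of
 * files/blocks when a shard opens or a replica refreshes.
 */
public class PrefetcherSketch {
    /** Hypothetical source that fetches a named file/block into the local cache. */
    public interface BlockFetcher {
        void fetch(String fileOrBlockName);
    }

    private final BlockFetcher fetcher;
    private final ExecutorService executor;

    public PrefetcherSketch(BlockFetcher fetcher, ExecutorService executor) {
        this.fetcher = fetcher;
        this.executor = executor;
    }

    /** Called on shard open or replica refresh with the blocks considered essential. */
    public CompletableFuture<Void> prefetch(List<String> essentialBlocks) {
        CompletableFuture<?>[] futures = essentialBlocks.stream()
            .map(block -> CompletableFuture.runAsync(() -> fetcher.fetch(block), executor))
            .toArray(CompletableFuture[]::new);
        return CompletableFuture.allOf(futures);
    }
}
```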

User flows:

High level changes (more details to be added in individual issues):

Performance optimizations:

Other considerations:

Hot/warm migration strategy:

To provide the best user experience for data tiering (hot/warm/cold), we should be able to seamlessly migrate the cluster from hot to warm. To support this, here are a few key decisions:

Supporting custom codecs

One caveat with prefetching is that to prefetch certain blocks you might need to read the files using methods which are not exposed by the respective readers in Lucene. A few examples:

Future Enhancements

Here are some enhancements to explore after the performance benchmarking.

FAQs:

Q: How are files evicted from disk? Files can be evicted from disk due to:

Q: How would merges work for a warm shard? Merges can require downloading files from the remote store. Eventually we'd want to support offline merges on a dedicated node (related issue).

Q: How do we determine the essential data to be fetched for a shard? We can actively track the most-used block files based on FileCache statistics. This information can be retained in the remote store, forming the set of essential files to be downloaded while initializing a shard.
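
Purely illustrative sketch (hypothetical names) of the kind of usage tracking this would need:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.Collectors;

/**
 * Sketch only: counts block accesses and returns the most-used blocks, which
 * could then be persisted alongside the shard's remote metadata and downloaded
 * up front when the shard is initialized.
 */
public class BlockUsageTrackerSketch {
    private final Map<String, LongAdder> accessCounts = new ConcurrentHashMap<>();

    /** Record a read of the given block (e.g. from the file cache read path). */
    public void recordAccess(String blockName) {
        accessCounts.computeIfAbsent(blockName, k -> new LongAdder()).increment();
    }

    /** The N most frequently accessed blocks, i.e. the "essential" set. */
    public List<String> topBlocks(int n) {
        return accessCounts.entrySet().stream()
            .sorted(Comparator.comparingLong(
                (Map.Entry<String, LongAdder> e) -> e.getValue().sum()).reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```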

ankitkala commented 7 months ago

@sohami @mch2 @neetikasinghal @andrross

sohami commented 7 months ago

Thanks for creating this.

File Cache: The File Cache will track the Lucene files for all shards (using the composite directory) at the node level. An implementation of FileCache already exists (from the searchable snapshot integration). We'll need to build support for additional features like cache eviction, file pinning, tracking hot indices, etc.

Keeping the file cache the same for hot/warm indices sounds great, but currently the cache has locking semantics. Probably that will impact the existing hot indices' performance, especially around search, since writes will mean creating new segments at each refresh interval? I think this will help us decide whether FileCache should track hot indices' files as well or not. One benefit of a common FileCache is managing the entire local disk space dynamically between hot/warm indices instead of defining static configs. But probably we can achieve that by keeping a common tracking component for disk usage between hot/warm indices. FileCache could manage the warm indices' local files and provide mechanisms to trigger eviction when invoked by this tracking component, along with exposing usage statistics? This can later be used by disk-usage-based backpressure monitors or other consumers like migration components to decide whether more indices can be moved to the warm tier or not (both dedicated vs non-dedicated setups).

Index Inputs: For all read operations, the composite directory should be able to create/return different index inputs, e.g.:

Will this depend on the index type or on a specific file of an index? For hot, if all data is locally present, then it will use IndexInput. For warm, whether the data is local or downloaded at request time, won't it use the same BlockIndexInput?

Prefetcher: To maintain optimal read & write performance, we want to cache some of the data locally to improve various shard-level operations (read, write, flush, merges, etc.). This component should be able to trigger prefetching of certain files/blocks when the shard opens or the replica refreshes.

It's not clear whether this component will be used to prefetch these files before making a shard active, or whether it will be a sort of async prefetch where the shard is activated irrespective of whether files are prefetched or not. One thing which would be useful is to evaluate what shard-level data/metadata is a must-have to make it active and ready to serve different types of requests like search/indexing, including requests for segment/index-level stats. Otherwise these requests may end up waiting on the download/prefetch to complete (which could take time when multiple shards are performing prefetch) and eventually fail even though the shard is deemed active.

For a seamless transition from hot to warm, we can use CompositeDirectory & FileCache for hot files as well. From the Composite Directory's perspective, it should be able to operate with a dynamic index-level setting which defines the percentage of data to retain on disk (100% being hot).

Will users still be able to create hot remote-store-based indices without the composite directory, or will we make the composite directory the default for such indices, along with a mechanism to migrate existing indices to use it?

peternied commented 7 months ago

[Triage - attendees 1 2 3 4 5 6 7] @ankitkala Thanks for creating this RFC, looking forward to seeing how this evolves.

ankitkala commented 7 months ago

Keeping the file cache the same for hot/warm indices sounds great, but currently the cache has locking semantics. Probably that will impact the existing hot indices' performance, especially around search, since writes will mean creating new segments at each refresh interval? I think this will help us decide whether FileCache should track hot indices' files as well or not.

We'll need to move away from the segmented cache with an LRU-level write lock. We can use file-level locking mechanisms instead, which will ensure that there isn't any contention.
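
Roughly the kind of file-level locking I have in mind (illustrative sketch only, not the actual implementation):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/**
 * Sketch only: one lock per file instead of a per-segment LRU write lock, so
 * hot-index refreshes creating new segments never contend with warm-index
 * block downloads touching unrelated files.
 */
public class PerFileLockingSketch {
    private final ConcurrentHashMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

    private ReadWriteLock lockFor(String fileName) {
        return locks.computeIfAbsent(fileName, name -> new ReentrantReadWriteLock());
    }

    /** Reads of the same file can proceed concurrently. */
    public <T> T withReadLock(String fileName, Supplier<T> action) {
        ReadWriteLock lock = lockFor(fileName);
        lock.readLock().lock();
        try {
            return action.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Mutations (add/evict) of a file take its exclusive lock only. */
    public void withWriteLock(String fileName, Runnable action) {
        ReadWriteLock lock = lockFor(fileName);
        lock.writeLock().lock();
        try {
            action.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```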

One benefit of a common FileCache is managing the entire local disk space dynamically between hot/warm indices instead of defining static configs. But probably we can achieve that by keeping a common tracking component for disk usage between hot/warm indices.

Regarding the common tracking component for disk usage, that is definitely one alternative we were considering. It can be discussed in more detail when we start working on supporting warm indices without dedicated warm nodes.

FileCache could manage the warm indices' local files and provide mechanisms to trigger eviction when invoked by this tracking component, along with exposing usage statistics? This can later be used by disk-usage-based backpressure monitors or other consumers like migration components to decide whether more indices can be moved to the warm tier or not (both dedicated vs non-dedicated setups).

Makes sense. We're starting with the File Cache tracking only warm files for now. Exposing cache stats is also already planned, and those can be consumed by the mechanisms you mentioned above.

Index Inputs: For all read operations, the composite directory should be able to create/return different index inputs, e.g.:

Will this depend on the index type or on a specific file of an index? For hot, if all data is locally present, then it will use IndexInput. For warm, whether the data is local or downloaded at request time, won't it use the same BlockIndexInput?

Yes, this mostly depends on the state of the file. Apart from the two IndexInputs that you mentioned,

Prefetcher: To maintain optimal read & write performance, we want to cache some of the data locally to improve various shard-level operations (read, write, flush, merges, etc.). This component should be able to trigger prefetching of certain files/blocks when the shard opens or the replica refreshes.

It's not clear whether this component will be used to prefetch these files before making a shard active, or whether it will be a sort of async prefetch where the shard is activated irrespective of whether files are prefetched or not. One thing which would be useful is to evaluate what shard-level data/metadata is a must-have to make it active and ready to serve different types of requests like search/indexing, including requests for segment/index-level stats. Otherwise these requests may end up waiting on the download/prefetch to complete (which could take time when multiple shards are performing prefetch) and eventually fail even though the shard is deemed active.

The intention was to encapsulate the prefetch-related logic here. For essential data, we'll start by not prefetching anything and then iterate based on performance benchmarking. This should help us deterministically define the essential files to prefetch before marking a shard as active.

For a seamless transition from hot to warm, we can use CompositeDirectory & FileCache for hot files as well. From the Composite Directory's perspective, it should be able to operate with a dynamic index-level setting which defines the percentage of data to retain on disk (100% being hot).

Will users still be able to create hot remote-store-based indices without the composite directory, or will we make the composite directory the default for such indices, along with a mechanism to migrate existing indices to use it?

We'll need to make the composite directory the default if we really want to support seamless hot/warm migration. One way to do this is behind a cluster-level setting which enables it for all hot remote-backed indices.

sohami commented 7 months ago

Regarding the common tracking component for disk usage, that is definitely one alternative we were considering.

From my understanding, making the composite directory the default for hot/warm indices vs using FileCache for both seem to be two different, independent items. I am fine with keeping composite as the default (while allowing existing indices to seamlessly move towards that). However, for FileCache let's call out this alternative and we can evaluate both. With file-level locks there could be additional overhead from holding and managing a lock per file (probably some benchmarks will help make these choices). So if the other alternative doesn't have any serious cons then we can use that.

Yes, this mostly depends on the state of the file. Apart from the two IndexInputs that you mentioned,

we need a separate IndexInput which will be put in the FileCache. The IndexInput returned by the directory would be a wrapper over this (one index input should be able to prefetch multiple blocks, which internally can be separate index inputs in the file cache).

I was assuming BlockIndexInput or the existing FileCachedIndexInput is the one which is essentially a wrapper in the FileCache over block-level files. An IndexInput being able to prefetch multiple blocks seems like some sort of optimization for the merge case? Otherwise, the merge process should just go over different sections of the segment files, download on demand, and perform the merge as it goes, since on the Lucene side the contracts are defined in terms of Directory and IndexInputs/Outputs for any IO.

We also need an IndexInput for newly created files to ensure that these are not evictable unless the file has been uploaded to the remote store.

Got it. This seems more of a FileCache semantic, i.e. providing a pinning mechanism for some of the keys. Can it be implemented in the cache itself instead of being exposed to the IndexInput layer? That way the mechanism will be generic and independent of the value type stored in the cache.
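
Something along these lines (illustrative sketch only): the pin state lives inside the cache, keyed by cache key and generic over the value type, and the eviction policy consults it:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Sketch only: pinning kept entirely inside the cache, independent of the
 * value type, so callers (e.g. the upload path) pin/unpin keys without the
 * IndexInput layer knowing anything about eviction.
 */
public class PinnableKeysSketch<K> {
    private final Map<K, AtomicInteger> pins = new ConcurrentHashMap<>();

    public void pin(K key) {
        pins.computeIfAbsent(key, k -> new AtomicInteger()).incrementAndGet();
    }

    public void unpin(K key) {
        pins.computeIfPresent(key, (k, count) -> count.decrementAndGet() <= 0 ? null : count);
    }

    /** The eviction policy consults this before removing an entry. */
    public boolean isPinned(K key) {
        return pins.containsKey(key);
    }
}
```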

We can discuss this in detail with each individual issue.

Sure

andrross commented 7 months ago

@ankitkala @sohami Regarding the locking semantics of the FileCache, we need to ensure we're not over-designing here. Conceptually, the file cache is an in-memory structure that maintains a small amount of state for each file on disk (a size, a reference count, some kind of ordering to implement LRU, etc). We have a microbenchmark that shows we can get over 1000 operations per millisecond out of a single segment of this cache. My intuition is that the overhead of maintaining the file cache accounting is pretty negligible in the grand scheme of things.

navneet1v commented 4 months ago

For optimizing the search path, we might want to prefetch the StoredFields for certain documents, which requires getting the file offset for the document ID. These use cases require us to support methods which aren't exposed, thus requiring us to maintain Lucene-version-specific logic to support the prefetch operation.

@ankitkala will a new codec be added to support the writable warm tier? From reading this, I feel we might be adding a new codec which will, for the most part, use the Lucene codec but will also expose functions for doing prefetch. Is this understanding correct?