[RFC] Support for writable warm indices on Opensearch

Is your feature request related to a problem? Please describe

OpenSearch supports remote backed storage which allows users to backup the cluster's data on a remote storage thus providing stronger durability guarantee. The data is always hot(present locally on disk) and remote storage acts as a backup.

We can also leverage the remote store to support shard level data tiering where all data is not guaranteed to be present locally but instead can be fetched at runtime on demand basis(let's say warm shard). This will allow the storage to scale separately from compute and user would be able to store more data with same number of nodes.

Benefits:

Lower cost to customers as they can store more data per node.
Can be further leveraged for building data tiering like hot, warm & cold(cold being data archived on a durable storage. exact terminologies can be decided later)
Fundamental building block for OpenSearch to be able to support serverless architecture with a writable remote storage.

Describe the solution you'd like

What is Writable warm: A warm index is an OpenSearch index where the entire index data is not guaranteed to be present locally on disk. The shards are always open & assigned on one of the eligible nodes, the index is active and the metadata is part of the cluster state.

With warm index support, the lucene index(shard) can operate by retaining only the "essential" data on disk and fetch the remaining data from remote store on-demand basis.
Since the size of these segment files can be large(max 5 gb in current configuration), downloading entire files in critical write/search path would incur higher latencies(depending on file sizes & network). To minimize the impact, it should support downloading smaller blocks of data based on access pattern (Block level fetch).

Possible configurations:

Non-dedicated warm node: All hot & warm shards reside on same node.
Dedicated warm node: All the warm shards would reside on dedicated warm nodes(nodes with a dedicated node role warm). This should help improve resource isolation between hot and warm shards and thus lower blast radius.

Scope:

Includes:

Ability to read/write on a warm index.
Support warm replicas.
File cache management.
Indexing backpressure.

Excludes(will be covered as a separate task):

Tier management & APIs
Shard rebalancing logic.
Cluster manager optimizations.
Optimizations on the search path.

Core Components:

TieredCache Node view (1) drawio (1)

Composite Directory: The Directory abstraction provides a set of common APIs for creating, deleting, and modifying files within the directory, regardless of the underlying storage mechanism. Composite Directory will abstract out all the data tiering logic and makes the lucene index(aka shard) data locality agnostic.

File Cache: File Cache will track the lucene files for all shards(using composite directory) at node level. Implementation for FileCache is already present (from searchable snapshot integration). We'll need to build support for additional features like cache eviction, file pinning, tracking hot indices, etc.

Index Inputs: For all read operations composite directory should be able to create/send different index inputs. For e.g:

IndexInput for local files.
Blocked IndexInput for remote files.
Non-block IndexInput for remote files.
IndexInput associated with a single downloaded block in FileCache.

Prefetcher: To maintain the optimal read & write performance, we want to cache some part of data to improve various shard level operations(read, write, flush, merges, etc). This component should be able to trigger prefetching of certain files/blocks when the shard opens or the replica refreshes.

User flows:

Indexing flow:
- Write requests are written to indexing buffer, segment files are created during refresh which are then synced to disk.
  - FileCache will start tracking the locally created files.
- Files are synced to remote store with each refresh.
  - After files are uploaded, Composite Directory updates the file state. Files are now marked as eligible for eviction.
  - Eventually we'd want the file sync logic to be abstracted behind the composite directory itself so that data transfer in both directions should be encapsulated in the directory.
Replication & Recovery flow:
- During recovery from remote store, we should only download the segment info and open the reader. All the required files are fetch at runtime(can be prefetched as an optimization)
Search flow:
- For any file read which is not present locally, File cache will download the block for reads at runtime.

High level changes (more details to be added in individual issues):

Create warm index and integrate with FileCache & composite directory (link)
Add replication & recovery flow for warm indices (replication, shard relocation, primary promotion).
File Cache eviction strategy.
Exposing cache stats & eviction metrics.
Supporting encryption for block based downloads.
Indexing flow optimizations (based on performance brenchmarking).
Prefetch optimizations.
Introducing Indexing backpressure for warm shard.
Separate read/write threadpool for warm indices(must have for non-dedicated setup).

Performance optimizations:

Cache header & footer blocks for each segment file
- Requires special handling for compound files.
- Block size(e.g. 8 MB) would be bigger than the header/footer(let’s say 32 bytes). If we’re to always retain header & footer blocks for each file in FileCache, we’ll be caching much more data that we actually need. We can separately handle the header & footer block download with smaller sizes.
Tune the TTL for newly created segment files to minimize redownloading the segment during merges(on best-effort basis)
Tuning for merges:
- Tune merge parameters separately for warm indices behind dedicated settings.
- Based on performance benchmarks, we can optimize toward 1. lowering the number of segments for optimal search performance, and 2. increasing the max segment sizes.

Other considerations:

Hot/warm migration strategy:

To be able to provide the best user experience for data tiering(hot/warm/cold), we should be able to seamlessly migrate the cluster from hot to warm. To support this, here are few key decisions:

For seamless transition from hot to warm, we'll can use CompositeDirectory & FileCache for hot files as well. From Composite Directory perspective, it should be able to operate with a dynamic index level setting which defines the percentage of data to retain on disk(100% being hot).
- This is useful when we're hosting the warm & hot indices on same node. For dedicated warm node setup, tracking hot files in FileCache doesn't bring any advantage.
- Would require additional efforts to ensure that the FileCache doesn't become the bottleneck for hot shards.
Similarly, we would be using the same engine as hot index(InternalEngine for primary & NRTReplicationEngine for replica). We might still need to change few characteristics for warm shard(e.g. refresh interval, translog rollover duration, merge policy thresholds, etc) Note: Migration strategy to be covered in a separate issue.

Supporting custom codecs

One caveat with prefetching is that to prefetch certain blocks you might need to read the files using methods which are not really exposed by the respective readers in lucene. Few examples:

To be able to fetch header & footer blocks for compound files, we'll need to read the entry file and identify all the index files & their offsets.
For optimizing search path, we might want to prefetch the StoredFields for certain documents which requires getting the file offset for the document ID. These use case require us to support methods which aren't exposed thus requiring us to maintain lucene version specific logic to support the prefetch operation. This restriction also breaks the compatibility of prefetch logic with custom plugins. To get around this, we can disable prefetch for custom codes but it'll essentially restricts users from using other codecs. Alternatively(preferred), we can ask the codec owners to support the prefetch logic. We can expose the required methods behind a new plugin interface that should be implemented by codec owner if they want to support warm indices.

Future Enhancements

Here are some enhancements to explore after the performance benchmarking.

Exploring optimal merge policy for warm shards. Either we can tune the existing TieredMergePolicy with separate merge settings for warm shard. Or, if required, we can explore a different policy that minimizes the number of segments on remote store.
Exploring other caching strategies instead of LRU. LRU isn't resistant to overscan(segment merges) and can potentially impact the cache effectiveness. We can explore other alternatives like ARC and LIRS(Thanks @Bukhtawar for recommendation)
We can track the working set for each shard and persist the information in remote store metadata. Working set for a shard can be the set of most used file blocks. During shard initialization/relocation, the working set can be downloaded before marking the shard as active.

FAQs:

Q: How are files evicted from disk? Files can be evicted from disk due to:

Cache overflow: Apart from node level usage, we can also track the shard level usage and evict older files.
TTL: Depending on use case, user might want either of below option:
- Reduce the shard footprint over time with TTL based evictions.
- Retain the files locally as long as the FileCache has available space.

Q: How would merges work for warm shard? Merges can require downloading the files from remote. Eventually we'd want to have support for offline merge on a dedicated node(related issue).

Q: How to determine the essential data to be fetched for a shard? We can actively track the most used block files based on FileCache statistics. This information can be retained in remote store which creates the set of essential files to be downloaded while initializing a shard.

@sohami @mch2 @neetikasinghal @andrross

Thanks for creating this.

File Cache: File Cache will track the lucene files for all shards(using composite directory) at node level. Implementation for FileCache is already present (from searchable snapshot integration). We'll need to build support for additional features like cache eviction, file pinning, tracking hot indices, etc.

Keeping file cache same for hot/warm indices sounds great but currently the cache has locking semantics. Probably that will impact the existing hot indices performance specially around search since write will mean creating new segments at refresh interval ? I think this will help us to decide if FileCache should track hot indices files as well or not. One benefit with common FileCache is to manage the entire local disk space dynamically between hot/warm indices instead of defining a static configs. But probably we can achieve that by keeping a common tracking component for disk usage between hot/warm indices. FileCache can manage warm indices related local files and provide mechanisms to trigger eviction when invoked by this tracking component along with exposing the usage statistics ? This can be later used by disk usage based backpressure monitors or other consumers like migration components to decide if more indices can be moved to warm tier or not (both dedicated vs non-dedicated setup).

Index Inputs: For all read operations composite directory should be able to create/send different index inputs. For e.g:

This will be dependent on index type or specific file of an index ? For hot if all data is locally present then it will use IndexInput. For warm if data is locally or downloaded on request time won't it use the same BlockIndexInput ?

Prefetcher: To maintain the optimal read & write performance, we want to cache some part of data to improve various shard level operations(read, write, flush, merges, etc). This component should be able to trigger prefetching of certain files/blocks when the shard opens or the replica refreshes.

Not clear if this component will be used to prefetch these files before making a shard active or it will be sort of async prefetch and shard will be activated irrespective of files are prefetched or not. One thing which will be useful would be to evaluate what all shard level data/metadata is must-have to make it active and ready to serve different type of requests like search/indexing and including requests for segment/index level stats. Otherwise these requests may end up waiting on the download/prefetch to complete (which could take time when multiple shards are performing prefetch) and eventually fail even though shard is deemed active.

For seamless transition from hot to warm, we'll can use CompositeDirectory & FileCache for hot files as well. From Composite Directory perspective, it should be able to operate with a dynamic index level setting which defines the percentage of data to retain on disk(100% being hot).

Users can still create hot remote store based indices without composite directory or we will make composite directory the default for such indices along with mechanism to migrate existing indices to use the composite directory ?

[Triage - attendees 1 2 3 4 5 6 7] @ankitkala Thanks for creating this RFC, looking forward to seeing how this evolves.

Keeping file cache same for hot/warm indices sounds great but currently the cache has locking semantics. Probably that will impact the existing hot indices performance specially around search since write will mean creating new segments at refresh interval ? I think this will help us to decide if FileCache should track hot indices files as well or not.

We'll need to move away from Segmented Cache with LRU level write lock. We can use File level locking mechanisms instead which will ensure that there isn't any contention.

One benefit with common FileCache is to manage the entire local disk space dynamically between hot/warm indices instead of defining a static configs. But probably we can achieve that by keeping a common tracking component for disk usage between hot/warm indices.

Regarding the common tracking component for disk usage, that is definitely one alternative we were considering. Can be discussed in more details when we start working on supporting warm indices without dedicated warm nodes.

FileCache can manage warm indices related local files and provide mechanisms to trigger eviction when invoked by this tracking component along with exposing the usage statistics ? This can be later used by disk usage based backpressure monitors or other consumers like migration components to decide if more indices can be moved to warm tier or not (both dedicated vs non-dedicated setup).

Makes sense. We're starting with File Cache tracking only warm files for now. Exposing cache stats is also planned already which can be consumed by mechanisms that you're mentioned above.

Index Inputs: For all read operations composite directory should be able to create/send different index inputs. For e.g:

This will be dependent on index type or specific file of an index ? For hot if all data is locally present then it will use IndexInput. For warm if data is locally or downloaded on request time won't it use the same BlockIndexInput ?

Yes, this mostly depends on the state of the file. Apart from the 2 IndexInput that you mentioned,

we need separate IndexInput which will be put in the FileCache. Index Input returned by the directory would be a wrapper over this (one index input should be able to prefetch multiple blocks which internally can be separate index inputs in file cache).
We also need IndexInput for newly created files to ensure that these are not evict-able unless the file has been uploaded to the remote store. We can discuss this in detail with each individual issue.

Prefetcher: To maintain the optimal read & write performance, we want to cache some part of data to improve various shard level operations(read, write, flush, merges, etc). This component should be able to trigger prefetching of certain files/blocks when the shard opens or the replica refreshes.

Not clear if this component will be used to prefetch these files before making a shard active or it will be sort of async prefetch and shard will be activated irrespective of files are prefetched or not. One thing which will be useful would be to evaluate what all shard level data/metadata is must-have to make it active and ready to serve different type of requests like search/indexing and including requests for segment/index level stats. Otherwise these requests may end up waiting on the download/prefetch to complete (which could take time when multiple shards are performing prefetch) and eventually fail even though shard is deemed active.

Intention was to encapsulate the prefetch related logic here. For essential data, we'll start with not prefetching anything and then iterating on it based on performance benchmarking. This should help in deterministically define the essential files to prefetch before marking shard as active.

For seamless transition from hot to warm, we'll can use CompositeDirectory & FileCache for hot files as well. From Composite Directory perspective, it should be able to operate with a dynamic index level setting which defines the percentage of data to retain on disk(100% being hot).

Users can still create hot remote store based indices without composite directory or we will make composite directory the default for such indices along with mechanism to migrate existing indices to use the composite directory ?

We'll need to make composite directory as the default if we really want to support seamless hot/warm migration. One way to do this can be behind a cluster level setting which enables this for all hot remote backed indices.

Regarding the common tracking component for disk usage, that is definitely one alternative we were considering.

From my understanding, having default composite directory for hot/warm indices vs using FileCache for both seems to be 2 different independent items. I am fine with keeping composite as default (while allowing existing indices to seamlessly move towards that). However, for FileCache lets call out this alternative and we can evaluate both of these. With file level locks there could be additional overhead of locks per file and managing those (probably some benchmark will help to make these choices). So if the other alternative doesn't have any serious cons then we can use that.

Yes, this mostly depends on the state of the file. Apart from the 2 IndexInput that you mentioned,

we need separate IndexInput which will be put in the FileCache. Index Input returned by the directory would be a wrapper over this (one index input should be able to prefetch multiple blocks which internally can be separate index inputs in file cache).

I was assuming BlockIndexInput or the existing FileCachedIndexInput is the one which is essentially a wrapper in FileCache over block level files. IndexInput being able to prefetch multiple blocks probably seems to be some sort of optimizations for merge case ? Otherwise, merge process should just go over different sections of the segment files and download on demand and perform the merge as it goes since from lucene side the contracts are defined in terms of Directory and IndexInputs/Outputs for any IO.

We also need IndexInput for newly created files to ensure that these are not evict-able unless the file has been uploaded to the remote store.

Got it. This seems more of a FileCache semantic, to provide pinning mechanism for some of the key. Can it be implemented in the cache itself instead of exposing to IndexInput layer. That way the mechanism will be generic and independent on the value type stored in the cache.

We can discuss this in detail with each individual issue.

Sure

@ankitkala @sohami Regarding the locking semantics of the FileCache, we need to ensure we're not over-designing here. Conceptually, the file cache is an in-memory structure that maintains a small amount of state for each file on disk (a size, a reference count, some kind of ordering to implement LRU, etc). We have a microbenchmark that shows we can get over 1000 operations per millisecond out of a single segment of this cache. My intuition is that the overhead of maintaining the file cache accounting is pretty negligible in the grand scheme of things.

For optimizing search path, we might want to prefetch the StoredFields for certain documents which requires getting the file offset for the document ID. These use case require us to support methods which aren't exposed thus requiring us to maintain lucene version specific logic to support the prefetch operation.

@ankitkala will there be a new codec that will be added to support writable warm tier? By reading this, I feel we might be adding new codec which will for most of the code use the Lucene Codec but will also expose functions for doing prefetch. Is this understanding correct?

opensearch-project / OpenSearch