opensearch-project / OpenSearch


[RFC] Reindex-from-Snapshot (RFS) #12667

Closed: chelma closed this issue 1 week ago

chelma commented 6 months ago

What/Why

In this RFC we propose to develop new, standardized tooling that will enable OpenSearch users to more easily move data between platforms and versions. The phrase we’ve been using to describe this approach so far is Reindex-from-Snapshot (RFS). At a high level, the idea is to take a snapshot of the source cluster, pull the source docs directly from the snapshot, and re-ingest them on the target using its REST API. This would add a third option alongside traditional snapshot/restore and using the reindex REST API on the source cluster. It is hoped that this approach will combine many of the best aspects of the two existing methods of data movement while avoiding some of their drawbacks.

Which users have asked for this feature?

Many users need to maintain a full copy of historical data in their ES/OS cluster (e.g. the “Search” use-case). For these users, moving to a new cluster entails moving that data, ideally with as little effort and as little impact on the source cluster as possible. Improving this process will make it easier for users to upgrade their cluster versions and move to new platforms.

What problem are you trying to solve?

When moving historical data from a source cluster to a target cluster, there are two broad aspects that need to be considered: (1) moving the data to the target efficiently and (2) ensuring the moved data is usable on the target. (2) is especially important when the target cluster’s ES/OS major version does not match the source cluster. The two existing data movement solutions both have tradeoffs with regards to these aspects, which we will briefly explore below.

The goal of this proposal is to have a data movement solution that is both efficient and able to ensure the data is usable on the target, even if the target is beyond the Lucene backwards compatibility limit.

Existing Solution: Snapshot/Restore

Snapshot/Restore makes a copy of the source cluster at the filesystem level and packs the copy into a format that can be more easily stored where the user desires (on disk, in network storage, in the cloud, etc.). When the user wants to restore the copy, the target cluster has its nodes retrieve the portions of the snapshot relevant to them, unpack them locally, and load them via Lucene. The data in the snapshot is not re-ingested by ES/OS.

Snapshot/Restore is better at moving the data efficiently. It reduces strain on the source cluster by avoiding the cost of pulling all historical data through the REST API and enables the target cluster to stand up without having to re-ingest the historical data through its REST API. As a side benefit, it also enables users to rehydrate a cluster from a backup in cold storage. However, it is not an option if the major version of the target is more than a single increment ahead of the source (a limitation driven by Lucene backwards compatibility, see [1]). Additionally, once data has been moved to a newer major version using Snapshot/Restore (e.g. from v1.X to v2.X), it must be reindexed before it can be moved to an even newer major version (e.g. from v2.X to v3.X); this is also due to Lucene backwards compatibility.
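For concreteness, here is a minimal sketch of that flow against the cluster REST APIs (illustrative only, not part of the proposal; the cluster hosts, repository name, and filesystem path are placeholders, and both clusters must be able to reach the same repository storage):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SnapshotRestoreSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // 1. Register a filesystem snapshot repository on the source cluster.
        send(client, "PUT", "http://source-cluster:9200/_snapshot/migration_repo",
                "{\"type\":\"fs\",\"settings\":{\"location\":\"/mnt/snapshots/migration_repo\"}}");

        // 2. Snapshot all indices and wait for the snapshot to complete.
        send(client, "PUT",
                "http://source-cluster:9200/_snapshot/migration_repo/snapshot_1?wait_for_completion=true",
                "{}");

        // 3. Restore the snapshot on the target cluster (which must have the same
        //    repository registered); the shard files are copied, not re-ingested.
        send(client, "POST",
                "http://target-cluster:9200/_snapshot/migration_repo/snapshot_1/_restore",
                "{\"indices\":\"*\"}");
    }

    private static void send(HttpClient client, String method, String url, String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .method(method, HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```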

Existing Solution: Reindex REST API

The Reindex REST API allows users to set up an operation on the source cluster that will cause it to send all documents in the specified indices to a target cluster for re-ingestion. Some versions of ES/OS support the option of parallelizing this process within a given index using sliced scrolls [2]. This process operates at the application layer on both the source and target clusters.

The Reindex REST API on the source cluster is useful when the user needs to move to a target cluster beyond the Lucene backwards compatibility limit, or when snapshot/restore otherwise isn’t an option. Re-ingesting the source documents on the target using the reindex API bypasses the backwards compatibility issue by creating new Lucene indices on the target instead of trying to read the source Lucene indices in the target cluster. However, the faster you perform the data movement (such as by using sliced scrolls), the greater the impact on the ability of the source cluster to serve production traffic. Additionally, having to operate at the application layer means that the overhead of the distributed system comes into play rather than being able to transfer data at the filesystem level. Finally, the Reindex REST API is only usable for an index if the source cluster has retained an original copy of every document in that index via the _source feature [3].
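To make this concrete, here is a minimal sketch of a reindex request that pulls documents from a remote cluster and re-ingests them locally (illustrative only; host and index names are placeholders, and `source.size` is just the per-scroll batch size):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RemoteReindexSketch {
    public static void main(String[] args) throws Exception {
        // The cluster receiving this request scrolls the documents out of the
        // remote cluster and re-ingests them into its own index.
        String body = """
            {
              "source": {
                "remote": { "host": "http://source-cluster:9200" },
                "index": "my-index",
                "size": 1000
              },
              "dest": { "index": "my-index" }
            }
            """;
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://target-cluster:9200/_reindex?wait_for_completion=false"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // returns a task ID that can be polled via the _tasks API
    }
}
```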

[1] https://opensearch.org/docs/2.12/install-and-configure/upgrade-opensearch/index/#compatibility
[2] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/docs-reindex.html#docs-reindex-slice
[3] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/docs-reindex.html

How would the proposal solve that problem?

The core idea of RFS is to write a set of tooling to perform the following workflow:

  1. Take a snapshot of the source cluster into a network-accessible repository.
  2. Parse the snapshot and pull down the portions relevant to each shard.
  3. Unpack them to their original Lucene file format.
  4. Use Lucene to read the shard’s files as a Lucene Index and pull the original source documents from the _source field [1].
  5. Reindex those documents against the target cluster via its REST API.

This approach appears to combine the benefits of both snapshot/restore and using the reindex API on the source cluster. Most importantly, it removes the strain on the source cluster during data movement while also bypassing the Lucene backwards compatibility limit by re-ingesting the data on the target cluster. Additionally, it allows users to rehydrate a historical cluster’s dataset from a snapshot, potentially reducing the cost of a data movement by removing the need to have the source cluster running. Finally, it opens up the possibility of skipping the ES/OS snapshot entirely and operating directly from disk images (such as an EBS Volume snapshot), which some users have already had success with (see [2]).

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/mapping-source-field.html
[2] https://blogs.oracle.com/cloud-infrastructure/post/behind-the-scenes-opensearch-with-oci-vertical-scaling

What could the user experience be like?

The user experience could be as follows:

  1. The user creates a new target cluster of the desired ES/OS version on the desired platform. They would likely want to initially set the replica count to 0 to speed up ingestion by avoiding the need to copy each document from a shard’s primary to its replicas (see the sketch after this list). They may also wish to temporarily spin up a set of dedicated ingest nodes [1] to further speed up writes.
  2. The user registers a new, empty snapshot repository on their source cluster. Making a new repository simplifies the process of parsing a snapshot’s details from the repository and increases the speed of creating the snapshot, as there is no need to consider or perform incremental snapshots. This repository would be in a network-accessible location so that the RFS tooling could easily access its contents.
  3. The user makes a snapshot of the source cluster into the new, empty repository.
  4. The user passes the RFS tooling the location/connection details/access permissions for both the snapshot and the target cluster, and tells it to “go”.
  5. RFS will perform the steps required to parse the snapshot, retrieve the portions relevant at a given time, extract the _source documents from retrieved snapshot portions, and reindex them against the target cluster (as explained in more detail above).
  6. The user reconfigures the target cluster to their “production” settings (e.g. adding replicas back and removing the additional ingest nodes).

[1] https://www.elastic.co/guide/en/elasticsearch/reference/7.10/ingest.html
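A minimal sketch of the target-side setting changes in steps 1 and 6 of this walkthrough (illustrative only; the cluster host, index name, and final replica count are placeholders, and a real setup would also supply mappings):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TargetClusterPrepSketch {
    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    public static void main(String[] args) throws Exception {
        // Before the migration: create the target index with no replicas so each
        // document is only written once during the bulk re-ingestion.
        put("http://target-cluster:9200/my-index",
                "{\"settings\":{\"index\":{\"number_of_replicas\":0}}}");

        // ... the RFS tooling would run here ...

        // After the migration: switch back to "production" settings by adding replicas.
        put("http://target-cluster:9200/my-index/_settings",
                "{\"index\":{\"number_of_replicas\":1}}");
    }

    private static void put(String url, String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```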

What are the limitations/drawbacks of the proposal?

Community feedback is greatly welcomed on this point, but the following occur to the author:

[1] https://github.com/elastic/elasticsearch/blob/6.8/server/src/main/java/org/elasticsearch/node/Node.java#L425

Have you tested this approach?

Yes. The author has some proof-of-concept scripts working that will take an Elasticsearch 6.8.23 snapshot [1] and move its contents to an OpenSearch 2.11 target cluster. The author also has another version of the scripts [2] that unpacks an Elasticsearch 7.10.2 snapshot and performs replay against an Elasticsearch 7.10.2 target. As PoCs, these obviously need more development to be production-ready.

[1] https://github.com/chelma/reindex-from-snapshot/tree/6.8
[2] https://github.com/chelma/reindex-from-snapshot/tree/7.10

Bukhtawar commented 6 months ago

We probably don’t need to restore the full data on disk; we could do block-level fetches to get documents in parts, extract the _source field, and then re-index into the target. A better alternative that is being brainstormed is running a merge operation on existing segments with a new version of IndexWriter to re-create the segments in a newer version. From a quick glance this RFC is more of an operational tool, with tonnes of opportunity to solve the actual problem.

sharraj commented 6 months ago

Hope this can be implemented as an open-source library which different clients can use in their own frameworks to migrate data.

shwetathareja commented 6 months ago

Thanks @chelma for the proposal

Unpack them to their original Lucene file format
Use Lucene to read the shard’s files as a Lucene Index and pull the original source documents from the _source field [1]

Are you saying that instead of restoring an index completely, we restore only the _source field and then trigger reindexing on the target cluster?

Also, what is the main problem we are trying to solve here - a version upgrade, or customers wanting to change the mappings/fields getting indexed? If it is the former, i.e. mainly for upgrades, there are other approaches which may be more efficient and which we are planning to experiment with, like @Bukhtawar pointed out: instead of paying the cost of parsing the documents and then generating the Lucene data structures on the target index during re-ingestion, it would directly trigger an upgrade of the Lucene segments using a new-version IndexWriter via a pseudo merge. If it is the latter problem, where the index mappings are changed, then this could be useful.
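(For illustration only: the pseudo-merge idea described here is roughly what Lucene's stock IndexUpgrader already does for a single major-version hop, namely rewriting every segment with the current IndexWriter. A minimal sketch, not the planned implementation:)

```java
import org.apache.lucene.index.IndexUpgrader;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Path;

public class SegmentUpgradeSketch {
    public static void main(String[] args) throws Exception {
        // Rewrites every segment in the given Lucene index directory using the
        // current (newer) Lucene version's IndexWriter, one major version at a time.
        try (FSDirectory dir = FSDirectory.open(Path.of(args[0]))) {
            new IndexUpgrader(dir).upgrade();
        }
    }
}
```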

What are the changes you expect in the core to implement this? Do you intend to write a separate plugin here?

Also, I am thinking that offering snapshot restore with just the _source field would help users in general who are looking to reindex and don't want to impact the source cluster where the indices are hosted.

chelma commented 6 months ago

@Bukhtawar @shwetathareja

There are a few overlapping problems we're trying to solve that have combined to lead us to this particular approach. I'll try to outline our problem set; hope it makes sense. Keep in mind that not every data movement requires all of the following points, but our solution should address them all to support more complex situations.

  1. Enable Elasticsearch/OpenSearch users running clusters on-prem or self-managed on a cloud hosting provider (e.g. EC2) to move to a new "location". This entails shifting the historical data and ongoing traffic to a new cluster, which may or may not be on the previous platform (on-prem, self-managed cloud, managed cloud).
  2. Enable Elasticsearch/OpenSearch users running clusters on-prem or self-managed on a cloud hosting provider (e.g. EC2) to move to a different version of Elasticsearch/OpenSearch - either an upgrade, a downgrade, or a "side-grade". The initial focus is on upgrades and side-grades. An example would be going from ES 6.8 to OS 2.11 (upgrade), or maybe ES 7.17 to OS 1.3 (side-grade).
  3. To support (2), provide the opportunity and ability for users to transform their data and configuration, as appropriate, from the source version to something compatible with the target version.
  4. Minimize the operational risk and financial cost of performing the movement. A typical movement, as we see it, likely requires a user to begin caching their existing traffic to the source, move their historical data to the target, replay the cached traffic against the target, feed new traffic to both source/target, then cut over. According to our research, it becomes very expensive very quickly to perform this caching. Reducing the overall time to perform the movement is therefore a big win, as it reduces both risk and cost. Removing impact to the source cluster's ability to serve traffic is also a win on both fronts.
  5. Enable single-hop movements. Because these movements are so expensive (effort, money), the more you can perform in a single movement, the better. Being able to switch platforms while also jumping several major versions at the same time is much better than iterative solutions. Each iteration is just more time for things to go wrong, funding to be pulled, the business situation to change, etc.

Based on our research/understanding, the existing options for such movement are not ideal on one or more of those fronts. I discussed Snapshot/Restore and the Reindex REST API in the RFC above, but probably should have talked about iterative Lucene-level singleton merges as well. While iterative Lucene singleton merges do allow you to progress up through the version chain, from what I understand addressing ES/OS/Lucene feature incompatibilities or changes at the Lucene level (manipulating the raw Lucene index) will be much harder than at the Elasticsearch/OpenSearch level (manipulating JSON, for the most part). It will also be harder to include the user in that process because it will require much more expertise to perform. Including the user in the process seems valuable because it might not be obvious what the "right" answer for a transformation is. Additionally, iterative singleton merges are, definitionally, iterative - you have to do some work for each major version hop (or maybe every other?), instead of going directly to the version you want. As a hypothetical, what would the user experience be like going from ES 5.X to OS 2.X with singleton merges?

To answer another question - our current thought for the first cut of this is to make tooling completely outside the ES/OS stack and cluster to perform this work (no change to the existing OS codebase). This helps facilitate (4) above, by enabling greater parallelism without impacting the source cluster and increasing the speed of the movement. E.g. have a fleet of workers each be responsible for a single shard of a single index in the snapshot, pull the portion of the snapshot related to that shard, extract the _source, perform a transformation, and reindex on the temporarily scaled-up target. Fanning out to the shard level has major advantages, as the size of each shard is self-limiting and it decouples the duration of historical migration from the size of the overall data set being migrated; it only takes as long as the biggest shard. In this scenario (not running on the source/target nodes, a "capped" size to the work per worker, large costs associated with duration of historical data migration), the CPU/IO/network bandwidth usage of the workers appears to be a less important consideration compared to the other factors.
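To make the fan-out shape concrete, here is a rough sketch of the worker-per-shard dispatch (names, the thread-pool size, and the migrateShard body are placeholders, not the actual tooling; in practice the workers could just as well be separate processes or containers):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShardFanOutSketch {
    // One unit of work: a single shard of a single index within the snapshot.
    record ShardWorkItem(String indexName, int shardId) {}

    public static void main(String[] args) throws InterruptedException {
        // In real tooling this list would come from the parsed snapshot metadata.
        List<ShardWorkItem> workItems = List.of(
                new ShardWorkItem("logs-2023", 0),
                new ShardWorkItem("logs-2023", 1),
                new ShardWorkItem("products", 0));

        ExecutorService workers = Executors.newFixedThreadPool(4);
        for (ShardWorkItem item : workItems) {
            workers.submit(() -> migrateShard(item));
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.HOURS);
    }

    private static void migrateShard(ShardWorkItem item) {
        // Placeholder: pull the shard's files from the snapshot, unpack them,
        // read the _source docs via Lucene, transform, and _bulk them to the target.
        System.out.println("migrating " + item);
    }
}
```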

That said - skipping the snapshot part entirely and just mounting a virtual disk from a filesystem image of the source cluster is a great optimization which we had planned to explore in later iterations. We didn't propose it as part of the initial cut because snapshots provide a lingua franca that all users can speak, regardless of what platform they are starting on. This allows us to hit a broader portion of the community with our initial efforts, at the cost of some additional CPU/IO/bandwidth on the workers.

I hope that all makes sense, and I'm really curious to get your thoughts!

chelma commented 6 months ago

@shwetathareja

Are you saying that instead of restoring an index completely, we restore only the _source field and then trigger reindexing on the target cluster?

Realized I didn't respond to this question in the previous post. Yes, that is the idea. AFAIK, if the goal is to reindex, the _source doc is all you need. The best way I know to get those currently is to pull down each shard, treat it as a separate Lucene index, and extract the _source field from each doc in Lucene. We then batch the contents of the _source fields paired with the _id field into bulk PUTs to the target cluster and ship them along. The target cluster will then reindex them. FWIW, the Lucene side of this has proven pretty straightforward so far in my prototyping.
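For readers who want to see what that Lucene-level read could look like, here is a minimal sketch (not the PoC code; the directory layout and the _id handling are simplified, and the bulk step is left as a comment):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.BytesRef;

import java.nio.file.Path;

public class ShardSourceDump {
    public static void main(String[] args) throws Exception {
        // args[0]: a shard's Lucene directory, already unpacked from the snapshot.
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Path.of(args[0])))) {
            for (LeafReaderContext leafCtx : reader.leaves()) {
                Bits liveDocs = leafCtx.reader().getLiveDocs(); // null => no deletions
                for (int docId = 0; docId < leafCtx.reader().maxDoc(); docId++) {
                    if (liveDocs != null && !liveDocs.get(docId)) continue; // skip deleted docs
                    // (newer Lucene versions expose this via reader.storedFields() instead)
                    Document doc = leafCtx.reader().document(docId);
                    BytesRef source = doc.getBinaryValue("_source"); // the original JSON bytes
                    BytesRef id = doc.getBinaryValue("_id");         // stored in an ES-specific binary encoding
                    if (source == null) continue; // index had _source disabled
                    // A real tool would decode _id, pair it with the JSON, and batch the
                    // pairs into _bulk index requests against the target cluster.
                    System.out.println(source.utf8ToString());
                }
            }
        }
    }
}
```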

andrross commented 6 months ago

[Triage] @chelma Thanks for filing, looking forward to seeing more discussion here.

shwetathareja commented 6 months ago

Thanks @chelma for your detailed response. I agree with the different complex situations that you called out above during migrations, and we need tooling that can address them.

As a hypothetical, what would the user experience be like going from ES 5.X to OS 2.X with singleton merges?

With iterative singleton merges, these merges have to be run for every major version. Hence, it would be an iterative process that could be running in the background based on the merge policy and other configurations.

our current thought for the first cut of this is to make tooling completely outside the ES/OS stack and cluster to perform this work (no change to the existing OS codebase).

👍

Fanning out to the shard level has major advantages, as the size of each shard is self-limiting and it decouples the duration of historical migration from the size of the overall data set being migrated; it only takes as long as the biggest shard.

Parallelization at the shard level would definitely speed up the historical data migration significantly. I am seeing this proposal as a one-click (fast) migration of historical data (where snapshot restore can't work due to certain requirements) without requiring the user to set up a cluster.

The best way I know to get those currently is to pull down each shard, treat it as a separate Lucene index, and extract the _source field from each doc in Lucene. We then batch the contents of the _source fields paired with the _id field into bulk PUTs to the target cluster and ship them along.

I would like to see performance numbers at a shard level for this vs. the reindex API, where it uses scroll to get documents. We can take this opportunity to improve the existing reindex API based on the learnings from this.

chelma commented 6 months ago

Capturing an offline convo w/ @Bukhtawar for posterity:

chelma commented 6 months ago

@shwetathareja Some thoughts/responses:

With iterative singleton merges, these merges have to be run for every major version. Hence, it would be an iterative process that could be running in the background based on the merge policy and other configurations.

That part makes sense; I'd be curious to hear your thoughts on what to do when a change in Elasticsearch/OpenSearch/Lucene behavior or features between major versions requires some manual effort to transform the data in the source to a format that matches the target. Maybe there's a type change, for example. Definitely not an expert on this topic, but people bring up stuff like geopoints, which changed their representation. AFAIK, there have been other changes where some editorial discretion is required to navigate the upgrade. That brings up whether the user needs to be involved in the process. Do you know if this angle has been explored? It's something we're starting to think about, and it would be awesome to hear that the problem is either easier than we currently believe or that there's prior art we can avail ourselves of.

I would like to see performance numbers at a shard level for this vs. the reindex API, where it uses scroll to get documents. We can take this opportunity to improve the existing reindex API based on the learnings from this.

As mentioned above, @Bukhtawar had a great point that we can split work beyond the shard level to portions of the _source docs in a shard. I think that has the potential to make a big impact on the performance by increasing the possible parallelization while reducing the storage needed on the workers and the quantity of work they actually need to perform. Will keep you posted as we get further along and get to testing relative performance.
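A tiny sketch of one way that sub-shard split could work (purely illustrative, not a committed design): carve a shard's Lucene docId space into contiguous ranges so several workers can share one large shard.

```java
import java.util.ArrayList;
import java.util.List;

public class DocRangeSplitter {
    // A half-open [start, end) range of Lucene docIds within one shard.
    record DocRange(int start, int end) {}

    static List<DocRange> split(int maxDoc, int numWorkers) {
        List<DocRange> ranges = new ArrayList<>();
        int base = maxDoc / numWorkers;
        int remainder = maxDoc % numWorkers;
        int start = 0;
        for (int i = 0; i < numWorkers; i++) {
            int size = base + (i < remainder ? 1 : 0); // spread the remainder evenly
            ranges.add(new DocRange(start, start + size));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // e.g. a 10M-doc shard split across 4 workers
        split(10_000_000, 4).forEach(System.out::println);
    }
}
```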

shwetathareja commented 6 months ago

when a change in Elasticsearch/OpenSearch/Lucene behavior or features between major versions requires some manual effort to transform the data in the source to a format that matches the target

@chelma this banks on the fact that Lucene itself provides backwards compatibility with the last major version. Hence, when the iterative empty merge runs for the segments, it creates the new Lucene files with the new IndexWriter. If there is a type change, it depends where the breaking change lies: in the software handling in OpenSearch, or in the underlying storage format for Lucene. Both provide backwards compatibility with the last major version. This idea has not been POCed yet. I will open a tracking GitHub issue for this.

peternied commented 1 week ago

@chelma Congratulations, RFS is implemented and part of the Migration Assistant. With this in mind, I am going to close out this RFC. Thanks everyone for your comments.