Is your feature request related to a problem? Please describe

This proposal aims at presenting the different design choices for tiering the index from hot to warm where warm indices are writable and the proposed design choice. The proposal described below is related to the RFC https://github.com/opensearch-project/OpenSearch/issues/12809. The tiering APIs to provide the customer experience have already been discussed in https://github.com/opensearch-project/OpenSearch/issues/12501

Describe the solution you'd like

Hot to Warm Tiering

API: POST /<indexNameOrPattern>/_tier/_warm

{
 "" : ""  // supporting body to make it extensible for future use-cases
}

Response:

Success:

{
"acknowledged": true
}

Failure:

{
    "error": {
        "root_cause": [
            {
                "type": "",
                "reason": "",
            }
        ],
    },
    "status": xxx
}

There are two cases presented in each of the below design choices - Case 1: Dedicated setup - cluster with dedicated warm nodes Case 2: Non-Dedicated/Shared node setup - cluster without the dedicated warm nodes

DESIGN 1: Requests served via cluster manager node with cluster state entry and push notification from shards

In this design, custom cluster state entry is introduced to store the metadata of the indices under-going tiering. In the non-dedicated setup case, the hot-warm node listens to the cluster state update and on detecting the change for index.store.data_locality (introduced in the PR) change from FULL to PARTIAL, the shard level updates on the composite directory is triggered. Allocation routing settings help in relocating the shards from data node to search dedicated nodes. The index locality setting value set to partial helps in initializing the shard on the dedicated node as a PARTIAL shard.

Pros

Master is able to deduplicate the requests by checking the cluster state entry
Cluster state entry is useful in tracking the status of the tiering
On master flips, the accepted tiering requests are not lost as the custom cluster state is replicated across all the master nodes

Cons

There will be a limitation in the number of indices supported for migration at a time depending on the space occupied by each cluster state entry and the number of cluster state updates.
On the master going down, the notification from the shard can be missed which could mark the migration as stuck while its actually completed in the non-dedicated nodes setup.
Accepted migration requests are lost on the failure or reboot of all the master nodes
Its difficult for TieringService to track failures and trigger retires in the shared node setup case and might need another cluster state update for retries. Or the other way would be to have an API to trigger retries for the data locality update that exposes two ways of doing the same thing - one via detecting a cluster state change and other via an API.

(Preferred) DESIGN 2: Requests served via cluster manager node with no cluster state entry and internal API call for shard level update

In this design, there is no custom cluster state entry stored in the cluster state. In case of dedicated warm nodes setup, TieringService adds allocation settings to the index (index.routing.allocation.require._role : search, index.routing.allocation.exclude._role : data) along with the other tiering related settings as shown in the diagram below. When the re-route is triggered, the allocation deciders run and decides to relocate the shard from hot node to warm node. index.store.data_locality: PARTIAL helps in initializing the shard in the PARTIAL mode on the warm node during shard relocation. In non-dedicated setup, there is a new API (more details will be provided in a separate issue) that is used to trigger the shard level update on the Composite Directory. This api can also be used for retrying in case of failures.

To track the shard level details of the tiering, the status api would give more insights on relocation status in case of dedicated nodes setup and shard level updates in the non-dedicated setup. More details on the status API would be covered in the follow-up issue.

Pros

No cluster state entry is kept for migrations, so there is some space we save on not keeping the custom cluster state
On master flips or failure or reboot of all master nodes, the accepted migration requests are not lost
Failures or timeouts are easily tracked via tiering service and retries are triggered via the same API used for data locality update.

Cons

There could be still a limit in the number of parallel migrations supported at a time as there are still 2 or more cluster states happening per requested index
Since there is no custom cluster state entry, for checking the indices under-going migration, the entire list of indices need to be traversed to find the in progress indices. However, this can be optimized by keeping in-memory metadata at the master node that needs to be re-computed on master switch or restart.

DESIGN 3: Requests served via cluster manager node with no Cluster state entry and Polling Service

In this design, instead of relying on the notification from the shards, there is a polling service that runs periodically and checks for the status of the shards of the indices under-going tiering.

About Tiering Polling Service Tiering polling service is a service that runs on the master node on a schedule after every x seconds defined by the interval settings. There can be a cluster wide dynamic setting to configure the value of the interval of polling. The polling interval can be different for dedicated and shared node setup as the migrations are expected to be faster in the shared node setup. Polling service begins by checking if there are any in progress migrations. If there are no in progress migrations, the polling service doesn’t do anything. If there are any in progress index migrations, the polling service calls an API on the composite directory to check for each of the shards status for the in progress migrations. On success of all the shards for the indices, the polling service updates the index settings of the successful indices.

The caveat with this design is that that with multiple indices in the in-progress state, the polling service has to call the status API to check the status for all shards of the in-progress indices. This could contain the shards of the indices which were already successful in the previous run of the polling service, however since one or two shards were still in the relocating state, the status check has to be re-done in the next run. To optimize this, we can store some in-memory metadata to save the information of the in-progress indices and the status of the shards in the indices. However, this metadata will be lost on a master switch or a master going down. Design 4 tries to deal with this limitation.

Pros

On master going down, the migration state will still be updated as it doesn’t rely on the notification. This would rule out the cases which are successful but marked stuck.
Polling service can club the cluster state updates for indices that had an update in the given interval

Cons

On one shard in the stuck state, the polling service needs to trigger an API call for all the other shards which are already successful and found successful in the previous run.
Determining the polling interval can be tricky and needs performance benchmarking to have the intervals. This could vary for different domains depending on the migration traffic.
Since there is no custom cluster state entry, for checking the indices under-going migration, the entire list of indices need to be traversed to find the in progress indices. However, this can be optimized by keeping in-memory metadata at the master node that needs to be re-computed on master switch or restart.

DESIGN 4: Requests served via cluster manager node with Cluster state entry and Polling Service

This design is similar to design 3 except that there is a custom cluster state entry to store the in progress migrations to prevent the need to keep the local metadata on the master node and avoid re-computation of the metadata on the master node switch.

Pros

On a master switch, the custom cluster state is still persisted which prevents the status of the shards to be re-checked
On master going down, the migration state will still be updated as it doesn’t rely on the notification. This would rule out the cases which are successful but marked stuck.
The cluster state entry can be used to give the status of the in progress migrations to the user
Polling service can club the cluster state updates for indices that had an update in the given interval

Cons

The cluster state stores the shard level status which can increase the overall number of cluster state updates per requested index
There would be a limitation in the number of in progress migration entries supported at time to prevent cluster state hogging up.
On the failure/restart of all the master nodes, the custom cluster state is lost and needs to be re-computed.
Determining the polling interval can be tricky and needs performance benchmarking to have the intervals. This could vary for different domains depending on the migration traffic.

Open Questions and next steps

List down the set of validations to be done for validating the request and check if these can be done on the co-ordinator node instead of master node.
Determine the number of parallel migrations that can be supported at a time for the chosen design
More details to follow on the Warm to Hot Tiering and the Tracking the status of the tiering

Related component

Search:Remote Search

Would love to get feedback from the community for this: @andrross @sohami @reta @mch2 @shwetathareja @Bukhtawar @rohin @ankitkala

@neetikasinghal thanks for sharing the different design options.

nitpick : Please use ClusterManager in place of Master terminology.

Trying to understand Option 2 which is your preferred choice:

While index.store.data_locality helps in initializing the shard in the PARTIAL mode in case of the dedicated nodes setup, its not used as a trigger in the non-dedicated setup. In non-dedicated setup, there is a new API

Dedicated here refers to dedicated warm node? Ideally, the tiering logic shouldn't be tightly coupled with cluster configuration whether it has dedicated warm nodes or not. Dedicated warm node basically dictate shard allocation logic on whether a shard stays on the current node post tiering or move to another node as it is not eligible for that node anymore

Since there is no custom cluster state entry, for checking the indices under-going migration, the entire list of indices need to be traversed to find the in progress indices. However, this can be optimized by keeping in-memory metadata at the master node that needs to be re-computed on master switch or restart.

when is this accessed, at the time of migration status API or otherwise as well ?

thanks @shwetathareja for your feedback.

Trying to understand Option 2 which is your preferred choice:

While index.store.data_locality helps in initializing the shard in the PARTIAL mode in case of the dedicated nodes setup, its not used as a trigger in the non-dedicated setup. In non-dedicated setup, there is a new API

Dedicated here refers to dedicated warm node? Ideally, the tiering logic shouldn't be tightly coupled with cluster configuration whether it has dedicated warm nodes or not. Dedicated warm node basically dictate shard allocation logic on whether a shard stays on the current node post tiering or move to another node as it is not eligible for that node anymore

Yes dedicated refers to dedicated warm nodes. In case of dedicated warm nodes setup, TieringService adds allocation settings to the index (index.routing.allocation.require._role : search, index.routing.allocation.exclude._role : data) along with the other tiering related settings as shown in the diagram above for design 2. When the re-route is triggered, the allocation deciders run and decides to relocate the shard from hot node to warm node. index.store.data_locality: PARTIAL helps in initializing the shard in the PARTIAL mode on the warm node during shard relocation. Hope this clarifies, let me know if you have further questions.

Since there is no custom cluster state entry, for checking the indices under-going migration, the entire list of indices need to be traversed to find the in progress indices. However, this can be optimized by keeping in-memory metadata at the master node that needs to be re-computed on master switch or restart.

when is this accessed, at the time of migration status API or otherwise as well ?

Correct, this is accessed at the time of tiering status API.

Focusing on the option 2:

No cluster state entry is kept for migrations, so there is some space we save on not keeping the custom cluster state

Do you have any idea what the magnitude of space saving here would be? New Index settings themselves are already in the cluster state so it seems like this would be pretty close to a net 0 change. Also the in-memory metadata you are proposing under the "Cons" section seems like it would basically cancel out any savings here.

Since there is no custom cluster state entry, for checking the indices under-going migration, the entire list of indices need to be traversed to find the in progress indices. However, this can be optimized by keeping in-memory metadata at the master node that needs to be re-computed on master switch or restart.

This might be too in the details for the purposes of this issue, but it seems to me that you would need some sort of process to constantly iterate over all of the indices in order to monitor the migration status (which I understand is basically what option 3 is describing). At a high level, how would you know if something is not working for one of your state transitions or how would handle state transition failures? For example if there is some sort of allocation decider issue preventing shard relocation then how would we handle that?

Aside from this, it would be really helpful to add some reasons for why option 2 is preferred since (at least for me) it's hard to understand the judgement call from just the list of pros/cons.

Thanks for the proposal @neetikasinghal I agree with approach 2 of using internal API calls for shard level update. Like you pointed out, it should be easier to track failures and setup retries with this approach.

Few question:

On master flips or failure or reboot of all master nodes, the accepted migration requests are not lost

How are we ensuring this with Option 2?

Also, since we aren't maintaining any metadata now, I was wondering how would we figure out which shards are warm? Yes, we can check the index level setting. But i wanted to check your thought on how to support something like _cat/indices/_warm. Do we leverage the in-memory data on cluster manager node?

Focusing on the option 2:

No cluster state entry is kept for migrations, so there is some space we save on not keeping the custom cluster state

Do you have any idea what the magnitude of space saving here would be? New Index settings themselves are already in the cluster state so it seems like this would be pretty close to a net 0 change. Also the in-memory metadata you are proposing under the "Cons" section seems like it would basically cancel out any savings here.

Storing the in progress entries in custom cluster state entry would also involve storing of index metadata like index name and index uuid at the minimum to be stored which is already present in the Index metadata and would be duplicated in the custom cluster state entry. This could mean some savings in the space for one node, however since the cluster state is replicated across all the nodes in the cluster, the savings would definitely be more. The storing of metadata is a suggestion for optimization and the metadata is kept only on the master node and not on all the nodes unlike cluster state update. In order to store the shard level metadata as well, the number of cluster state updates would definitely be more in custom cluster state entry as each shard level update would trigger a cluster state update compared to an update in the local metadata of the master node. Hope that answers your question.

Since there is no custom cluster state entry, for checking the indices under-going migration, the entire list of indices need to be traversed to find the in progress indices. However, this can be optimized by keeping in-memory metadata at the master node that needs to be re-computed on master switch or restart.

This might be too in the details for the purposes of this issue, but it seems to me that you would need some sort of process to constantly iterate over all of the indices in order to monitor the migration status (which I understand is basically what option 3 is describing). At a high level, how would you know if something is not working for one of your state transitions or how would handle state transition failures? For example if there is some sort of allocation decider issue preventing shard relocation then how would we handle that?

TieringService would have the logic to deal with all the scenarios. Let me explain from design 2's perspective - for non-dedicated setup, the TieringService is triggering an API to the composite directory for the shard level updates, so if there is any failure, the TieringService is aware of the failure and has the capability to deal with the failures either by a retry or marking tiering as failed. For the dedicated node setup, Tiering Service listens to the shard relocation completion notification from each of the shard. Tiering Service has the knowledge about all the shards for a given index. If there is, say one shard not able to complete the relocation, Tiering Service can figure out the reason of stuck relocation by calling the _cluster/allocation/explain for the stuck shard and give the reason in the status api output. Or, the other option is to keep the status as running shard relocation and customers need to call the cluster allocation explain api to find the reason of the stuck allocation. Thoughts?

Aside from this, it would be really helpful to add some reasons for why option 2 is preferred since (at least for me) it's hard to understand the judgement call from just the list of pros/cons.

Number of cluster state updates, saving on some space as called out above, being able to retry for the failures in non-dedicated setup, not losing the accepted migration requests, not needing to determine the polling interval (as in option 3/4) and not overwhelming the cluster with too many get calls done by the polling service makes option 2 as preferred. These points are already called out in the pros/cons though. Let me know if you have any specific question regarding any pros/cons of any of the design choices.

Thanks for the proposal @neetikasinghal I agree with approach 2 of using internal API calls for shard level update. Like you pointed out, it should be easier to track failures and setup retries with this approach.

Few question:

On master flips or failure or reboot of all master nodes, the accepted migration requests are not lost

How are we ensuring this with Option 2?

For each of the accepted request, there is a cluster state update triggered which updates the index settings. The cluster state is replicated across all the master nodes, so base on the index settings, TieringService should be able to find out the all accepted requests and be able to re-construct the lost metadata on master flip/restart.

Also, since we aren't maintaining any metadata now, I was wondering how would we figure out which shards are warm? Yes, we can check the index level setting. But i wanted to check your thought on how to support something like _cat/indices/_warm. Do we leverage the in-memory data on cluster manager node?

_cat/indices already has a transport api call to get index settings to get all the index level settings. We would get index.tiering.state as well as part of the get call for index settings that can be used to show the tiering of index as hot/warm in the _cat/indices output.

I would avoid a polling service to start with, I gave a basic pass wondering how is this substantially different from a snapshot custom entry where different shards undergo snapshots at various points?

@neetikasinghal Thanks for the proposal. Couple of suggestion:

In the preferred approach, I think we can start with not having an explicit API for the mode switch between full/partial, instead we can have an internal service component at data node level which can provide the same functionality. This component can trigger shard level actions based on index setting update in the cluster state (similar to ShardSnapshotService) and can also differentiate between dedicated vs non-dedicated setup as the need be. This way we will not need to think about API behavior outside of the tiering service which could be confusing as it is more of an internal details.
For shard allocation of the warm index type. I think we should have that logic in the allocation decider based on the index setting which differentiate hot/warm index (i.e. index.store.data_locality) and not based on require/exclude setting. This way we define the default behavior on where shards for this index will always be allocated and depending on node configuration it will choose right set of nodes. With idx setting based approach, users are free to change it and can create confusions or issues when shards don't move.
We will need to enumerate existing APIs to understand where all this tiering dimension can be added and will make sense for users. This can be a separate issue.

I would avoid a polling service to start with, I gave a basic pass wondering how is this substantially different from a snapshot custom entry where different shards undergo snapshots at various points?

@Bukhtawar the proposed approach is not have a custom cluster state entry, rather rely on the index settings (that are updated on the cluster state) and in-memory metadata on the master node to store metadata of the shards. Please refer to design 2 for more details and let me know in case you have further questions.

[Triage - attendees 1 2 3 4 5 6 7 8] @neetikasinghal Thanks for creating this RFC

opensearch-project / OpenSearch

[RFC] Tiering/Migration of indices from hot to warm where warm indices are mutable #13294

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Hot to Warm Tiering

DESIGN 1: Requests served via cluster manager node with cluster state entry and push notification from shards

(Preferred) DESIGN 2: Requests served via cluster manager node with no cluster state entry and internal API call for shard level update

DESIGN 3: Requests served via cluster manager node with no Cluster state entry and Polling Service

DESIGN 4: Requests served via cluster manager node with Cluster state entry and Polling Service

Open Questions and next steps

Related component