
[RFC] Tiering/Migration of indices from hot to warm where warm indices are mutable #13294


neetikasinghal commented 6 months ago

Is your feature request related to a problem? Please describe

This proposal presents the different design choices for tiering an index from hot to warm, where warm indices remain writable, and identifies the preferred design. It is related to the RFC https://github.com/opensearch-project/OpenSearch/issues/12809. The tiering APIs that define the customer experience have already been discussed in https://github.com/opensearch-project/OpenSearch/issues/12501.

Describe the solution you'd like

Hot to Warm Tiering

API: POST /<indexNameOrPattern>/_tier/_warm

{
 "" : ""  // supporting body to make it extensible for future use-cases
}

Response:

Success:

{
"acknowledged": true
}

Failure:

{
    "error": {
        "root_cause": [
            {
                "type": "",
                "reason": ""
            }
        ]
    },
    "status": xxx
}
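
For illustration, tiering a single index with the empty extensible body above might look like the following (the index name is hypothetical):

POST /logs-2024-04/_tier/_warm
{
}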

Two cases are considered for each of the design choices below:

Case 1: Dedicated setup - a cluster with dedicated warm nodes
Case 2: Non-dedicated/shared node setup - a cluster without dedicated warm nodes

DESIGN 1: Requests served via cluster manager node with cluster state entry and push notification from shards

In this design, a custom cluster state entry is introduced to store the metadata of the indices undergoing tiering. In the non-dedicated setup, the hot-warm node listens for cluster state updates and, on detecting that index.store.data_locality (introduced in the PR) has changed from FULL to PARTIAL, triggers the shard-level update on the composite directory. In the dedicated setup, allocation routing settings help relocate the shards from the data nodes to the dedicated search nodes, and the index locality setting value of PARTIAL ensures the shard is initialized on the dedicated node as a PARTIAL shard.
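
As a rough sketch, the locality change that the listening nodes react to could be expressed as the following settings update (the index name is hypothetical, and in the actual design this change would be applied internally as part of a cluster state update rather than by an external request):

PUT /logs-2024-04/_settings
{
  "index.store.data_locality": "PARTIAL"  // previously FULL
}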

[Design 1 diagram]

Pros

Cons

(Preferred) DESIGN 2: Requests served via cluster manager node with no cluster state entry and internal API call for shard level update

In this design, no custom cluster state entry is stored in the cluster state. In the dedicated warm nodes setup, the TieringService adds allocation settings to the index (index.routing.allocation.require._role: search, index.routing.allocation.exclude._role: data) along with the other tiering-related settings shown in the diagram below. When the re-route is triggered, the allocation deciders run and decide to relocate the shard from a hot node to a warm node. index.store.data_locality: PARTIAL helps initialize the shard in PARTIAL mode on the warm node during shard relocation. In the non-dedicated setup, a new API (more details will be provided in a separate issue) is used to trigger the shard-level update on the composite directory. This API can also be used for retrying in case of failures.
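
For the dedicated setup, the settings applied by the TieringService could look roughly like the following (the index name and the request form are illustrative; the setting names are the ones listed above):

PUT /logs-2024-04/_settings
{
  "index.routing.allocation.require._role": "search",
  "index.routing.allocation.exclude._role": "data",
  "index.store.data_locality": "PARTIAL"
}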

To track the shard-level details of tiering, the status API would give more insight into the relocation status in the dedicated nodes setup and into the shard-level updates in the non-dedicated setup. More details on the status API will be covered in a follow-up issue.

[Design 2 diagram]

Pros

Cons

DESIGN 3: Requests served via cluster manager node with no cluster state entry and polling service

In this design, instead of relying on notifications from the shards, a polling service runs periodically and checks the status of the shards of the indices undergoing tiering.

About the Tiering Polling Service

The tiering polling service runs on the master node on a schedule, every x seconds as defined by an interval setting. A cluster-wide dynamic setting can be used to configure the polling interval, and the interval can differ between the dedicated and shared node setups since migrations are expected to be faster in the shared node setup.

The polling service begins by checking whether there are any in-progress migrations. If there are none, it does nothing. If there are in-progress index migrations, it calls an API on the composite directory to check the status of each shard of the in-progress migrations. Once all shards of an index have succeeded, the polling service updates the index settings of the successful indices.
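
A minimal sketch of such a dynamic cluster setting, assuming a hypothetical setting name (the proposal only states that the interval should be a cluster-wide dynamic setting):

PUT /_cluster/settings
{
  "persistent": {
    "cluster.tiering.polling.interval": "30s"  // hypothetical setting name and value
  }
}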

The caveat with this design is that, with multiple indices in the in-progress state, the polling service has to call the status API to check the status of all shards of the in-progress indices. This could include shards that were already successful in the previous run of the polling service; because one or two other shards of the same index were still relocating, the status check has to be redone in the next run. To optimize this, we can store some in-memory metadata with the information on the in-progress indices and the status of their shards. However, this metadata will be lost on a master switch or a master going down. Design 4 tries to deal with this limitation.

[Design 3 diagram]

Pros

Cons

DESIGN 4: Requests served via cluster manager node with cluster state entry and polling service

This design is similar to design 3, except that a custom cluster state entry stores the in-progress migrations, removing the need to keep local metadata on the master node and avoiding re-computation of that metadata on a master node switch.

[Design 4 diagram]

Pros

Cons

Open Questions and next steps

Related component

Search:Remote Search

neetikasinghal commented 6 months ago

Would love to get feedback from the community for this: @andrross @sohami @reta @mch2 @shwetathareja @Bukhtawar @rohin @ankitkala

shwetathareja commented 6 months ago

@neetikasinghal thanks for sharing the different design options.

nitpick : Please use ClusterManager in place of Master terminology.

Trying to understand Option 2 which is your preferred choice:

While index.store.data_locality helps in initializing the shard in the PARTIAL mode in case of the dedicated nodes setup, it's not used as a trigger in the non-dedicated setup. In non-dedicated setup, there is a new API

Dedicated here refers to a dedicated warm node? Ideally, the tiering logic shouldn't be tightly coupled with the cluster configuration, i.e. whether it has dedicated warm nodes or not. Dedicated warm nodes basically dictate the shard allocation logic: whether a shard stays on the current node post-tiering or moves to another node because it is not eligible for that node anymore.

Since there is no custom cluster state entry, for checking the indices undergoing migration, the entire list of indices needs to be traversed to find the in-progress indices. However, this can be optimized by keeping in-memory metadata at the master node that needs to be re-computed on a master switch or restart.

When is this accessed: at the time of the migration status API, or otherwise as well?

neetikasinghal commented 6 months ago

thanks @shwetathareja for your feedback.

Trying to understand Option 2 which is your preferred choice:

While index.store.data_locality helps in initializing the shard in the PARTIAL mode in case of the dedicated nodes setup, it's not used as a trigger in the non-dedicated setup. In non-dedicated setup, there is a new API

Dedicated here refers to a dedicated warm node? Ideally, the tiering logic shouldn't be tightly coupled with the cluster configuration, i.e. whether it has dedicated warm nodes or not. Dedicated warm nodes basically dictate the shard allocation logic: whether a shard stays on the current node post-tiering or moves to another node because it is not eligible for that node anymore.

Yes, dedicated refers to dedicated warm nodes. In the dedicated warm nodes setup, the TieringService adds allocation settings to the index (index.routing.allocation.require._role: search, index.routing.allocation.exclude._role: data) along with the other tiering-related settings shown in the diagram above for design 2. When the re-route is triggered, the allocation deciders run and decide to relocate the shard from a hot node to a warm node. index.store.data_locality: PARTIAL helps initialize the shard in PARTIAL mode on the warm node during shard relocation. Hope this clarifies; let me know if you have further questions.

Since there is no custom cluster state entry, for checking the indices undergoing migration, the entire list of indices needs to be traversed to find the in-progress indices. However, this can be optimized by keeping in-memory metadata at the master node that needs to be re-computed on a master switch or restart.

When is this accessed: at the time of the migration status API, or otherwise as well?

Correct, this is accessed at the time of tiering status API.

jed326 commented 6 months ago

Focusing on option 2:

No cluster state entry is kept for migrations, so there is some space we save on not keeping the custom cluster state

Do you have any idea what the magnitude of space saving here would be? New Index settings themselves are already in the cluster state so it seems like this would be pretty close to a net 0 change. Also the in-memory metadata you are proposing under the "Cons" section seems like it would basically cancel out any savings here.

Since there is no custom cluster state entry, for checking the indices undergoing migration, the entire list of indices needs to be traversed to find the in-progress indices. However, this can be optimized by keeping in-memory metadata at the master node that needs to be re-computed on a master switch or restart.

This might be too deep into the details for the purposes of this issue, but it seems to me that you would need some sort of process to constantly iterate over all of the indices in order to monitor the migration status (which I understand is basically what option 3 is describing). At a high level, how would you know if something is not working for one of your state transitions, and how would you handle state transition failures? For example, if there is some sort of allocation decider issue preventing shard relocation, then how would we handle that?

Aside from this, it would be really helpful to add some reasons for why option 2 is preferred since (at least for me) it's hard to understand the judgement call from just the list of pros/cons.

ankitkala commented 6 months ago

Thanks for the proposal @neetikasinghal. I agree with approach 2 of using internal API calls for the shard-level update. Like you pointed out, it should be easier to track failures and set up retries with this approach.

A few questions:

  • On master flips or failure or reboot of all master nodes, the accepted migration requests are not lost

How are we ensuring this with Option 2?

neetikasinghal commented 6 months ago

Focusing on option 2:

No cluster state entry is kept for migrations, so there is some space we save on not keeping the custom cluster state

Do you have any idea what the magnitude of space saving here would be? New Index settings themselves are already in the cluster state so it seems like this would be pretty close to a net 0 change. Also the in-memory metadata you are proposing under the "Cons" section seems like it would basically cancel out any savings here.

Storing the in-progress entries in a custom cluster state entry would also involve storing, at a minimum, index metadata such as the index name and index UUID, which is already present in the index metadata and would be duplicated in the custom cluster state entry. This could mean some space savings on a single node, and since the cluster state is replicated across all the nodes in the cluster, the overall savings would be even greater. The in-memory metadata is a suggested optimization, and it is kept only on the master node rather than on all the nodes, unlike a cluster state update. To store shard-level metadata as well, the number of cluster state updates would definitely be higher with a custom cluster state entry, as each shard-level update would trigger a cluster state update, compared to an update of the local metadata on the master node. Hope that answers your question.

Since there is no custom cluster state entry, for checking the indices undergoing migration, the entire list of indices needs to be traversed to find the in-progress indices. However, this can be optimized by keeping in-memory metadata at the master node that needs to be re-computed on a master switch or restart.

This might be too deep into the details for the purposes of this issue, but it seems to me that you would need some sort of process to constantly iterate over all of the indices in order to monitor the migration status (which I understand is basically what option 3 is describing). At a high level, how would you know if something is not working for one of your state transitions, and how would you handle state transition failures? For example, if there is some sort of allocation decider issue preventing shard relocation, then how would we handle that?

The TieringService would have the logic to deal with all these scenarios. Let me explain from design 2's perspective. For the non-dedicated setup, the TieringService triggers an API call to the composite directory for the shard-level updates, so if there is any failure, the TieringService is aware of it and can deal with it either by retrying or by marking the tiering as failed. For the dedicated node setup, the TieringService listens to the shard relocation completion notification from each of the shards and has knowledge of all the shards for a given index. If, say, one shard is not able to complete its relocation, the TieringService can figure out the reason for the stuck relocation by calling _cluster/allocation/explain for the stuck shard and surface the reason in the status API output. Alternatively, the status could remain as running shard relocation and customers would need to call the cluster allocation explain API to find the reason for the stuck allocation. Thoughts?
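
For reference, checking a stuck shard with the existing allocation explain API would look something like the following (the index name and shard number are illustrative):

GET /_cluster/allocation/explain
{
  "index": "logs-2024-04",
  "shard": 0,
  "primary": true
}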

Aside from this, it would be really helpful to add some reasons for why option 2 is preferred since (at least for me) it's hard to understand the judgement call from just the list of pros/cons.

The lower number of cluster state updates, the space savings called out above, being able to retry failures in the non-dedicated setup, not losing the accepted migration requests, not needing to determine a polling interval (as in options 3/4), and not overwhelming the cluster with the many get calls made by the polling service make option 2 the preferred choice. These points are already called out in the pros/cons, though. Let me know if you have any specific question regarding the pros/cons of any of the design choices.

neetikasinghal commented 6 months ago

Thanks for the proposal @neetikasinghal. I agree with approach 2 of using internal API calls for the shard-level update. Like you pointed out, it should be easier to track failures and set up retries with this approach.

A few questions:

  • On master flips or failure or reboot of all master nodes, the accepted migration requests are not lost

How are we ensuring this with Option 2?

For each accepted request, a cluster state update is triggered that updates the index settings. The cluster state is replicated across all the master nodes, so based on the index settings, the TieringService should be able to find all the accepted requests and re-construct the lost metadata on a master flip/restart.

  • Also, since we aren't maintaining any metadata now, I was wondering how we would figure out which shards are warm? Yes, we can check the index-level setting. But I wanted to check your thoughts on how to support something like _cat/indices/_warm. Do we leverage the in-memory data on the cluster manager node?

_cat/indices already makes a transport call to get the index settings, which returns all the index-level settings. index.tiering.state would be returned as part of that get-settings call and can be used to show the tiering state of the index as hot/warm in the _cat/indices output.
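
As an illustration, the proposed index.tiering.state setting could be read back with the existing filtered get-settings API; this is also how the TieringService could rediscover accepted migrations after a master flip, as described above (index.tiering.state is the proposed setting name, not an existing setting):

GET /_all/_settings/index.tiering.state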

Bukhtawar commented 6 months ago

I would avoid a polling service to start with. I gave a basic pass and am wondering how this is substantially different from a snapshot custom entry, where different shards undergo snapshots at various points?

sohami commented 6 months ago

@neetikasinghal Thanks for the proposal. A couple of suggestions:

neetikasinghal commented 6 months ago

I would avoid a polling service to start with. I gave a basic pass and am wondering how this is substantially different from a snapshot custom entry, where different shards undergo snapshots at various points?

@Bukhtawar the proposed approach is to not have a custom cluster state entry, but rather to rely on the index settings (which are updated in the cluster state) and on in-memory metadata on the master node to store the shard metadata. Please refer to design 2 for more details and let me know in case you have further questions.

peternied commented 6 months ago

[Triage - attendees] @neetikasinghal Thanks for creating this RFC