opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Proposal] Tiering Status API Model and Design #14989

Open e-emoto opened 2 months ago

e-emoto commented 2 months ago

Is your feature request related to a problem? Please describe

The Status API in Tiering will list in-progress and failed index tierings. Since the Tiering project is still under development, the API should be extensible to cover new cases such as dedicated and non-dedicated warm node clusters. The design explanations here focus on the hot to warm case, but take future uses of the API into consideration.

Describe the solution you'd like

API Models:

The API will use a source and target as input to filter which tierings are shown. It will validate that both inputs are valid tiers, and then use them to find any tierings that match the described type. The API should still work if only one of the source or target is given, and will find any tierings with that input, allowing for more flexible queries. In the default case if no source or target is given as input, the status API should return all in progress or failed tierings for the specified indices, regardless of the tiering change. There will be two APIs for status: a GET API and a _cat API.
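
The filtering behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Tiering` record, the `filter_tierings` helper, and the `VALID_TIERS` set are hypothetical names invented for this example and are not part of OpenSearch or of the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: only the two tiers named in this proposal exist.
VALID_TIERS = {"hot", "warm"}


@dataclass(frozen=True)
class Tiering:
    """Hypothetical record of one in-progress or failed index tiering."""
    index: str
    source: str
    target: str
    state: str  # "active" or "failed"


def filter_tierings(tierings: List[Tiering],
                    source: Optional[str] = None,
                    target: Optional[str] = None) -> List[Tiering]:
    """Return the tierings that match the given source and/or target tier."""
    # Validate that each supplied input is a known tier.
    for name, value in (("source", source), ("target", target)):
        if value is not None and value not in VALID_TIERS:
            raise ValueError(f"invalid {name} tier: {value!r}")
    # A missing source or target acts as a wildcard.
    return [
        t for t in tierings
        if (source is None or t.source == source)
        and (target is None or t.target == target)
    ]


tierings = [
    Tiering("test1", "hot", "warm", "active"),
    Tiering("test2", "warm", "hot", "active"),
    Tiering("test3", "hot", "warm", "failed"),
]
```

Calling `filter_tierings(tierings, source="hot")` keeps test1 and test3, while calling it with no source or target returns every entry, which corresponds to the default case.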

GET API:

API Request:

GET /<indexNameOrPattern>/_tiering?source=hot&target=warm
GET /<indexNameOrPattern>/_tiering?state=active
GET /<indexNameOrPattern>/_tiering?detailed=true/false

The GET API would have a few parameters. The index name in the path is required, but _all or * can be used to get tierings from all indices that match the parameters. The API will also support comma-separated index names.

API Parameters:

source = hot / warm (optional, no default)
target = hot / warm (optional, no default)

The values for the source and target parameters are the tiers, with source being the tier the index started in and target being the tier it is moving to.

state = failed / active (optional, no default)

The values of the state parameter represent the state of the tiering. failed indicates that the tiering has failed and active means the tiering process is in progress.

detailed = true / false (default false)

The detailed parameter determines whether the GET API response should include details like the shard relocation status and tiering start time.

local = true / false (optional, default false)

The local parameter determines where the request retrieves information from: if true, from a data node; if false, from the master node.

API Response:

Success:

GET API
{
    "test1": {
        "source": "hot",
        "target": "warm",
        "state": "active",
        "duration": "10:00:00"
    }
}

GET API with detailed flag
{
    "test1": {
        "source": "hot",
        "target": "warm",
        "state": "active",
        "start_time": "2024-06-27T00:00:00Z",
        "duration": "10:00:00",
        "shards": {
            "total": 10,
            "successful": 3,
            "failed": 2,
            "active": 5
        }
    }
}
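
As a rough illustration of how the two response shapes relate, the sketch below builds one index entry with and without the detailed fields. Everything here is an assumption made for the example: `build_entry` and `format_duration` are hypothetical helpers, and the per-shard states are passed in as a plain list of strings rather than read from a real cluster.

```python
from collections import Counter
from datetime import datetime, timezone


def format_duration(start: datetime, now: datetime) -> str:
    """Format elapsed time as HH:MM:SS, as in the example responses."""
    total = int((now - start).total_seconds())
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def build_entry(source, target, state, start, now, shard_states, detailed=False):
    """Build the response body for one index (hypothetical helper)."""
    entry = {"source": source, "target": target, "state": state}
    if detailed:
        entry["start_time"] = start.strftime("%Y-%m-%dT%H:%M:%SZ")
    entry["duration"] = format_duration(start, now)
    if detailed:
        # Aggregate per-shard states into the summary counts.
        counts = Counter(shard_states)
        entry["shards"] = {
            "total": len(shard_states),
            "successful": counts["successful"],
            "failed": counts["failed"],
            "active": counts["active"],
        }
    return entry


start = datetime(2024, 6, 27, 0, 0, 0, tzinfo=timezone.utc)
now = datetime(2024, 6, 27, 10, 0, 0, tzinfo=timezone.utc)
shard_states = ["successful"] * 3 + ["failed"] * 2 + ["active"] * 5
```

With these inputs, the non-detailed call yields the four fields of the first example, and `detailed=True` adds `start_time` and the `shards` summary.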

Failure:

{
    "error": {
        "root_cause": [
            {
                "type": "",
                "reason": ""
            }
        ]
    },
    "status": xxx
}

_cat API:

API Request:

/_cat/tiering?source=hot&target=warm
/_cat/tiering?state=active

The _cat API would have some of the same parameters as GET, but would also have additional parameters for formatting and filtering the response columns.

API Parameters:

source = hot / warm (optional, no default)
target = hot / warm (optional, no default)

The values for the source and target parameters are the tiers, with source being the tier the index started in and target being the tier it is moving to.

state = failed / active (optional, no default)

The values of the state parameter represent the state of the tiering. failed indicates that the tiering has failed and active means the tiering process is in progress.

index = index1,index2,... (optional, default _all)

The index is a comma separated list of index names used to filter the responses.

h = index,source,target,state,start_time,failure_time,duration,shards_total,shards_successful,shards_active,shards_failed (optional, no default)

The h parameter is only for the _cat API, and it would be used to filter which columns are shown in the response. If this parameter is not passed to the API call, then it will show just the index, source, target, state, and duration columns by default.

v = true / false (optional, default false)

If the v parameter is true, the response will include the column labels as the first row of the response.

s = index,source,target,state,start_time,... (optional, no default)

The s parameter is a comma separated list of column names used to sort the rows in the response.

API Response:

Success:

_cat API
index | source | target | state  | duration
test1 | hot    | warm   | active | 10:00:00

_cat API with all columns
index | source | target | state  | start_time           | duration | shards_total | shards_successful | shards_active | shards_failed 
test1 | hot    | warm   | active | 2024-06-27T00:00:00Z | 10:00:00 | 10           | 3                 | 5             | 2             
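
To make the interaction of the h, v, and s parameters concrete, here is a minimal rendering sketch. It is not how OpenSearch implements _cat output; `render_cat` and `DEFAULT_COLUMNS` are hypothetical names, rows are plain dictionaries, and sorting is a simple string sort, which is an assumption of this example.

```python
# Assumption: default columns follow the description of the h parameter.
DEFAULT_COLUMNS = ["index", "source", "target", "state", "duration"]


def render_cat(rows, h=None, v=False, s=None):
    """Render rows (a list of dicts) as pipe-separated text.

    h: comma-separated column names to show (defaults to DEFAULT_COLUMNS)
    v: if true, include the column labels as the first row
    s: comma-separated column names to sort the rows by
    """
    columns = h.split(",") if h else DEFAULT_COLUMNS
    if s:
        keys = s.split(",")
        rows = sorted(rows, key=lambda r: tuple(str(r[k]) for k in keys))
    table = [[str(r[c]) for c in columns] for r in rows]
    if v:
        table.insert(0, list(columns))
    if not table:
        return ""
    # Pad every column to the width of its longest cell.
    widths = [max(len(line[i]) for line in table) for i in range(len(columns))]
    return "\n".join(
        " | ".join(cell.ljust(w) for cell, w in zip(line, widths)).rstrip()
        for line in table
    )


rows = [
    {"index": "test2", "source": "warm", "target": "hot",
     "state": "active", "duration": "01:00:00"},
    {"index": "test1", "source": "hot", "target": "warm",
     "state": "active", "duration": "10:00:00"},
]
```

For example, `render_cat(rows, v=True, s="index")` produces a header row followed by test1 and test2, while `render_cat(rows, h="index,state")` shows only those two columns in the original row order.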

Failure:

{
    "error": {
        "root_cause": [
            {
                "type": "",
                "reason": ""
            }
        ]
    },
    "status": xxx
}

Design: Get Tiering Metadata from Cluster State

Since the status GET and _cat APIs return mostly the same information, just in different formats and with slightly different ways for the customer to interact with them, both can evaluate the status and retrieve the information using the same design.

In this design, the tiering service would store some tiering metadata in the cluster state, and the status API would use that metadata to create its response when called. The migration status is stored in the index settings by the tiering service, while other information like the tiering start time is stored in the index metadata. The status API can use this information from the index settings and metadata to evaluate the tiering status. Since this information is in the cluster state, it would be relatively fast for the status API to access it. Also, because the cluster state is available from both the master node and data nodes, the status API could be called on either type of node.

In the dedicated warm node setup, we could also use the cluster state to check the shard status and determine the tiering progress. However, for the non-dedicated warm node setup, we would need to find another way to check the tiering progress. We could do so by communicating with other nodes through the transport layer, using a service on the data nodes that checks whether shards are complete, in-progress, or failed when the status API is called. Then we could use that shard information to fill out details in the detailed response.

Another option that was considered for shard relocation status in the non-dedicated setup was storing the shard level data locality in the tiering metadata. However, this would require frequent cluster state updates to refresh the values of these fields. This would be very costly when accounting for all the shards across all indices that have ongoing tiering.

Order of Operations:

  1. Request received on master node or data node
  2. Validate the request
    1. Check that the experimental feature flag TIERED_REMOTE_INDEX is enabled
    2. Verify that the parameter values are valid
    3. Find index/indices that match the name/pattern in the path
  3. Check the index settings and index metadata for each applicable index in the cluster state
    1. Check if the index has an active tiering or is in a failed state
    2. Filter out indices that don’t match parameters
    3. If detailed:
      1. Get tiering start time
      2. Get shard status stats from cluster state/data node service
  4. Return response
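
The order of operations above can be sketched end to end as follows. This is a simplified model, not the proposed implementation: the cluster state is represented as a plain dictionary keyed by index name, `tiering_status` is a hypothetical function, the feature flag is passed in as a boolean, and index patterns are matched with Python's `fnmatch` rather than OpenSearch's own index name resolution.

```python
from fnmatch import fnmatchcase

VALID_TIERS = {"hot", "warm"}
VALID_STATES = {"active", "failed"}


def tiering_status(cluster_state, index_pattern, feature_enabled,
                   source=None, target=None, state=None, detailed=False):
    # Step 2.1: the experimental feature flag must be enabled.
    if not feature_enabled:
        raise RuntimeError("TIERED_REMOTE_INDEX is not enabled")
    # Step 2.2: parameter values must be valid.
    for name, value, allowed in (("source", source, VALID_TIERS),
                                 ("target", target, VALID_TIERS),
                                 ("state", state, VALID_STATES)):
        if value is not None and value not in allowed:
            raise ValueError(f"invalid {name}: {value!r}")
    # Step 2.3: resolve the index names or patterns from the path.
    patterns = ["*"] if index_pattern == "_all" else index_pattern.split(",")
    names = [n for n in cluster_state
             if any(fnmatchcase(n, p) for p in patterns)]

    response = {}
    for name in names:
        # Step 3.1: skip indices with no active or failed tiering.
        tiering = cluster_state[name].get("tiering")
        if tiering is None:
            continue
        # Step 3.2: drop tierings that do not match the parameters.
        wanted = {"source": source, "target": target, "state": state}
        if any(v is not None and tiering[k] != v for k, v in wanted.items()):
            continue
        entry = {k: tiering[k] for k in ("source", "target", "state", "duration")}
        # Step 3.3: add start time and shard stats only when requested.
        if detailed:
            entry["start_time"] = tiering["start_time"]
            entry["shards"] = dict(tiering["shards"])
        response[name] = entry
    # Step 4: return the response.
    return response


# Hypothetical cluster state: "logs" has no tiering metadata at all.
cluster_state = {
    "test1": {"tiering": {
        "source": "hot", "target": "warm", "state": "active",
        "start_time": "2024-06-27T00:00:00Z", "duration": "10:00:00",
        "shards": {"total": 10, "successful": 3, "failed": 2, "active": 5}}},
    "test3": {"tiering": {
        "source": "hot", "target": "warm", "state": "failed",
        "start_time": "2024-06-27T00:00:00Z", "duration": "11:00:00",
        "shards": {"total": 4, "successful": 1, "failed": 3, "active": 0}}},
    "logs": {},
}
```

In this model, `tiering_status(cluster_state, "_all", True)` returns entries for test1 and test3 only, and adding `state="failed"` narrows the result to test3.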

[Figure: tiering_status_design]

Pros:

Cons:

Related component

Search:Remote Search

Describe alternatives you've considered

No response

Additional context

Related issues: https://github.com/opensearch-project/OpenSearch/issues/14640 https://github.com/opensearch-project/OpenSearch/issues/14679 https://github.com/opensearch-project/OpenSearch/issues/13294

e-emoto commented 2 months ago

Here are some more example use cases of the APIs:

Get All Ongoing Tierings:

GET /_all/_tiering?state=active

{
    "test1": {
        "source": "hot",
        "target": "warm",
        "state": "active",
        "duration": "10:00:00"
    },
    "test2": {
        "source": "warm",
        "target": "hot",
        "state": "active",
        "duration": "01:00:00"
    },

    ...

}

/_cat/tiering?state=active

test1 | hot    | warm   | active | 10:00:00
test2 | warm   | hot    | active | 01:00:00

/_cat/tiering?state=active&v=true

index | source | target | state  | duration
test1 | hot    | warm   | active | 10:00:00
test2 | warm   | hot    | active | 01:00:00

Get All Failed Hot To Warm Tierings:

GET /_all/_tiering?source=hot&target=warm&state=failed

{
    "test3": {
        "source": "hot",
        "target": "warm",
        "state": "failed",
        "duration": "11:00:00"
    },
    "test4": {
        "source": "hot",
        "target": "warm",
        "state": "failed",
        "duration": "20:00:00"
    },

    ...

}

/_cat/tiering?source=hot&target=warm&state=failed

test3 | hot    | warm   | failed | 11:00:00
test4 | hot    | warm   | failed | 20:00:00

/_cat/tiering?source=hot&target=warm&state=failed&v=true

index | source | target | state  | duration
test3 | hot    | warm   | failed | 11:00:00
test4 | hot    | warm   | failed | 20:00:00

Get Shard Details for a Specific Index Tiering:

GET /target_index/_tiering?detailed=true

{
    "target_index": {
        "source": "hot",
        "target": "warm",
        "state": "active",
        "start_time": "2024-06-27T00:00:00Z",
        "duration": "10:00:00",
        "shards": {
            "total": 10,
            "successful": 4,
            "failed": 0,
            "active": 6
        }
    }
}

/_cat/tiering?index=target_index&h=index,source,target,state,start_time,duration,shards_total,shards_successful,shards_active,shards_failed

target_index | hot    | warm   | active | 2024-06-27T00:00:00Z | 10:00:00 | 10           | 4                 | 6             | 0

/_cat/tiering?index=target_index&h=index,source,target,state,start_time,duration,shards_total,shards_successful,shards_active,shards_failed&v=true

index        | source | target | state  | start_time           | duration | shards_total | shards_successful | shards_active | shards_failed
target_index | hot    | warm   | active | 2024-06-27T00:00:00Z | 10:00:00 | 10           | 4                 | 6             | 0

harishbhakuni commented 2 months ago

Thanks @e-emoto for sharing the proposal. It looks good overall. Just a few minor comments:

  1. state = failed / active: does ongoing or in_progress make more sense than active?
  2. Also, can we use verbose as the query parameter instead of detailed?
  3. Also, can we provide start_time without the detailed/verbose flag? This way verbose/detailed would mean shard-level details of tiering.
  4. Why do we need the _cat API to get ongoing tierings? Looks like both APIs are gonna provide the same information.
  5. Also, should it be Get Tiering Metadata from Cluster State, as we will be using the metadata stored by the tiering service?

dblock commented 2 months ago

e-emoto commented 2 months ago

Thanks for the comments @harishbhakuni

  1. state = failed / active, does ongoing or in_progress makes more sense than active?

We discussed it and decided that active is clearer than ongoing because it includes pending states too.

  2. Also, can we use verbose as query parameter instead of detailed?

We decided to use detailed to make it consistent with other APIs.

  3. Also, can we provide start_time without detailed/verbose flag? this way verbose/detailed would mean shard level details of tiering.

We're trying to keep the non-detailed response simple, so I don't know if we need to include the start time, since that can be gauged from the duration.

  4. Why do we need _cat api to get ongoing tierings? looks like both APIs are gonna provide same information.

The _cat API provides the information in a tabular format, which could be easier to read in some cases.

  5. Also should it be Get Tiering Metadata from Cluster State as we will be using the metadata stored by tiering service?

I think this is a good suggestion, I'll update the name

e-emoto commented 2 months ago

Thanks for your response @dblock

I'll check what we're doing for tiering states and try to make it consistent with that

I think this is a good point, we can change it to duration_in_millis to make it consistent with other time measurements

We're using detailed for the GET API, and v as the parameter for the _cat API

  • Please review the other fields and attempt some consistency :)

I'll review the other fields too

lukas-vlcek commented 2 months ago

We're using detailed for the GET API, and v as the parameter for the _cat API

This sounds confusing to me. Let's see if I am on the same page here.

According to this proposal, the detailed parameter in the REST API will add more fields (and/or nested objects) to the returned JSON response. This is similar to the REST API for Cluster health, which accepts a level parameter. For example, http://localhost:9200/_cluster/health?level=indices will include an indices object with a more detailed breakdown of individual indices and their corresponding health status.

However, in the _cat API the v parameter has a completely different role. It does not change the number of columns included in the response. It adds a header row.

Perhaps the short sentence is just missing some context, and that is why I am confused :-)

lukas-vlcek commented 2 months ago

One more detail: the proposal discusses two cases of a node receiving the REST API request: a) the node is a cluster manager, or b) the node is a data node. I think it is just a detail, but if the receiving node is neither of these, for example if it is just a search node and the local option is used, then the response should not include any indices, right?