opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Feature Request] Cross cluster search network hop metrics #15971

Open dzane17 opened 2 months ago

dzane17 commented 2 months ago

Is your feature request related to a problem? Please describe

There is currently insufficient visibility into roundtrip cost during Cross-Cluster Search (CCS) requests, complicating the debugging of performance issues. This challenge has recently been encountered by multiple users. Since remote clusters can be located anywhere in the world, they are subject to potentially high network latency. While existing search latency tracking features operate effectively within a single cluster, they do not extend to the CCS use case.

Describe the solution you'd like

Introduce new metrics to track network hops during CCS requests in the search response and/or nodes stats output. These metrics can be presented as cumulative, maximum, or average values.
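
As a rough illustration of the three presentations mentioned above, the sketch below collapses a list of per-hop latencies into cumulative, maximum, and average values. The input list, the `HopSummary` type, and the `summarize_hops` function are hypothetical names chosen for this example; nothing here reflects an existing OpenSearch API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HopSummary:
    cumulative_ms: int  # sum of all hop latencies
    max_ms: int         # slowest single hop
    avg_ms: float       # mean hop latency


def summarize_hops(hop_latencies_ms: Sequence[int]) -> HopSummary:
    """Collapse per-hop latencies (hypothetical input) into the three proposed views."""
    if not hop_latencies_ms:
        # No remote hops recorded: report zeros rather than dividing by zero.
        return HopSummary(0, 0, 0.0)
    total = sum(hop_latencies_ms)
    return HopSummary(total, max(hop_latencies_ms), total / len(hop_latencies_ms))


# Three made-up hop latencies, in milliseconds.
summary = summarize_hops([12, 8, 10])
```

A cumulative value is the cheapest to maintain, while the maximum is what would most directly point at a slow remote cluster.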

Search Response Example

GET /remote_cluster:my_index/_search
{
  "query": {
    "match_all": {}
  }
}

{
  "took": 70,
  "ccs_network_took": 30,      // new output
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 200,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "my_index",
        "_id": "1",
        "_score": 1.0,
        "_source": {
          "field1": "value1",
          "field2": "value2"
        }
      },
      {
        "_index": "my_index",
        "_id": "2",
        "_score": 1.0,
        "_source": {
          "field1": "value3",
          "field2": "value4"
        }
      }
      // Additional documents would follow...
    ]
  }
}

Related component

Search:Remote Search

Describe alternatives you've considered

Alternate response format:

{
  "took": 70,
  "ccs_roundtrips": {
    "total": 105,
    "count": 12
  },
  ...
}
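
With this alternate format, a client could derive an average round-trip time on its own. The sketch below assumes that `total` is the cumulative round-trip time in milliseconds and `count` is the number of round trips; the issue does not spell out either meaning, and `average_roundtrip_ms` is a hypothetical helper.

```python
import json
from typing import Optional

# Hypothetical response body limited to the fields shown in the alternate format.
response_text = '{"took": 70, "ccs_roundtrips": {"total": 105, "count": 12}}'


def average_roundtrip_ms(response: dict) -> Optional[float]:
    """Return total/count, or None when no round trips were reported."""
    roundtrips = response.get("ccs_roundtrips")
    if not roundtrips or roundtrips.get("count", 0) == 0:
        return None
    return roundtrips["total"] / roundtrips["count"]


avg = average_roundtrip_ms(json.loads(response_text))
```

Reporting the two raw values instead of a precomputed average keeps the response additive, so totals and counts from several responses can be summed before dividing.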

Additional context

Related CCS phase_took bug: https://github.com/opensearch-project/OpenSearch/issues/15961

getsaurabh02 commented 1 month ago

Thanks @dzane17. Adding @ankitkala @saikaranam-amazon to provide some additional feedback from the CCR side.

kkhatua commented 1 month ago

@dzane17 Like we have for phase_took, would it make sense to capture the following details in an object?

(The last 2 allow the average to be derived, and its deviation from the maximum or minimum can give an indication of a possible problem.)
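
A minimal sketch of that idea follows. It assumes the object carries a maximum, a minimum, a total, and a count, with the last two being total and count; those four names are a guess for illustration only, and `hop_skew` is a hypothetical helper.

```python
from typing import Tuple


def hop_skew(max_ms: float, min_ms: float, total_ms: float, count: int) -> Tuple[float, float, float]:
    """Return (average, max - average, average - min) for the assumed fields."""
    if count <= 0:
        raise ValueError("count must be positive to derive an average")
    avg = total_ms / count
    return avg, max_ms - avg, avg - min_ms


# Made-up values: one hop far slower than the rest.
avg_ms, above_avg_ms, below_avg_ms = hop_skew(max_ms=40, min_ms=5, total_ms=105, count=12)
```

A maximum far above the average, as in this made-up case, would suggest that a single remote coordinator dominates the network cost.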

One thing to keep in mind is that, unlike with the shards of an index, the hop latencies between the primary coordinator and the different remote clusters' coordinators can be substantially skewed.

Also, what would we do about the phase_took metrics coming from the multiple coordinators?

If we return a list, we probably need 2 modes: a verbose mode that includes the list (and also carries the hop latency), and a default mode that reports max projections in addition to the 4 mentioned above.