Open dzane17 opened 2 months ago
Thanks @dzane17. Adding @ankitkala @saikaranam-amazon to provide some additional feedback from the CCR side.
@dzane17 Like we have for phase_took, would it make sense to capture the following details in an object?
The last 2 allow the average to be derived, and its deviation from the maximum or minimum can give an indication of a possible problem.
One thing to keep in mind is that, unlike the latencies across shards of an index, the hop latencies between the primary coordinator and the different remote clusters' coordinators can be heavily skewed and can vary substantially.
Also, what would we do about the phase_took metrics coming from the multiple coordinators?
phase_took objects? If we are spitting out a list... we probably need 2 modes, with the list being part of a verbose mode (which will also carry the hop latency) and the default being max projections in addition to the 4 mentioned above.
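To make the two-mode idea concrete, here is a rough sketch of how a response could switch between a verbose per-cluster list and a default max projection. Everything in it is an assumption for illustration only: the function name `render_remote_phase_took`, the `verbose` flag, the `hop_latency` and `phase_took_max` field names, and the sample numbers are hypothetical and are not part of any existing OpenSearch API. The other default fields referred to above are not modeled.

```python
def render_remote_phase_took(per_cluster, verbose=False):
    """Render per-remote-cluster timings in one of two modes.

    per_cluster maps a (hypothetical) cluster alias to
    {"hop_latency": ms, "phase_took": {phase_name: ms}}.
    """
    if verbose:
        # Verbose mode: one entry per remote coordinator, including hop latency.
        return [
            {"cluster": alias, **details}
            for alias, details in sorted(per_cluster.items())
        ]

    # Default mode: keep only the per-phase maximum across all coordinators.
    projected = {}
    for details in per_cluster.values():
        for phase, took in details["phase_took"].items():
            projected[phase] = max(projected.get(phase, 0), took)
    return {"phase_took_max": projected}


# Illustrative input; the values are made up.
sample = {
    "cluster_a": {"hop_latency": 12, "phase_took": {"query": 30, "fetch": 5}},
    "cluster_b": {"hop_latency": 180, "phase_took": {"query": 22, "fetch": 9}},
}
default_view = render_remote_phase_took(sample)
verbose_view = render_remote_phase_took(sample, verbose=True)
```

The default view stays small regardless of how many remote clusters are involved, while the verbose view keeps enough detail to see which coordinator is the slow one.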
Is your feature request related to a problem? Please describe
There is currently insufficient visibility into roundtrip cost during Cross-Cluster Search (CCS) requests, complicating the debugging of performance issues. This challenge has recently been encountered by multiple users. Since remote clusters can be located anywhere in the world, they are subject to potentially high network latency. While existing search latency tracking features operate effectively within a single cluster, they do not extend to the CCS use case.
Describe the solution you'd like
Introduce new metrics to track network hops during CCS requests in the search response and/or nodes stats output. These metrics can be presented as cumulative, maximum, or average values.
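As a minimal sketch of what cumulative, maximum, and average could mean here, the snippet below aggregates one hop latency per remote cluster. The function name `summarize_hop_latencies`, the output keys, and the sample latencies are hypothetical and only illustrate the arithmetic; they are not a proposed field layout.

```python
def summarize_hop_latencies(hop_millis):
    """Aggregate per-remote-cluster hop latencies given in milliseconds.

    hop_millis maps a (hypothetical) cluster alias to one measured
    round-trip latency for the request.
    """
    values = list(hop_millis.values())
    if not values:
        # No remote clusters were involved in the request.
        return {"total": 0, "max": 0, "avg": 0.0, "count": 0}
    total = sum(values)
    return {
        "total": total,               # cumulative time spent on network hops
        "max": max(values),           # slowest remote cluster
        "avg": total / len(values),   # derivable from total and count
        "count": len(values),
    }


# Illustrative values only.
hops = {"cluster_a": 12, "cluster_b": 180, "cluster_c": 48}
summary = summarize_hop_latencies(hops)
```

With these sample values the maximum (180 ms) sits far above the average (80 ms), which is the kind of gap that would point at one slow remote cluster rather than uniformly slow networking.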
Search Response Example
Related component
Search:Remote Search
Describe alternatives you've considered
Alternate response format:
Additional context
Related CCS phase_took bug: https://github.com/opensearch-project/OpenSearch/issues/15961