jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

[Bug]: Elastic/Opensearch backend does not return all trace spans #4330

Closed Jakob3xD closed 1 year ago

Jakob3xD commented 1 year ago

What happened?

The Jaeger query service does not resolve all spans of a trace with the Elasticsearch/Opensearch backend. This happens because the _msearch request sets the terminate_after option in the body, which results in a wrong hits.total value. Due to the wrong hits.total value, the search loop stops before getting all spans.

Steps to reproduce

  1. Setup Jaeger with an Elasticsearch or Opensearch backend
  2. Ingest a trace with more than 40k spans and leave --es.max-doc-count untouched
  3. Try to get the trace via the API or UI

Expected behavior

I would expect jaeger to resolve all spans from the trace.

Relevant log output

No response

Screenshot

No response

Additional context

terminate_after is set to the max doc count. https://github.com/jaegertracing/jaeger/blob/main/plugin/storage/es/spanstore/reader.go#L208

The total hits value is compared to the number of trace span hits collected so far. https://github.com/jaegertracing/jaeger/blob/main/plugin/storage/es/spanstore/reader.go#L432
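To illustrate why a capped hits.total breaks that comparison, here is a minimal, self-contained Go sketch of the fetch loop. It is a simplified model with hypothetical names (fetchPage, reportedTotal), not the actual reader.go code: the loop keeps requesting pages until the number of collected spans reaches the hits.total reported by the backend, and with terminate_after that total is capped at roughly shards * max-doc-count instead of the real span count.

package main

import "fmt"

// fetchPage stands in for one _msearch page; it returns up to pageSize span
// IDs starting at offset (the real code uses search_after instead of offsets).
func fetchPage(allSpans []int, offset, pageSize int) []int {
	if offset >= len(allSpans) {
		return nil
	}
	end := offset + pageSize
	if end > len(allSpans) {
		end = len(allSpans)
	}
	return allSpans[offset:end]
}

func main() {
	const totalSpans = 80000 // spans actually stored for the trace
	const pageSize = 10000   // --es.max-doc-count default

	// hits.total as reported WITH terminate_after on a 4-shard index:
	// every shard terminates after 10'000 docs, so the reported total is
	// 40'000 even though the trace has 80'000 spans.
	reportedTotal := 4 * pageSize

	allSpans := make([]int, totalSpans)
	var collected []int
	for len(collected) < reportedTotal { // stop condition uses the wrong total
		page := fetchPage(allSpans, len(collected), pageSize)
		if len(page) == 0 {
			break
		}
		collected = append(collected, page...)
	}
	fmt.Printf("collected %d of %d spans\n", len(collected), totalSpans)
	// prints: collected 40000 of 80000 spans
}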

Example of how terminate_after behaves with hits total:

GET staging-jaeger-span-read/_search
{
  "terminate_after": 5
}

{
  "took": 24,
  "timed_out": false,
  "terminated_early": true,
  "_shards": {
    "total": 51,
    "successful": 51,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 230,
      "relation": "eq"
    },
    "max_score": 1,
    "hits":  [...]
  }
}
GET staging-jaeger-span-read/_search

{
  "took": 33,
  "timed_out": false,
  "_shards": {
    "total": 51,
    "successful": 51,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1,
    "hits": [...]
  }
}

Possible solutions I can think of:

  1. Don't set terminate_after or set it to 0 - could lead to an increased query time per msearch request
  2. Get the total hits per trace with a single _search and track_total_hits once beforehand and match against the correct total hits value (see the sketch below) - additional time spent due to the extra request.
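A minimal Go sketch of option 2, assuming an Opensearch/Elasticsearch endpoint on localhost:9200; the index name, URL and trace ID are placeholders taken from the examples above, not Jaeger code:

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// One extra _search with track_total_hits to learn the exact span count
	// of the trace before the msearch loop starts.
	body := []byte(`{
	  "size": 0,
	  "track_total_hits": true,
	  "query": { "term": { "traceID": "8fe4c148f09b017c" } }
	}`)

	resp, err := http.Post(
		"http://localhost:9200/staging-jaeger-span-read/_search", // placeholder
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var result struct {
		Hits struct {
			Total struct {
				Value    int64  `json:"value"`
				Relation string `json:"relation"`
			} `json:"total"`
		} `json:"hits"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}
	// With track_total_hits the relation is "eq", so this value could serve
	// as the loop's real termination target instead of the capped total.
	fmt.Println("exact span count:", result.Hits.Total.Value)
}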

Jaeger backend version

No response

SDK

No response

Pipeline

No response

Storage backend

Opensearch 2.6.0

Operating system

Linux

Deployment model

Kubernetes

Deployment configs

No response

yurishkuro commented 1 year ago

Ingest a trace with more than 40k spans and leave --es.max-doc-count untouched

If max-doc-count is not set then the default in the code is 10'000, so I think this behavior is expected.

Jakob3xD commented 1 year ago

If max-doc-count is not set then the default in the code is 10'000, so I think this behavior is expected.

The max doc count is set to 10000 because the Opensearch max search size is 10000. So by default Opensearch can't return more than 10000 hits in a single request. That is why the msearch is done in a loop, so it gets all spans. But this is broken. Right now you will get 40000 spans max due to the terminate_after.

Jakob3xD commented 1 year ago

After further debugging I noticed that terminate_after is acting weird. Setting the TotalHits to a static int (the real total hits) shows that randomly after the 4th or 5th msearch the hits will be empty, resulting in the rest of the trace being skipped and incomplete spans being returned.

The log message of the last msearch with no returned hits:

{"level":"info","ts":1679569710.7285984,"caller":"zapgrpc/zapgrpc.go:130","msg":"HTTP/1.1 200 OK\r\nContent-Length: 276\r\nConnection: keep-alive\r\nContent-Type: application/json; charset=UTF-8\r\nDate: Thu, 23 Mar 2023 11:08:30 GMT\r\nServer: nginx\r\nStrict-Transport-Security: max-age=15768000\r\n\r\n{\"took\":26,\"responses\":[{\"took\":25,\"timed_out\":false,\"terminated_early\":true,\"num_reduce_phases\":6,\"_shards\":{\"total\":46,\"successful\":46,\"skipped\":0,\"failed\":0},\"_clusters\":{\"total\":5,\"successful\":5,\"skipped\":0},\"hits\":{\"total\":40000,\"max_score\":0.0,\"hits\":[]},\"status\":200}]}\n"}

After more research I found out why I am getting 40'000 total hits with terminate_after. terminate_after works on shard level. As I have 4 shards and the default max doc count is set to 10'000, it searches each of the 4 shards for up to 10'000 hits and then returns a total max hit count of 40'000.

My theory for the issue above is the following: assume a 2 hour time window on my shards, where the first hour is located within the first 10'000 docs of each shard and the second hour is beyond the 10'000 docs mark. terminate_after always limits the search to the first 10'000 docs of each shard. So when the msearch progresses and moves search_after further in time, we get out of the first hour window, resulting in no hits being returned because terminate_after prevents it.

Therefore I would suggest removing terminate_after by removing line 208; a rough sketch of what that change amounts to follows below.
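Rough sketch of the suggested change (not the actual reader.go code and not a PR diff; the name buildSource is hypothetical): build the per-page search source without the TerminateAfter clause, keeping Size and SearchAfter so pagination still happens in max-doc-count sized pages.

package sketch

import "github.com/olivere/elastic"

// buildSource assembles the search source for one page of the trace query.
func buildSource(query elastic.Query, maxDocCount int, nextTime uint64) *elastic.SearchSource {
	return elastic.NewSearchSource().
		Query(query).
		Size(maxDocCount).       // page size stays capped at --es.max-doc-count
		Sort("startTime", true). // same sort the search_after pagination relies on
		SearchAfter(nextTime)    // cursor taken from the previous page
	// previously the source also had .TerminateAfter(maxDocCount),
	// which is what line 208 adds today and what this suggestion removes
}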

yurishkuro commented 1 year ago

Therefore I would suggest removing terminate_after

But earlier you said:

The max doc count is set to 10000 because the Opensearch max search size is 10000. So by default Opensearch can't return more than 10000 hits in a single request.

So wouldn't you still have the exact same limitation?

Jakob3xD commented 1 year ago

So wouldn't you still have the exact same limitation?

No. Jaeger does a form of pagination by moving its search start point with the search_after parameter, so it loops through all results. The issue with terminate_after is that it only ever makes the first 10000 documents of each index shard reachable.

Example: 1 Index with 4 Shards, each shard has 20000 documents. So a total of 80000 documents. With terminate_after:

  1. The first search gives you 10000 documents. Let's assume each shard returns the same amount of docs, so each shard returns 2500 documents.
  2. Second search with search_after set to the time of the oldest document of the first search. Each shard returns 2500 docs.
  3. Third search with search_after set to the time of the oldest document of the second search. Each shard returns 2500 docs.
  4. Fourth search with search_after set to the time of the oldest document of the third search. Each shard returns 2500 docs.
  5. Fifth search with search_after set to the time of the oldest document of the fourth search. Each shard returns 0 docs, as terminate_after cuts off after the first 10000 docs of the shard. The remaining 40000 docs are not searchable.

Without terminate_after it would continue looping until all 80000 docs have been searched (see the toy simulation below).
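A toy Go simulation of the scenario above, not how Elasticsearch/Opensearch is implemented internally: each shard only exposes its first terminate_after docs to the query, so search_after pagination runs dry after 40'000 of the 80'000 docs.

package main

import (
	"fmt"
	"sort"
)

// searchPage models one _search request over all shards: every shard only
// contributes its first terminateAfter docs, the cursor (after) models
// search_after, and the merged result is cut to pageSize.
func searchPage(shards [][]int, after, pageSize, terminateAfter int) []int {
	var visible []int
	for _, shard := range shards {
		n := len(shard)
		if terminateAfter > 0 && terminateAfter < n {
			n = terminateAfter
		}
		for _, ts := range shard[:n] {
			if ts > after {
				visible = append(visible, ts)
			}
		}
	}
	sort.Ints(visible)
	if len(visible) > pageSize {
		visible = visible[:pageSize]
	}
	return visible
}

func main() {
	// 4 shards with 20'000 docs each (80'000 total), identified by an
	// increasing "startTime" value spread round-robin across the shards.
	shards := make([][]int, 4)
	for s := range shards {
		for i := 0; i < 20000; i++ {
			shards[s] = append(shards[s], i*4+s)
		}
	}

	after, fetched := -1, 0
	for {
		page := searchPage(shards, after, 10000, 10000)
		if len(page) == 0 {
			break
		}
		fetched += len(page)
		after = page[len(page)-1]
	}
	fmt.Println("docs reachable with terminate_after:", fetched) // 40000
	// running the same loop with terminateAfter = 0 reaches all 80'000 docs
}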

yurishkuro commented 1 year ago

Not sure I follow step 5. If the looping is done on the client side, why wouldn't it continue in 2500-doc chunks? How does ES know that 10k docs from each shard were already loaded? Isn't every request in the loop independent? Or is there some kind of session in play?

Jakob3xD commented 1 year ago

Imagine you have 4 arrays of dicts/hashes with 20000 entries each. Those are our shards. When searching with terminate_after, it takes the first 10000 entries of each array, puts them into a tmp array and sorts them by a time field in the dict. This tmp array is the maximum amount of search results you can get with pagination/search_after.

Without terminate_after it would copy all 20000 entries of each array into the tmp array and sort it. You can then use pagination/search_after to get them.

This is just an example to explain it.

The loop of the client ends because the server is not returning any results due to the limit of terminate_after. terminate_after does not move when setting search_after; it always uses the first 10000 entries of the array. So the remaining 10000 entries are not searchable.

Does this explain it? Otherwise I could try giving you a real example tomorrow.

yurishkuro commented 1 year ago

When searching with terminate_after, it takes the first 10000 entries of each array, puts them into a tmp array and sorts them by a time field in the dict.

What are you basing this assumption on? That sounds like a pretty naive / stupid implementation to have in the database. ES builds inverted indices at insert time, so you don't have to sort at query time. So when query4 is sent with a ts >= T3max condition, the 10k limit should apply only to the entries in the new time range; it should not include previous entries.

Jakob3xD commented 1 year ago

It was not an exact implementation example; I just wanted to explain it in a simple way. I don't know exactly how terminate_after works. All I know is that it is broken and prevents the Jaeger query service from handling traces with a large number of spans.

I just noticed that the search results with terminate_after are not consistent, which might be the cause of my issue. Using terminate_after returns different hits each time. The "sort" field of a hit is the "startTime" field of the document. As you can see, it changes between runs with terminate_after and stays the same without it.

# With terminate_after
18:20:38-jakob ~ >  curl https://localhost/hc_staging-jaeger-span-2023-04-26/_search -H 'Content-Type: application/json' -d '{"query":{"bool":{"must":[{"term":{"traceID":"8fe4c148f09b017c"}},{"range":{"startTimeMillis":{"from":102606165213,"include_lower":true,"include_upper":true,"to":16825464005213}}}]}},"size":100,"sort":[{"startTime":{"order":"asc"}}],"terminate_after":100}' -s | jq ".hits.hits[-1].sort"
[
  1682481922107463
]
18:20:47-jakob ~ >  curl https://localhost/hc_staging-jaeger-span-2023-04-26/_search -H 'Content-Type: application/json' -d '{"query":{"bool":{"must":[{"term":{"traceID":"8fe4c148f09b017c"}},{"range":{"startTimeMillis":{"from":102606165213,"include_lower":true,"include_upper":true,"to":16825464005213}}}]}},"size":100,"sort":[{"startTime":{"order":"asc"}}],"terminate_after":100}' -s | jq ".hits.hits[-1].sort"
[
  1682481922203986
]

# Without terminate_after
18:20:50-jakob ~ >  curl https://localhost/hc_staging-jaeger-span-2023-04-26/_search -H 'Content-Type: application/json' -d '{"query":{"bool":{"must":[{"term":{"traceID":"8fe4c148f09b017c"}},{"range":{"startTimeMillis":{"from":102606165213,"include_lower":true,"include_upper":true,"to":16825464005213}}}]}},"size":100,"sort":[{"startTime":{"order":"asc"}}]}' -s | jq ".hits.hits[-1].sort"
[
  1682481922087869
]
18:21:07-jakob ~ >  curl https://localhost/hc_staging-jaeger-span-2023-04-26/_search -H 'Content-Type: application/json' -d '{"query":{"bool":{"must":[{"term":{"traceID":"8fe4c148f09b017c"}},{"range":{"startTimeMillis":{"from":102606165213,"include_lower":true,"include_upper":true,"to":16825464005213}}}]}},"size":100,"sort":[{"startTime":{"order":"asc"}}]}' -s | jq ".hits.hits[-1].sort"
[
  1682481922087869
]

You can test it yourself, or change line 52 in plugin/storage/integration/elasticsearch_test.go to 1_000 and the GetLargeSpans test will fail because it only returns 5000 hits, as you have 5 shards and use terminate_after. If you remove the TerminateAfter like in my PR, the test works again.

Do I need to paste each search Jaeger currently does to show you its results, or is the evidence above enough?