Closed Jakob3xD closed 1 year ago
Ingest a trace with more than 40k spans and leave `--es.max-doc-count` untouched.
If max-doc-count is not set then the default in the code is 10'000, so I think this behavior is expected.
> If max-doc-count is not set then the default in the code is 10'000, so I think this behavior is expected.
The max doc count is set to 10000 because the Opensearch max search size is 10000. So by default Opensearch can't return more than 10000 hits in a single request. That is why the msearch is doing a loop, so it gets all spans. But this is broken. Right now you will get 40000 spans at most due to `terminate_after`.
After further debugging I noticed that `terminate_after` is acting weird.
Setting the TotalHits to a static int (the real total hits) shows that, randomly after the 4th or 5th msearch, the hits will be empty, causing the trace to be skipped and incomplete spans to be returned.
The log message of the last msearch with no returned hits:
{"level":"info","ts":1679569710.7285984,"caller":"zapgrpc/zapgrpc.go:130","msg":"HTTP/1.1 200 OK\r\nContent-Length: 276\r\nConnection: keep-alive\r\nContent-Type: application/json; charset=UTF-8\r\nDate: Thu, 23 Mar 2023 11:08:30 GMT\r\nServer: nginx\r\nStrict-Transport-Security: max-age=15768000\r\n\r\n{\"took\":26,\"responses\":[{\"took\":25,\"timed_out\":false,\"terminated_early\":true,\"num_reduce_phases\":6,\"_shards\":{\"total\":46,\"successful\":46,\"skipped\":0,\"failed\":0},\"_clusters\":{\"total\":5,\"successful\":5,\"skipped\":0},\"hits\":{\"total\":40000,\"max_score\":0.0,\"hits\":[]},\"status\":200}]}\n"}
After more research I found why I am getting 40'000 total hits with `terminate_after`: `terminate_after` works on the shard level. As I have 4 shards and the default max doc count is set to 10'000, it will search each of the 4 shards for 10'000 hits and then return a total max hit count of 40'000.
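The arithmetic can be checked with a toy model (an illustration only, not how Lucene actually implements it):

```python
# Toy model: terminate_after is applied per shard, so the reported
# hits.total is capped at terminate_after * number_of_shards.
def reported_total(num_shards: int, docs_per_shard: int, terminate_after: int) -> int:
    # Each shard stops counting once it has visited terminate_after docs.
    return sum(min(docs_per_shard, terminate_after) for _ in range(num_shards))

# 4 shards with plenty of matching docs and the default max doc count:
print(reported_total(num_shards=4, docs_per_shard=50_000, terminate_after=10_000))
# -> 40000, matching the hits.total from the log above
```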
My theory for the issue above is the following: suppose I have a 2 hour time window on my shards, the first hour is located within the first 10'000 docs of each shard, and the second hour is beyond the 10'000 doc mark. `terminate_after` always limits the search to the first 10'000 docs of each shard. So when the msearch progresses and moves the search_after cursor further in time, we leave the first hour window and no hits are returned, because `terminate_after` prevents it.
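That theory can be simulated with made-up data (everything below is hypothetical; `terminate_after` is modeled as "each shard only ever exposes its first N docs"):

```python
# Hypothetical simulation: 4 shards, 20_000 docs each (80_000 total),
# each sorted by a unique startTime. terminate_after=10_000 means only
# the first 10_000 docs per shard are visible, wherever search_after points.
TERMINATE_AFTER = 10_000
PAGE_SIZE = 2_500
SHARDS = [list(range(s, 80_000, 4)) for s in range(4)]  # interleaved startTimes

def search(search_after: int) -> list:
    visible = sorted(t for shard in SHARDS for t in shard[:TERMINATE_AFTER])
    return [t for t in visible if t > search_after][:PAGE_SIZE]

cursor, fetched = -1, 0
while True:
    page = search(cursor)
    if not page:
        break
    fetched += len(page)
    cursor = page[-1]  # move search_after further in time

print(fetched)  # 40000 -- the loop stops here; the other 40_000 matching
                # docs are unreachable because they sit past the per-shard cut
```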
Therefore I would suggest removing the `terminate_after` by removing line 208.
> Therefore I would suggest removing the `terminate_after`
But earlier you said:
> The max doc count is set to 10000 because the Opensearch max search size is 10000. So by default Opensearch can't return more than 10000 hits in a single request.
So wouldn't you still have the exact same limitation?
> So wouldn't you still have the exact same limitation?
No. Jaeger does a sort of pagination by moving its search start point with the search_after parameter, so it loops through all results. The issue with `terminate_after` is that it will always return only the first 10000 documents of the index shard.
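The loop Jaeger runs can be sketched like this (a simplified stand-in with an in-memory "index"; the real code lives in plugin/storage/es/spanstore/reader.go):

```python
# Fake index of 25_000 spans sorted by startTime.
DOCS = [{"spanID": i, "startTime": 1_000_000 + i} for i in range(25_000)]

def es_search(search_after=None, size=10_000):
    # Stand-in for the _msearch request: results sorted by startTime,
    # resuming after the search_after cursor.
    hits = [d for d in DOCS if search_after is None or d["startTime"] > search_after]
    return hits[:size]

def fetch_all_spans():
    spans, cursor = [], None
    while True:
        page = es_search(search_after=cursor)
        if not page:
            break  # server returned no hits -> the loop ends
        spans.extend(page)
        cursor = page[-1]["startTime"]  # move the search window forward
    return spans

print(len(fetch_all_spans()))  # 25000 -- all spans, fetched 10_000 at a time
```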
Example: 1 index with 4 shards, each shard holding 20000 documents, so a total of 80000 documents.

With `terminate_after`, the search terminates after the first 10000 docs in each shard, so the remaining 40000 docs are not searchable. Without `terminate_after`, it would continue looping until all 80000 docs got searched.
not sure I follow about step 5. If the looping is done on the client side, why wouldn't it continue in 2500 chunks? How does ES know that 10k docs from each shard were already loaded? Isn't every request in a loop independent? Or is this some kind of session in play?
Imagine you have 4 arrays of dicts/hashes with 20000 entries each. Those are our shards. When searching with terminate_after, it will get the first 10000 entries of each array, put them into a tmp array and sort them by a time field in the dict. This tmp array is the maximum set of search results you can get with pagination/search_after.
Without terminate_after, it would copy all 20000 entries of each array into the tmp array and sort it. You can then use pagination/search_after to get them all.
This is just an example to explain it.
The loop of the client ends because the server is not returning any results due to the limit of terminate_after. terminate_after does not move when setting search_after; it always uses the first 10000 entries of the array, so the remaining 10000 entries are not searchable.
Does this explain it? Otherwise I could try giving you a real example tomorrow.
> When searching with terminate_after, it will get the first 10000 entries of each array, put them into a tmp array and sort them by a time field in the dict.
What are you basing this assumption on? That sounds like a pretty naive / stupid implementation to have in the database. ES builds inverted indices at insert time, so you don't have to sort at query time. So when query4 is sent with the condition ts >= T3max, the 10k limit should apply only to the entries in the new time range; it should not include previous entries.
It was not an exact implementation example. I just wanted to explain in a simple way. I don't know the exact way how terminate_after works. All I know is that it is broken and prevents jaeger query from handling traces with a large amount of spans.
I just noticed that the search results with terminate_after are not consistent, which might cause my issue.
Using terminate_after returns different hits each time. The "sort" field of a hit is the "startTime" field of the document.
As you can see, it changes with terminate_after and is consistent without it.
# With terminate_after
18:20:38-jakob ~ > curl https://localhost/hc_staging-jaeger-span-2023-04-26/_search -H 'Content-Type: application/json' -d '{"query":{"bool":{"must":[{"term":{"traceID":"8fe4c148f09b017c"}},{"range":{"startTimeMillis":{"from":102606165213,"include_lower":true,"include_upper":true,"to":16825464005213}}}]}},"size":100,"sort":[{"startTime":{"order":"asc"}}],"terminate_after":100}' -s | jq ".hits.hits[-1].sort"
[
1682481922107463
]
18:20:47-jakob ~ > curl https://localhost/hc_staging-jaeger-span-2023-04-26/_search -H 'Content-Type: application/json' -d '{"query":{"bool":{"must":[{"term":{"traceID":"8fe4c148f09b017c"}},{"range":{"startTimeMillis":{"from":102606165213,"include_lower":true,"include_upper":true,"to":16825464005213}}}]}},"size":100,"sort":[{"startTime":{"order":"asc"}}],"terminate_after":100}' -s | jq ".hits.hits[-1].sort"
[
1682481922203986
]
# Without terminate_after
18:20:50-jakob ~ > curl https://localhost/hc_staging-jaeger-span-2023-04-26/_search -H 'Content-Type: application/json' -d '{"query":{"bool":{"must":[{"term":{"traceID":"8fe4c148f09b017c"}},{"range":{"startTimeMillis":{"from":102606165213,"include_lower":true,"include_upper":true,"to":16825464005213}}}]}},"size":100,"sort":[{"startTime":{"order":"asc"}}]}' -s | jq ".hits.hits[-1].sort"
[
1682481922087869
]
18:21:07-jakob ~ > curl https://localhost/hc_staging-jaeger-span-2023-04-26/_search -H 'Content-Type: application/json' -d '{"query":{"bool":{"must":[{"term":{"traceID":"8fe4c148f09b017c"}},{"range":{"startTimeMillis":{"from":102606165213,"include_lower":true,"include_upper":true,"to":16825464005213}}}]}},"size":100,"sort":[{"startTime":{"order":"asc"}}]}' -s | jq ".hits.hits[-1].sort"
[
1682481922087869
]
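One plausible mechanism for this flakiness (an assumption on my part, not verified against the Lucene source): terminate_after stops collecting after N documents in whatever order the shard happens to visit them, and only then sorts what it collected, so a different visit order yields different hits:

```python
# Hypothetical model: a shard stores 1_000 docs identified by startTime.
DOCS = list(range(1_000))

def search(visit_order, terminate_after=None, size=100):
    # Collect docs in visit order, cut at terminate_after, then sort.
    collected = visit_order[:terminate_after] if terminate_after else list(visit_order)
    return sorted(collected)[:size]

order_a = DOCS                  # one possible visit order
order_b = list(reversed(DOCS))  # another possible visit order

# With terminate_after, the last hit's sort value depends on visit order:
print(search(order_a, terminate_after=100)[-1])  # 99
print(search(order_b, terminate_after=100)[-1])  # 999
# Without terminate_after, both orders return identical results:
print(search(order_a)[-1] == search(order_b)[-1])  # True
```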
You can test it yourself, or change line 52 in plugin/storage/integration/elasticsearch_test.go to 1_000, and your GetLargeSpans test will fail because it only returns 5000 hits, as you have 5 shards and use terminate_after. If you remove the TerminateAfter like in my PR, your test will work again.
Do I need to paste each search Jaeger currently does to show you the results, or is the evidence from above enough?
What happened?
The Jaeger query is not resolving all spans with an Elasticsearch/Opensearch backend. This happens because the `_msearch` request uses the `terminate_after` option in the body, which results in a wrong `hits.total` value. Due to the wrong `hits.total` value, the search loop stops before getting all spans.

Steps to reproduce
Ingest a trace with more than 40k spans and leave `--es.max-doc-count` untouched.

Expected behavior
I would expect Jaeger to resolve all spans of the trace.

Relevant log output
No response

Screenshot
No response

Additional context
`terminate_after` is set to the max doc count: https://github.com/jaegertracing/jaeger/blob/main/plugin/storage/es/spanstore/reader.go#L208
The total hits value is compared to the length of all saved trace span hits: https://github.com/jaegertracing/jaeger/blob/main/plugin/storage/es/spanstore/reader.go#L432
Example of how `terminate_after` behaves with hits total:
Possible solutions I can think of:
- `_search` with `track_total_hits` once beforehand and match against the correct total hits value - additional time spent due to the extra request.

Jaeger backend version
No response
SDK
No response
Pipeline
No response
Storage backend
Opensearch 2.6.0
Operating system
Linux
Deployment model
Kubernetes
Deployment configs
No response