jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

Elasticsearch query with large number of spans can miss spans or produce duplicates #1739

Closed pavolloffay closed 5 months ago

pavolloffay commented 5 years ago

The ES query uses a sorted query with the search_after parameter on the timeStamp field. However, if multiple documents share the same timestamp, the query can produce duplicate or missing results.

A field with one unique value per document should be used as the tiebreaker of the sort specification. Otherwise the sort order for documents that have the same sort values would be undefined. The recommended way is to use the field _uid which is certain to contain one unique value for each document.
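As a sketch of what that tiebreaker would look like in the query body (the field names are illustrative rather than Jaeger's actual span-index mapping, and the search_after values are placeholders):

```json
{
  "size": 10000,
  "sort": [
    { "startTime": { "order": "asc" } },
    { "_uid": { "order": "asc" } }
  ],
  "search_after": [1566223200000000, "span#last-doc-id"]
}
```

With a unique second sort key, search_after resumes from an unambiguous position even when many spans share the same timestamp.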

Search after was introduced in https://github.com/jaegertracing/jaeger/pull/696

Steps to reproduce:

  1. report 11000 spans with the same timestamp
  2. query the result using Jaeger query (http://localhost:9200/jaeger-span-2019-08-19/_count can be used to verify spans are stored in ES)
  3. the query will show fewer spans than were reported.

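The failure mode in step 3 can be sketched without a live cluster. Below is a minimal simulation of search_after-style paging (hypothetical names, not Jaeger's actual ES reader code): sorting only on a non-unique timestamp silently drops every not-yet-returned document that shares the last page's timestamp, while adding a unique tiebreaker makes the paging lossless.

```python
# Minimal sketch (not Jaeger's actual reader): simulate search_after paging
# over documents whose sort key is a non-unique timestamp.
docs = [{"timestamp": i // 10, "id": i} for i in range(25)]  # 3 timestamp groups

def fetch_all(docs, sort_key, size=7):
    """Page through docs, resuming strictly *after* the last sort value seen,
    which is how search_after behaves."""
    results, after = [], None
    while True:
        ordered = sorted(docs, key=sort_key)
        if after is not None:
            ordered = [d for d in ordered if sort_key(d) > after]
        page = ordered[:size]
        if not page:
            return results
        results.extend(page)
        after = sort_key(page[-1])

# Sorting on timestamp alone: resuming "after timestamp T" skips every
# remaining document that also has timestamp T, so spans go missing.
lossy = fetch_all(docs, sort_key=lambda d: d["timestamp"])

# Adding a unique field as tiebreaker makes the cursor unambiguous.
lossless = fetch_all(docs, sort_key=lambda d: (d["timestamp"], d["id"]))

print(len(lossy), len(lossless))  # lossy < 25, lossless == 25
```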
RashmiRam commented 3 years ago

@pavolloffay This problem is most visible when viewing a single trace with more than 10k spans. The view is completely distorted if a few of the key parent spans are missed during fetching. Could we employ this tiebreaker approach when the request comes through the GetTrace API, where a single trace is requested? Wouldn't the performance concern be diluted when viewing a single trace? Please let me know what you think.

yurishkuro commented 3 years ago

> if a few of the key parent spans are missed during fetching

Why would they be missing, because of the 10k limit on the result size? Why not just increase the limit?

RashmiRam commented 3 years ago

> if a few of the key parent spans are missed during fetching
>
> Why would they be missing, because of the 10k limit on the result size? Why not just increase the limit?

Yes, they are missing because of the 10k limit. There are cases where we have more than 50k spans per trace 😅. (I understand that 50k spans per trace is not ideal, but we have multiple consumers for a single producer, and one request can produce multiple events which are then consumed by every consumer. Each consumer adds its own share of spans, hence the high span count per trace in a few cases.) We are a little hesitant to increase the size in the ES query given the huge volume of data. Would you recommend increasing the size even if there are >50k spans per trace?

yurishkuro commented 3 years ago

Unless we do a fundamental re-architecture of how the ES backend stores spans, I don't see what else can be done other than increasing the limit. It's not unheard of to have >50K spans; it's pretty common, actually. If you legitimately have such large traces and your limit is below that size, I don't think any reasonable query scheme will avoid broken traces (and I just don't think it's a problem worth solving; storage should be able to return the full trace).

You may want to test how the ES cluster would react to 50K spans after you raise the limit, e.g. there is a risk of causing OOMs on the coordinating nodes. I don't know if ES supports some sort of pagination for these use cases.
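For reference, the 10k cap discussed above corresponds to Elasticsearch's index.max_result_window setting, which can be raised per index; a corresponding Jaeger-side limit (e.g. es.max-num-spans) may also need raising. A sketch of the settings body, which would be PUT to the span indices (the index pattern and the value 60000 are illustrative):

```json
{
  "index": {
    "max_result_window": 60000
  }
}
```

As noted above, raising this window increases memory pressure on the coordinating nodes, so it is worth load-testing before applying in production.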

RashmiRam commented 3 years ago

Thanks @yurishkuro for the detailed response. Yes. We can try increasing the size and test to see how ES is handling this.

jkowall commented 5 months ago

Stale issue, doesn't seem like a bug.