Closed pavolloffay closed 5 months ago
@pavolloffay This problem is more visible when viewing a single trace with more than 10k spans. The view is totally distorted if a few of the key parent spans were missed during fetching. Shall we employ this tiebreaker approach when the request comes from the GetTrace API, where a single trace is requested? Won't the perf problem be diluted when viewing a single trace? Please let me know what you think.
if a few of the key parent spans were missed during fetching
Why would they be missing? Because of the 10k limit on the result size? Why not just increase the limit?
Yes, they are missing because of the 10k limit. There are cases with more than 50k spans per trace 😅. (I understand that 50k spans per trace is not ideal. But we have multiple consumers for a single producer, and one request can produce multiple events that are in turn consumed by every consumer. Each consumer adds its own share of spans, hence the higher span count per trace in a few cases.) We are a little hesitant to increase the size in the ES query given the huge volume of data. Would you recommend increasing the size even if there are >50k spans per trace?
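For reference, the 10k ceiling typically comes from two places: Jaeger's query-side document cap and Elasticsearch's own `index.max_result_window` setting. A sketch of raising both follows; the 50000 value is only an example, and the Jaeger flag name should be verified against your Jaeger version's documentation:

```
# Jaeger query service flag (verify the exact name for your version):
--es.max-doc-count=50000

# Elasticsearch index setting (default is 10000):
PUT jaeger-span-*/_settings
{
  "index": { "max_result_window": 50000 }
}
```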
Unless we do a fundamental re-architecture of how the ES backend stores spans, I don't see what else can be done other than increasing the limit. It's not unheard of to have >50k spans; it's pretty common, actually. If you legitimately have these large traces and your limit is below that size, I don't think there's any reasonable query scheme that won't result in broken traces (I just don't think it's a problem worth solving; storage should be able to return the full trace).
You may want to test how the ES cluster would react to 50K spans after you raise the limit, e.g. there is a risk of causing OOMs on the coordinating nodes. I don't know if ES supports some sort of pagination for these use cases.
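ES does support deep pagination via `search_after` (and point-in-time searches in later versions), which avoids loading the whole result set on the coordinating node. A hedged sketch of a paginated span query follows; the trace ID and the sort values are made up, and the field names (`startTime`, `spanID`) are assumptions about the span index layout that should be verified:

```
GET jaeger-span-*/_search
{
  "size": 10000,
  "query": { "term": { "traceID": "abc123" } },
  "sort": [
    { "startTime": "asc" },
    { "spanID": "asc" }
  ],
  "search_after": [1638360000000000, "00f067aa0ba902b7"]
}
```

Note that the second sort field acts as a unique tiebreaker, which is what prevents duplicates or skipped documents when many spans share a timestamp.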
Thanks @yurishkuro for the detailed response. Yes, we can try increasing the size and test how ES handles it.
Stale issue, doesn't seem like a bug.
The ES query uses a sorted query with the `search_after` parameter on the `timeStamp` field. However, if there are multiple spans with the same timestamp, the query can produce duplicate or missing results. `search_after` was introduced in https://github.com/jaegertracing/jaeger/pull/696.
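The root cause can be demonstrated with a small in-memory simulation (plain Python, not real ES client code; the document shape and field names are illustrative): paginating with `search_after` on a non-unique sort key skips documents that share the last timestamp of a page, while adding a unique tiebreaker field makes pagination exact.

```python
# Simulated documents: six spans, several sharing the same timestamp.
docs = [{"id": i, "ts": ts} for i, ts in enumerate([1, 1, 1, 2, 2, 3])]

def paginate(sort_key, page_size=2):
    """Emulate search_after: each page returns documents whose sort
    values are strictly greater than those of the last document seen."""
    seen, after = [], None
    while True:
        candidates = sorted(docs, key=sort_key)
        if after is not None:
            candidates = [d for d in candidates if sort_key(d) > after]
        page = candidates[:page_size]
        if not page:
            return seen
        seen.extend(page)
        after = sort_key(page[-1])

# Sorting by timestamp alone: the third span with ts=1 is skipped,
# because "after ts=1" jumps past all remaining ts=1 documents.
by_ts_only = paginate(lambda d: (d["ts"],))

# Sorting by (timestamp, unique id): every document is returned once.
by_ts_and_id = paginate(lambda d: (d["ts"], d["id"]))

print(len(by_ts_only), len(by_ts_and_id))  # → 5 6
```

The same reasoning applies to the real query: adding a unique field (such as the document `_id`) as a secondary sort key is the standard way to make `search_after` pagination deterministic.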
Steps to reproduce: