jaegertracing / jaeger

CNCF Jaeger, a Distributed Tracing Platform
https://www.jaegertracing.io/
Apache License 2.0

Unstable Jaeger UI (query) performance and lacking insight into the 'why' #2693

Open wiardvanrij opened 3 years ago

wiardvanrij commented 3 years ago

Describe the bug

In Jaeger UI, select a random service and just press "Find Traces". Personally, I experience the following:

I also noticed that:

Expected behavior

I would expect to see ES struggling to respond to the query requests, but that is not the case. Therefore I would assume query times should be quite fast. I would also expect consistency between requests. I'm aware of caching, but I feel it is irrelevant to the behaviour I'm expecting.

Stack

5x i3.2xlarge (8 CPU, 31GB memory, 1900GB NVMe)

-> Kubernetes Jaeger collector -> Kafka @ AWS -> Kubernetes Jaeger ingester -> ES @ AWS -> Kubernetes Jaeger query/UI

~7k spans/s ingestion, with peaks to 22k/s. 15 shards, 1 replica (I have also tried 5 shards, 1 replica). Rollover indices at 150GB. 14-day retention. So in general roughly 90 billion records, but I am currently testing with ~1-10 billion records.

Extra

I wish there was more trace information from Jaeger query, for instance including the ES request that was made. With all due respect, the tracing information from Jaeger does not provide insight into the trace stack.

albertteoh commented 3 years ago

I wish there was more trace information from Jaeger query, for instance including the ES request that was made. With all due respect, the tracing information from Jaeger does not provide insight into the trace stack.

I hear you @wiardvanrij, I have the same wish :). I can take a look at enabling logging of ES queries.

As for the slow Jaeger UI queries, if I'm not mistaken, Jaeger UI will download all trace data (including spans) when "Find Traces" is clicked, so it can take some time.

How far are you looking back? I believe ES may take a little longer when querying across multiple indices.

We can try executing the same "Find Traces" query directly against Elasticsearch, for example using curl:

# Set these to the values used in the slow Jaeger UI query you observed:
LIMIT_RESULTS=20                            # "Limit Results" in the UI
NOW_MICROS=$(( $(date +%s) * 1000000 ))     # current time in epoch microseconds
LOOKBACK_MICROS=$(( 3600 * 1000000 ))       # lookback window, e.g. 1 hour
SERVICE_NAME="my-service"

curl -XPOST http://127.0.0.1:9200/jaeger-span-2020-12-17/_search \
  -H 'Content-Type: application/json' \
  -d @- <<EOF
{
  "aggregations": {
    "traceIDs": {
      "aggregations": {
        "startTime": {
          "max": {
            "field": "startTime"
          }
        }
      },
      "terms": {
        "field": "traceID",
        "order": [
          {
            "startTime": "desc"
          }
        ],
        "size": $LIMIT_RESULTS
      }
    }
  },
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "startTime": {
              "from": $(( NOW_MICROS - LOOKBACK_MICROS )),
              "include_lower": true,
              "include_upper": true,
              "to": $NOW_MICROS
            }
          }
        },
        {
          "match": {
            "process.serviceName": {
              "query": "$SERVICE_NAME"
            }
          }
        }
      ]
    }
  },
  "size": 0
}
EOF

The variable values at the top of the script are placeholders; set them to those used in the slow Jaeger UI query you observed.

If you can see results, please send through the result metadata, as in the example below:

{"took":2198,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":5437,"relation":"eq"},"max_score":null,"hits":[]},"aggregations":{"traceIDs":{"doc_count_error_upper_bound":-1,"sum_other_doc_count":5417
yurishkuro commented 3 years ago

The query service could use some enhanced tracing instrumentation.

albertteoh commented 3 years ago

BTW, this doesn't reflect all the ES queries made; it's just the first query, which fetches the trace IDs. Jaeger will then make follow-up requests to fetch the spans, which is probably the more time-consuming part, but a little hard to do manually (this is where logging the ES query would be useful). Still, the above query does give some indication of the number of traces we're trying to fetch, how many shards are being queried, and how long this first query took.
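
For illustration, a rough sketch of what one such follow-up span fetch could look like against the same index; the trace ID and size below are made-up placeholders, and the exact query Jaeger builds may differ:

# Hypothetical example: fetch the spans of a single trace whose ID appeared
# in the "buckets" of the first response. TRACE_ID is a placeholder value.
TRACE_ID="2f3e0cee77ff0966"
MAX_DOCS=10000

curl -XPOST http://127.0.0.1:9200/jaeger-span-2020-12-17/_search \
  -H 'Content-Type: application/json' \
  -d @- <<EOF
{
  "query": {
    "term": { "traceID": "$TRACE_ID" }
  },
  "size": $MAX_DOCS
}
EOF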

It's also worth noting that --es.max-doc-count will limit the number of documents fetched from each shard in the index. So if this is set to 1000 and the index has 5 shards, a trace with 20,000 spans will only have 5000 spans returned (with 5 ES queries made).
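
As a sketch of where that knob lives (the server URL and value here are illustrative assumptions, not from this thread), the per-shard cap can be raised when starting jaeger-query:

# Illustrative only: raise the per-shard document cap for jaeger-query.
SPAN_STORAGE_TYPE=elasticsearch jaeger-query \
  --es.server-urls=http://127.0.0.1:9200 \
  --es.max-doc-count=20000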

wiardvanrij commented 3 years ago

Well, this gives some insights lol.

{"took":25715,"timed_out":false,"_shards":{"total":30,"successful":30,"skipped":0,"failed":0},"hits":{"total":{"value":0,"relation":"eq"},"max_score":null,"hits":[]},"aggregations":{"traceIDs":{"doc_count_error_upper_bound":0,"sum_other_doc_count":0,"buckets":[]}}}```

I'm still curious whether Jaeger hits the same index, since I use a read alias because of rollover. Now I've got something to test out: why Elastic is being so slow for me.
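
One way to check which concrete indices sit behind the read alias is Elasticsearch's cat API; the alias name below is an assumption based on Jaeger's default rollover naming:

# List the indices behind the jaeger-span read alias (name assumed from
# Jaeger's default rollover setup).
curl 'http://127.0.0.1:9200/_cat/aliases/jaeger-span-read?v'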

albertteoh commented 3 years ago

I'm still curious whether Jaeger hits the same index, since I use a read alias because of rollover. Now I've got something to test out: why Elastic is being so slow for me.

Jaeger will only look back --es.max-span-age when querying read aliases, so it won't query across all indices.
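
For context, a minimal sketch of the query-side flags involved when running against rollover aliases (the URL and lookback value are illustrative, not a recommendation):

# Illustrative only: query over read aliases with a bounded lookback.
SPAN_STORAGE_TYPE=elasticsearch jaeger-query \
  --es.server-urls=http://127.0.0.1:9200 \
  --es.use-aliases=true \
  --es.max-span-age=72h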

HryhoriiDidenko commented 1 year ago

Hey, I was trying to play with the --es.max-doc-count and --es.max-span-age options and I still have the same problem. The first query often results in a long page load (20 seconds or higher).

trungtien commented 8 months ago

I have the same issue. Has anyone fixed this problem? :(