Hi, thanks for the detailed information. Agreed, it does seem like metrics are missing, since the screenshot of the search results shows spans within the same time range as the metrics query. You have already covered the basics, including enabling the local-blocks processor and `flush_to_storage`.
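For anyone else following along, those basics look roughly like the sketch below. The exact key layout can differ between Tempo versions, so treat it as an illustration rather than a verified config.

```yaml
# Sketch only: enable the local-blocks processor and flush its blocks to
# object storage so TraceQL metrics can also query historical data.
metrics_generator:
  processor:
    local_blocks:
      flush_to_storage: true      # write generator blocks to the backend (e.g. S3)

overrides:
  defaults:
    metrics_generator:
      processors: [local-blocks]  # enable the processor for tenants
```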
The next thing to check is the in-memory queue between the distributors and the generators. If the queue is full, the distributors cannot send incoming spans to the generators, so those spans never get flushed to storage. This would show up as missing metrics while search still works, because search reads the data flushed by the ingesters.
Please take a look at the following metrics and see if there are any discards:

- `tempo_distributor_forwarder_pushes_failures_total` — this is incremented when the per-tenant queue is full. If it is non-zero, the queue can be increased with the tenant override setting `metrics_generator_forwarder_queue_size` (a config sketch follows below). Typical values are 1000-5000, but you can go higher if needed.
- `tempo_distributor_metrics_generator_pushes_failures_total` — this is incremented when data couldn't be sent from the queue to the generators. If it is non-zero, it usually means the generators can't keep up and need more replicas.

Let's check these two metrics and see if either of those is happening.
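If the first counter is non-zero, the override could be raised with something like the sketch below, assuming the per-tenant overrides file format ("my-tenant" is a placeholder tenant ID; the exact structure may vary by version):

```yaml
# Sketch only: raise the per-tenant forwarder queue size.
overrides:
  "my-tenant":
    metrics_generator_forwarder_queue_size: 5000  # typical range 1000-5000
```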
- `tempo_distributor_forwarder_pushes_failures_total` is 0 for the last week.
- `tempo_distributor_metrics_generator_pushes_failures_total` has rare spikes lasting 5-10 seconds, once a day.
I also checked resources in general; all fine. I would also expect a similar picture for all spans, but it looks like the issue affects only some span kinds. I haven't noticed any issues with server spans yet; other kinds do have issues.
The screenshot was taken today (18.10) for data that is 3 days old. That data is definitely in the S3 bucket, so there should not be any queue issues. Still, the server kind is fine while 'consumer' is 0.
I also checked what "metrics from spans" generates for Prometheus:

```
sum(rate(traces_spanmetrics_calls_total{service="<service>",span_kind="SPAN_KIND_CONSUMER"}[1m]))
```

It shows reasonable values.
I can try to check what's inside the blocks in S3 with Pandas or other tools, but I don't know how to find the blocks generated by the metrics-generator and distinguish them from the ingester blocks.
OK, looks like the issue was `filter_server_spans`. I set it to false and now it works as expected (see the config sketch below).
As I understand it now, the occasional values I did see for non-server spans came from parent spans that happened to be of a non-server kind.
I hope I am the only one who was confused by the "only parent spans" part. I knew about the `filter_server_spans` setting, but was sure it wasn't my problem because I saw non-server spans in the output. Eventually it turned out I had missed the part about "only parent spans or spans with the SpanKind of server".
**Describe the bug**
We are currently testing TraceQL metrics, as it is one of the key features our developers need, and I noticed strange behavior.
We have missing values with TraceQL metrics. Sometimes there are no values from any metrics function, even when I know for sure there are spans (verified by running search without metrics functions). It looks like this (last-hour timeframe): just a short line. It does not matter whether it's `rate()` or `quantile_over_time()`.
If I remove `span.db.system="mongodb"` from the condition, I see the full hour of metrics. I am sure there is data where the metrics are empty, because if I select a timeframe with no metrics and just run a search, I easily find 500 traces (each containing multiple spans matching the condition).
**To Reproduce**
Hard to say. Usually I see broken metrics when trying to get metrics from spans of "client" or "consumer" kind. Not sure if that is related; maybe there are just fewer of them.
**Expected behavior**
If there are spans, I want to see metrics generated from them.
**Environment:**
**Additional Context**
I don't see anything suspicious in the logs. However, I have the log level set to "warn"; I'll try with "info".
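(That log-level change is roughly the following in `tempo.yaml`; just a sketch of what I plan to try:)

```yaml
# Sketch only: raise log verbosity from warn to info.
server:
  log_level: info
```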
It seems it does not matter whether the data is still on the metrics-generator or already in the backend. When I run queries for data that is definitely in S3, it looks a bit better, but usually the beginning and the last 20 minutes of the hour are still empty.
Here are parts of my tempo.yaml: