grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

404 errors when querying traces via HTTP API #3024

Closed: nayasam closed this issue 10 months ago

nayasam commented 11 months ago

Describe the bug: We are load testing Tempo (version 2.2.0), deployed in MKS/OpenShift, with K6. We ingest one trace and then read that particular trace (using its trace ID) continuously via the HTTP API, ramping up load with K6 over 30 minutes. We run these writes and reads alternately. For reads, our K6 script uses the /api/traces/?mode=ingesters HTTP API.

So our test procedure is:

  1. Send one trace with several (10-64) spans to distributor
  2. Ramp up from 1 VU to 500 VUs over 30 minutes, each sending requests for the same trace via /api/traces/?mode=ingesters
  3. Repeat ad infinitum

Block configuration params are at their defaults, e.g. max_block_bytes: 500M, max_block_duration: 30m, complete_block_timeout: 15m.

But we set replication_factor: 1.
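
A rough sketch of the relevant ingester config for this test (values as described above; the exact YAML nesting may differ slightly between Tempo versions):

    ingester:
      max_block_bytes: 500000000       # ~500M, default
      max_block_duration: 30m          # default
      complete_block_timeout: 15m      # default
      lifecycler:
        ring:
          replication_factor: 1        # the one value we changed for this test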

During the process, we observe 404 errors (i.e., a “Trace Not Found” error) when reading the trace via the HTTP API from the ingesters (i.e., /api/traces/?mode=ingesters). However, once the specific trace has been flushed to S3, we can see that the trace is in the object storage via the HTTP API (i.e., /api/traces/?mode=blocks).

Interestingly, at times, we observe that the trace is sent to a particular ingester but read from a different ingester, causing the 404 errors. Although the querier should be querying all ingesters for the trace (with the default setting), it only looks for the trace on a particular ingester.

We are using the distributed Tempo setup (on MKS/OpenShift), and have 2 distributors, 3 ingesters, 2 queriers, 2 query frontends, and 1 compactor.

joe-elliott commented 11 months ago

But we set replication_factor: 1.

We don't operate in RF1, but I have often wanted to spend time operationalizing it to see what the pain points are.

During the process, we observe 404 errors being thrown

Are these random, or once a trace starts 404ing, does it 404 consistently?

Interestingly, at times, we observe that the trace is sent to a particular ingester but read from a different ingester, causing the 404 errors. Although the querier should be querying all ingesters for the trace (with the default setting), it only looks for the trace on a particular ingester.

Why do you think they are looking at a different ingester than was written to? Are ingesters stable? Are ingesters restarting for any reason and claiming different tokens?

Apologies for the wall of questions, but you're operating Tempo in a way that I'm not familiar with (but want to learn).

cancub commented 11 months ago

Hi Joe, I'm working on the same project as nayasam.

To clarify:

But we set replication_factor: 1.

We normally operate with RF=3 and were just testing with RF=1 to confirm that we still saw the issue. We indeed see this issue with either value for replication_factor.

Why do you think they are looking at a different ingester than was written to?

I'm unable to paste a screenshot of what we're seeing (it's layered in red tape), but we can see from the various metrics being scraped from each of the components that the process goes like this:

  1. distributor writes trace to one (and only one) ingester (IngA) [tempo_distributor_spans_received_total]
  2. IngA receives from distributor [tempo_ingester_traces_created_total]
  3. a different ingester IngB receives requests for trace and successfully determines that it does not have this trace [tempo_request_duration_seconds_count]
  4. IngA flushes trace to storage backend [tempo_ingester_blocks_flushed_total]

Note that this is the same behaviour with both RF=1 and RF=3. We had assumed that we'd see writes to each of the three ingesters in (1) and reads from each of the three ingesters in (3), but that has never been the case in our experience. We've only ever seen a write to a single ingester and reads from at most two ingesters.

With regard to these questions:

Are ingesters stable? Are ingesters restarting for any reason and claiming different tokens?

Is there a metric you would recommend to monitor these state changes? We had a plan to measure health/stability via tempo_ring_members, but we've since hit a roadblock that's preventing us from running new tests/observing previous data.

mdisibio commented 11 months ago

reads from at most two ingesters

Do you happen to have query_relevant_ingesters enabled?

nayasam commented 11 months ago

We have not explicitly set query_relevant_ingesters, so we are using the default, which queries all healthy ingesters for a given trace ID. This is why we are baffled by this behavior, where not all healthy ingesters are queried for a trace ID.
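
In other words, our effective querier configuration should amount to the default, roughly the following (where exactly this option lives in the config tree may vary by Tempo version):

    querier:
      query_relevant_ingesters: false   # default: query every healthy ingester for a trace ID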

nayasam commented 11 months ago

@joe-elliott - Any updates on this issue? Also, if you need any other information to look into this issue, please let us know. Thank you!

joe-elliott commented 10 months ago

Is there a metric you would recommend to monitor these state changes? We had a plan to measure health/stability via tempo_ring_members, but we've since hit a roadblock that's preventing us from running new tests/observing previous data.

Yup. I would watch tempo_ring_members{name="ingester"} for instability. Also watch any relevant orchestration metrics to see if the ingesters are restarting unexpectedly.
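
As a sketch, a Prometheus alerting rule along these lines would surface ring instability (adjust the selector and threshold to your environment):

    groups:
      - name: tempo-ring
        rules:
          - alert: TempoIngesterRingUnhealthy
            expr: 'max(tempo_ring_members{name="ingester", state="Unhealthy"}) > 0'
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: At least one Tempo ingester has been Unhealthy in the ring for 5 minutes.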

The behavior you describe is quite strange. I also think you should look at your ingester pages on your distributors and queriers. So visit each distributor/querier and look at /ingester/ring. Every distributor/querier should see the same state of the system (same ingesters with the same tokens). There may be a little variance due to propagation delays in memberlist, but once the system is in a steady state everything should basically be the same except for "last heartbeat".

(Screenshot of the /ingester/ring status page omitted.)

cancub commented 10 months ago

We finally got access to a deployment again, and we're definitely seeing instability. This is strange, though, given that we've scaled back our test.

First and foremost: we do not see any restarts of any components, nor do we see high memory or CPU usage. In our kubernetes deployment, no pod is using more than 1 GiB of memory or 0.003 CPU.

Additionally, we sometimes see discrepancies in the ingester health counts (as measured from tempo_ring_members{name="ingester"}) between the queriers and the distributors.

Right now we see the following in the queriers:

...
tempo_ring_members{name="ingester",state="ACTIVE"} 2
tempo_ring_members{name="ingester",state="JOINING"} 0
tempo_ring_members{name="ingester",state="LEAVING"} 0
tempo_ring_members{name="ingester",state="PENDING"} 0
tempo_ring_members{name="ingester",state="Unhealthy"} 1
...

and the following in distributors:

...
tempo_ring_members{name="ingester",state="ACTIVE"} 1
tempo_ring_members{name="ingester",state="JOINING"} 0
tempo_ring_members{name="ingester",state="LEAVING"} 0
tempo_ring_members{name="ingester",state="PENDING"} 0
tempo_ring_members{name="ingester",state="Unhealthy"} 2
...

Further, when we poll a single distributor for /ingester/ring and tempo_ring_members{name="ingester"} in /metrics, we get discrepancies for these counts as well. For example:

/metrics

...
tempo_ring_members{name="ingester",state="ACTIVE"} 1
tempo_ring_members{name="ingester",state="JOINING"} 0
tempo_ring_members{name="ingester",state="LEAVING"} 0
tempo_ring_members{name="ingester",state="PENDING"} 0
tempo_ring_members{name="ingester",state="Unhealthy"} 2
...

/ingester/ring (we can't include images, so here is the table content from the page):

  Instance ID            State      Address                   Registered At         Last Heartbeat             Tokens  Ownership
  o11y-tempo-ingester-0  UNHEALTHY  o11y-tempo-ingester:9095  2023-11-02T19:41:06Z  40m42.841s ago (19:56:01)  128     30.6%
  o11y-tempo-ingester-1  UNHEALTHY  o11y-tempo-ingester:9095  2023-11-02T19:41:53Z  30m40.841s ago (20:06:03)  128     37.2%
  o11y-tempo-ingester-2  UNHEALTHY  o11y-tempo-ingester:9095  2023-11-02T19:41:54Z  35m39.841s ago (20:01:04)  128     32.1%

which might have something to do with the heartbeats not restarting on the "healthy" ingester.

With all that in mind, we've got a couple questions:

  1. what does it mean for an ingester to be "healthy" from the observer's perspective?
  2. does replication happen via the distributor or the ingester (that is, does the distributor send traces to RF healthy ingesters or does it send to the one ingester associated with the token for that trace ID and it's the responsibility of the receiving ingester to replicate to its peers)?
  3. If the querier receives a request for a trace ID whose token is associated with an unhealthy ingester, does it just fail or does it still send the request to all healthy ingesters?

joe-elliott commented 10 months ago

what does it mean for an ingester to be "healthy" from the observer's perspective?

Tempo uses the ring for service discovery and health (among other things). Each member of the ring heartbeats at a regular interval to let other components know it's healthy. The ring is a data structure that is propagated between all components using memberlist by default. This is a gossip based protocol. For an ingester to be healthy it has to have heartbeated in a certain amount of time.

It seems like you might have some network issues that are preventing the ingesters from reliably heartbeating/participating in the memberlist cluster. Are you seeing memberlist error messages? Tempo also supports using Consul or etcd to store the ring, but I've never operated it that way.
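
For reference, the heartbeat behavior and the ring store are controlled by settings along these lines (a sketch; defaults and exact option names can vary by version):

    ingester:
      lifecycler:
        heartbeat_period: 5s        # how often the ingester writes a heartbeat into the ring
        ring:
          heartbeat_timeout: 1m     # members that haven't heartbeated within this window show as Unhealthy
          kvstore:
            store: memberlist       # default; consul or etcd can be used instead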

does replication happen via the distributor or the ingester (that is, does the distributor send traces to RF healthy ingesters or does it send to the one ingester associated with the token for that trace ID and it's the responsibility of the receiving ingester to replicate to its peers)?

The distributor handles RF. The issues you were initially describing seem to be due to the distributors/queriers not believing some ingesters exist.

If the querier receives a request for a trace ID whose token is associated with an unhealthy ingester, does it just fail or does it still send the request to all healthy ingesters?

If an ingester is unhealthy, the querier will still attempt to query the healthy ingesters. Once an ingester becomes unhealthy, its tokens are removed from the ring and the querier queries whichever ingester is the new owner.

cancub commented 10 months ago

We're going to investigate the issue of health discrepancies, but I just want to add another bit of information here:

Right now we see that both (2 out of 2) distributors are reporting that they view all 3 ingesters as active, but when we write a one-span trace into the deployment, we see that only one ingester shows that it has created a trace (tempo_ingester_traces_created_total). We should expect all three ingesters to report the creation of a new trace, no?

In fact, looking closer, we see that there seems to be a one-to-one distributor-to-ingester affinity right now. That is, each distributor, upon receiving a span for a new trace, sends this span to the same, lone ingester each time. Further highlighting this is the fact that the odd ingester out is sitting idle and has yet to create a single trace since we started this test hours ago. Is this expected?

joe-elliott commented 10 months ago

In fact, looking closer, we see that there seems to be a one-to-one distributor-to-ingester affinity right now. That is, each distributor, upon receiving a span for a new trace, sends this span to the same, lone ingester each time. Further highlighting this is the fact that the odd ingester out is sitting idle and has yet to create a single trace since we started this test hours ago. Is this expected?

Yeah, this is definitely incorrect. The 2 distributors (under RF3) should be writing to all 3 ingesters. Do the ingester status pages show 3 healthy ingesters on both distributors? They should show the exact same 3 ingesters with the same tokens.

You might also try the memberlist status endpoint: /memberlist. It may contain some clues about why the components are not communicating correctly.
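
The gossip side is governed by the memberlist block, roughly like this (the join address below is hypothetical; it should be a DNS name that resolves to all of your Tempo pods):

    memberlist:
      bind_port: 7946               # default gossip port
      join_members:
        - tempo-gossip-ring:7946    # hypothetical headless service covering all components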

cancub commented 10 months ago

Just checked the output of /memberlist for both distributors, and both show a view of all of the expected components, each with a state of 0 (Alive). I've also checked our configs and we haven't set replication_factor anywhere, so my assumption is that it takes the default of 3, but I'll set it explicitly and update here to confirm.

cancub commented 10 months ago

Same result with replication_factor: 3

cancub commented 10 months ago

Tested in a simplified, non-kubernetes environment with only Tempo components (more on that in a second) and we do indeed see the full replication to replication_factor ingesters, so I doubt the fault lies in Tempo itself but rather in our system for authn and authz.

In the kubernetes environment where we're seeing this issue, there are two primary differences:

  1. each of the pods housing a component also houses a traefik sidecar through which ingress traffic flows (this is to ensure mTLS connections between pods and enforce authorization)
  2. there's an overarching service orchestrating each of the component types

We're looking into these two possibilities.

cancub commented 10 months ago

We managed to resolve this issue. As it turns out, the problem was a hardcoded ingester address in the ingester.lifecycler. This was set to be the name and gRPC port of the ingester service:

...
      ingester:
        ...
        lifecycler:
          ...
          address: o11y-tempo-ingester
          port: 9095
...

This resulted in each of the ingesters reporting its address as o11y-tempo-ingester rather than its cluster IP. We can even see this in the /ingester/ring output shown in a previous comment in this thread. I imagine (I'm too lazy to go through the code) that, as a result, the distributor only had one "ingester" address configured to receive traces. In other words, the distributor was attempting to adhere to RF=3, but it only saw one possible target.

From there, the ingester service will obviously just load balance across the three ingester pods under its purview, and thus we only get a one-to-one write from distributors to ingesters rather than the one-to-many that we would expect with RF=3.

As proof of this being the source of the problem, we removed the address line from the config and re-deployed. Immediately, the individual IPs of the ingesters appeared in the ring and writes and queries were received by all three ingesters in our network, eliminating our 404 issue.
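
For anyone hitting the same thing, the working config is simply the lifecycler block without the hardcoded address and port, e.g.:

    ingester:
      lifecycler:
        # address: o11y-tempo-ingester   <- removed
        # port: 9095                     <- removed
        ring:
          replication_factor: 3          # default; left explicit from our earlier testing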

Thanks for your help. Feel free to ask any clarifying questions (I know it can be difficult to grok the full set of details without any images)

joe-elliott commented 10 months ago

Thanks for following up! Nice find on the issue. I will close for now, but feel free to reopen if there's anything else to resolve.