grafana / tempo

Grafana Tempo is a high volume, minimal dependency distributed tracing backend.
https://grafana.com/oss/tempo/
GNU Affero General Public License v3.0

Tempo in ec2 machine inside a container blocks for 4 minutes on startup #1727

Open · johnalotoski opened this issue 1 year ago

johnalotoski commented 1 year ago

When starting Tempo inside a container on an EC2 machine with our standard EC2 launch configuration, startup is delayed by about 4 minutes. The logs did not indicate the root cause of the stall, and debug logging did not help. Stracing the application showed that, after trying to open both ~/.aws/credentials and ~/.mc/config.json (neither of which existed in this case), a call was made to obtain an IMDSv2 token; a tcpflow capture of the packets shows:

172.026.066.018.35836-169.254.169.254.00080: PUT /latest/api/token HTTP/1.1
Host: 169.254.169.254

172.026.066.018.35922-169.254.169.254.00080: PUT / HTTP/1.1
Host: 169.254.169.254

169.254.169.254.00080-172.026.066.018.35922: HTTP/1.1 403 Forbidden
Content-Length: 0
Date: Thu, 08 Sep 2022 23:31:36 GMT
Server: EC2ws
Connection: close
Content-Type: text/plain

A test curl of this endpoint from within the container on the EC2 host confirms a ~2 minute timeout before the connection is closed:

bash-5.1# time curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"
curl: (56) Recv failure: Connection reset by peer

real    2m3.311s
user    0m0.004s
sys     0m0.005s

So it appears Tempo calls out for an IMDSv2 token twice, timing out after 2 minutes each time, blocking each startup for a total of ~4 minutes. Even though Tempo fails to obtain an IMDSv2 token in this circumstance, the call appears unnecessary as far as I can tell: the S3 bucket used for backend storage is already accessible, with the proper permissions, via the IAM instance profile assigned to the EC2 host.

The root cause of the IMDSv2 timeout appears to be that the EC2 machine must have an http-put-response-hop-limit of 2 when containers running on it will be requesting IMDSv2 tokens.

Modifying the EC2 host's metadata http-put-response-hop-limit to 2 resolves the 4-minute startup blocking, and Tempo starts up normally.
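For reference, one way to apply that change with the AWS CLI is sketched below; the instance ID is a placeholder, and if your hosts come from a launch template or other IaC tooling the equivalent metadata option should be set there instead:

# Raise the IMDSv2 PUT response hop limit so a containerized process
# (one extra network hop away) can still reach the token endpoint.
aws ec2 modify-instance-metadata-options \
  --instance-id <instance-id> \
  --http-put-response-hop-limit 2 \
  --http-endpoint enabled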

Expected behavior: No extended startup blocking, or, if there is extended blocking, at least a log line indicating what might be stalling the app and how to fix it. I spent a good part of my day tracking down what the problem was here.

Environment:

Additional Context

Log example of stall on startup:

level=info ts=2022-09-08T22:27:54.393680189Z caller=main.go:191 msg="initialising OpenTracing tracer"
level=info ts=2022-09-08T22:27:54.419730857Z caller=main.go:106 msg="Starting Tempo" version="(version=, branch=, revision=)"
level=info ts=2022-09-08T22:29:55.406148331Z caller=server.go:306 http=127.0.0.1:3200 grpc=[::]:9096 msg="server listening on addresses"
level=info ts=2022-09-08T22:31:56.195848571Z caller=frontend.go:43 msg="creating middleware in query frontend"
level=info ts=2022-09-08T22:31:56.197152275Z caller=tempodb.go:411 msg="polling enabled" interval=5m0s concurrency=50
...

IMDSv2 reference link.

Thank you.

joe-elliott commented 1 year ago

We appreciate the detailed triage of this issue. Thank you for the effort you've put in.

In this case we rely on minio's s3 client for handling all of the s3 authentication. I'm wondering if there's some configuration change that can be made to prevent it from reaching out to the listed endpoint incorrectly.
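One possible workaround, sketched here only under assumptions (it relies on the access_key and secret_key storage parameters and the env var expansion mentioned later in this thread; the bucket, endpoint, paths, and the -config.expand-env flag name should all be checked against your Tempo version), is to hand the client static credentials so it never consults the instance metadata service:

# Sketch: give the minio client explicit credentials so it skips the IMDS
# credential lookup. Only the storage stanza is shown; the rest of the Tempo
# config is omitted, and bucket/endpoint values are placeholders.
export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>

cat >> /etc/tempo/tempo.yaml <<'EOF'
storage:
  trace:
    backend: s3
    s3:
      bucket: my-tempo-bucket
      endpoint: s3.us-east-1.amazonaws.com
      access_key: ${AWS_ACCESS_KEY_ID}
      secret_key: ${AWS_SECRET_ACCESS_KEY}
EOF

# -config.expand-env (flag name assumed) substitutes the env vars at startup.
tempo -config.file=/etc/tempo/tempo.yaml -config.expand-env=true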

So it appears Tempo calls out for an IMDSv2 token twice, timing out after 2 minutes each time, blocking each startup for a total of ~4 minutes

This makes sense. We actually create two clients; one has hedged requests and the other does not.

Even though Tempo fails to obtain an IMDSv2 token in this circumstance, the call appears unnecessary as far as I can tell: the S3 bucket used for backend storage is already accessible, with the proper permissions, via the IAM instance profile assigned to the EC2 host.

It's been a while since I worked in AWS, but I seem to recall that even with the EC2 IAM profile the process still has to reach out to http://169.254.169.254 to retrieve a token. Is this not the case? What am I misunderstanding about this?

The root cause of the IMDSv2 timeout appears to be that the EC2 machine must have an http-put-response-hop-limit of 2 when containers running on it will be requesting IMDSv2 tokens.

This is really strange. Does ec2 limit the number of hops specifically for http put requests? Is something just dropping the packets b/c the number of hops is exceeded with default config?

Apparently minio has an often-referenced, long-running issue here:

https://github.com/aws/aws-sdk-go/issues/2972

johnalotoski commented 1 year ago

Hi @joe-elliott, thanks for the thanks! :).

In this case we rely on minio's s3 client for handling all of the s3 authentication. I'm wondering if there's some configuration change that can be made to prevent it from reaching out to the listed endpoint incorrectly.

I'm not really familiar with the Minio client, but I suspect it was correct in its behavior to try to reach out for instance metadata in this case; see the next comment for why.

It's been a while since I worked in AWS, but I seem to recall that even with the EC2 IAM profile the process still has to reach out to http://169.254.169.254 to retrieve a token. Is this not the case? What am I misunderstanding about this?

Just a guess as to what happened in this case: since S3 credentials were not provided in the form of an AWS config file or a Minio config file, nor passed directly as secrets to Tempo (i.e. via env var expansion to the access_key and secret_key Tempo config parameters), the Minio client Tempo uses makes a last-ditch attempt to obtain S3 credentials for the bucket (which, by reasonable assumption, is not world readable/writeable). It assumes the machine it's running on is a cloud machine with instance metadata available and tries to pick up the IAM credentials the machine may already have been granted through its instance profile role, i.e.:

curl http://169.254.169.254/latest/meta-data/identity-credentials/ec2/security-credentials/ec2-instance

This seems like a reasonable thing for the client to do (and if I'm wrong about why it's trying to access metadata in the first place, I have no idea why else it would need metadata access). It first tries a PUT to IMDSv2 to obtain a session token for reading those instance profile IAM credentials; when that failed after 2 tries and 4 minutes, it presumably fell back to IMDSv1 and issued a direct GET to the metadata security-credentials endpoint, which IMDSv1, if still enabled on the EC2 machine (it is by default, ref), readily supplies.
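For comparison, the IMDSv2 flow the client appears to attempt first looks roughly like the following (standard IMDS endpoints; the token TTL value is arbitrary):

# IMDSv2: fetch a session token first, then present it on the metadata GET.
# The PUT below is the call that times out when the hop limit is 1 and the
# caller is inside a container.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/identity-credentials/ec2/security-credentials/ec2-instance"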

This is really strange. Does ec2 limit the number of hops specifically for http put requests? Is something just dropping the packets b/c the number of hops is exceeded with default config?

Yes, according to the docs (see the "ref" link immediately above), this hop limit applies specifically to IMDSv2 PUT responses, which explains why curls from within the container, 2 hops away from the metadata server, still succeed against IMDSv1 endpoints when IMDSv1 is also enabled. And yes, the default PUT hop limit is "1", so other Tempo users running containers on EC2 will probably run into this as well.
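To check what a given instance is currently configured with (the instance ID is a placeholder):

# Inspect the instance's current IMDS settings, including HttpPutResponseHopLimit.
aws ec2 describe-instances \
  --instance-ids <instance-id> \
  --query 'Reservations[].Instances[].MetadataOptions'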

I'm not sure what the best way to resolve this would be, but I can say that even a single WARN log line pointing to the IMDSv2 hop limit as a potential problem when the call takes this long would have saved me several hours of investigating. Maybe that wouldn't be difficult to implement if there isn't a better short-term fix?

hagaibarel commented 1 year ago

Can confirm, we're seeing the same behavior with Kubernetes running on plain EC2 (not EKS) hosts.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.

hagaibarel commented 1 year ago

This is still an issue for us.

joe-elliott commented 1 year ago

I'm unsure what we can do except provide better documentation. WDYT? Is there a fix here on our side?

johnalotoski commented 1 year ago

Would it be possible to have Tempo log a line pointing the user to the appropriate documentation when they encounter this problem? That would have saved me, and probably anybody else who encounters the problem, a few hours of research trying to understand what was broken where and what to do about it.

joe-elliott commented 1 year ago

Would it be possible to have Tempo log a line pointing the user to the appropriate documentation when they encounter this problem? That would have saved me, and probably anybody else who encounters the problem, a few hours of research trying to understand what was broken where and what to do about it.

I really like this idea. It's a nasty issue that presumably impacts a fair number of people. Bonus points if we can do it explicitly on timeout.

github-actions[bot] commented 8 months ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.