johnalotoski opened 1 year ago
We appreciate the detailed triage of this issue. Thank you for the effort you've put in.
In this case we rely on minio's s3 client for handling all of the s3 authentication. I'm wondering if there's some configuration change that can be made to prevent it from reaching out to the listed endpoint incorrectly.
So it appears tempo makes a call to obtain an IMDSv2 token twice, timing out each time after 2 minutes, therefore blocking each startup for a total of ~4 minutes.
This makes sense. We actually create two clients: one uses hedged requests and the other does not.
Even though Tempo fails to obtain an IMDSv2 token in this circumstance, the token appears unnecessary as far as I can tell, since the s3 bucket for the backend storage is already accessible with proper permissions via the assigned ec2 host IAM profile.
It's been awhile since I worked in AWS, but I thought I recall that even with the ec2 IAM profile the process still had to reach out to http://169.254.169.254 to retrieve a token. Is this not the case? What am I misunderstanding about this?
Root cause of the IMDSv2 timeout problem appears to be a requirement for the ec2 machine to have a http-put-response-hop-limit of 2 for containers.
This is really strange. Does ec2 limit the number of hops specifically for http put requests? Is something just dropping the packets b/c the number of hops is exceeded with default config?
Apparently minio has an often-referenced, long-running issue here:
Hi @joe-elliott, thanks for the thanks! :).
In this case we rely on minio's s3 client for handling all of the s3 authentication. I'm wondering if there's some configuration change that can be made to prevent it from reaching out to the listed endpoint incorrectly.
I'm not really familiar with the Minio client, but I suspect it was correct in its behavior to try and reach out for instance metadata in this case, see next comment on why.
It's been awhile since I worked in AWS, but I thought I recall that even with the ec2 IAM profile the process still had to reach out to http://169.254.169.254/ to retrieve a token. Is this not the case? What am I misunderstanding about this?
Just a guess as to what happened in this case: since s3 credentials were not provided in the form of an AWS config file, a Minio config file, or passed directly as secrets to Tempo (ie: via env var expansion to the access_key and secret_key tempo config parameters), the Minio client Tempo is using makes a last-effort attempt to obtain s3 credentials for a bucket which, by good assumption, is not world readable/writeable. It assumes the machine it's running on is a cloud machine with cloud metadata available, and attempts to obtain the IAM credentials the machine may already have been granted through the instance profile IAM role permissions, ie:
curl http://169.254.169.254/latest/meta-data/identity-credentials/ec2/security-credentials/ec2-instance
This seems like a reasonable thing for the client to do (and if I'm wrong about why it's trying to access metadata in the first place, then I have no idea why else it would need metadata access). It tries a PUT to IMDSv2 to obtain session access to metadata, in order to read those instance profile IAM credentials. When that failed after 2 tries and 4 minutes, it presumably fell back to IMDSv1 and issued a direct GET to the metadata security-credentials endpoint, which IMDSv1, if still enabled on the ec2 machine (it is by default, ref), should readily supply.
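For context, the IMDSv2 handshake described above is a two-step flow: a PUT to obtain a session token, then a GET with that token to read credentials. A minimal sketch (only meaningful on an ec2 host; -m caps the client-side wait so a dropped PUT fails fast instead of hanging):

```shell
# Step 1: IMDSv2 session token via PUT. This is the request that is
# silently dropped when the container is more hops away from the
# metadata service than http-put-response-hop-limit allows.
TOKEN=$(curl -s -m 5 -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")

# Step 2: use the token to list/read the instance-profile credentials.
curl -s -m 5 \
  -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
```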
This is really strange. Does ec2 limit the number of hops specifically for http put requests? Is something just dropping the packets b/c the number of hops is exceeded with default config?
Yes, according to the docs (see the "ref" link immediately above), this hop limit is specific to the IMDSv2 PUT endpoint only, which explains why curls from within the container, at 2 hops away from the metadata server, still succeed against IMDSv1 endpoints when IMDSv1 is also enabled. And yes, the default PUT hop limit is "1", so other Tempo users running containers on ec2 will probably run into this as well.
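For anyone checking whether their instances are affected, the current hop limit can be inspected with the AWS CLI (the instance id below is a placeholder):

```shell
# Show the metadata options for an instance, including
# HttpPutResponseHopLimit (default is 1) and HttpTokens.
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 \
  --query 'Reservations[].Instances[].MetadataOptions'
```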
Not sure what the best way to resolve this would be, but I can say that even a single WARN log line pointing to the IMDSv2 hop limit as a potential problem in the case of long call times would have saved me a number of hours of investigation. Maybe that wouldn't be too difficult to implement if there isn't a better short-term fix?
Can confirm, seeing the same behavior on kubernetes running on plain ec2 (not eks) hosts.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. The next time this stale check runs, the stale label will be removed if there is new activity. The issue will be closed after 15 days if there is no new activity. Please apply keepalive label to exempt this Issue.
This is still an issue for us.
I'm unsure what we can do except provide better documentation. WDYT? Is there a fix here on our side?
Would it be possible to have tempo log a line pointing a user to appropriate documentation on this issue in the case they encounter this problem? That would have saved me, and probably anybody else who encounters the problem, a few hours of research trying to understand what is broken where and what to do about it.
Would it be possible to have tempo log a line pointing a user to appropriate documentation on this issue in the case they encounter this problem? That would have saved me, and probably anybody else who encounters the problem, a few hours of research trying to understand what is broken where and what to do about it.
I really like this idea. It's a nasty issue that presumably impacts a fair number of people. Bonus points if we can do it explicitly on timeout.
When starting up tempo inside a container on an ec2 machine with our standard ec2 launch configuration, tempo is delayed in starting by about 4 minutes. Logging did not indicate the root cause of the stalling. Debug logging did not help. Stracing the application showed that after trying to open both of ~/.aws/credentials and ~/.mc/config.json, neither of which existed in this case, a call to obtain an IMDSv2 token was made. tcpflow of the packets shows:

Test curling this endpoint from within the container on the ec2 host confirms a ~2 minute timeout before the curl connection is closed:
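As a sketch (not the original capture), the probe above can be reproduced with a shorter client-side timeout so the failure surfaces in seconds rather than minutes:

```shell
# From inside the container: attempt the IMDSv2 token PUT with a 5s cap.
# With the default hop limit of 1, the response never makes it back
# through the container's extra network hop and curl times out.
curl -v -m 5 -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"
```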
So it appears tempo makes a call to obtain an IMDSv2 token twice, timing out each time after 2 minutes, therefore blocking each startup for a total of ~4 minutes. Even though Tempo fails to obtain an IMDSv2 token in this circumstance, the token appears unnecessary as far as I can tell, since the s3 bucket for the backend storage is already accessible with proper permissions via the assigned ec2 host IAM profile.
Root cause of the IMDSv2 timeout problem appears to be a requirement for the ec2 machine to have a http-put-response-hop-limit of 2 for containers which will be making calls for IMDSv2 tokens on ec2 machines. Modifying the ec2 host metadata http-put-response-hop-limit to 2 resolves the 4 minute startup blocking and tempo starts up like normal.

Expected behavior
No extended startup blocking, or at least, if there is extended blocking, a log line indicating what might be stalling the app and how to fix it. I spent a good part of my day tracking down what the problem was here.
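The hop-limit change described above can be applied to a running instance with the AWS CLI (the instance id below is a placeholder; the same option can also be set in a launch template):

```shell
# Raise the IMDSv2 PUT response hop limit so containerized workloads
# (one extra network hop away) can complete the token handshake.
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-put-response-hop-limit 2 \
  --http-endpoint enabled
```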
Environment:
Additional Context
Log example of stall on startup:
IMDSv2 reference link.
Thank you.