Closed — zdelagrange closed this issue 9 months ago
Tagging component owners @carolabadeer @blumamir
I think this is a P1 bug. I'll assign that priority for now and the code owner can update it if required.
Hey, we ran into this too; it's not isolated to S3. It also seems to affect requests that are retried.
Kinesis without retry: https://sandbox.baselime.io/baselime/sandbox/datasets/otel/traces/1d5a0a575ff7e392a8e7f9d47e20c912
Kinesis with retry: https://sandbox.baselime.io/baselime/sandbox/datasets/otel/traces/2c297889cf56f6d97287ed230852416b
This is using these versions:

```
"@opentelemetry/api": "^1.4.1"
"@opentelemetry/exporter-trace-otlp-http": "^0.40.0"
"@opentelemetry/instrumentation": "^0.40.0"
"@opentelemetry/instrumentation-aws-lambda": "^0.35.2"
"@opentelemetry/instrumentation-aws-sdk": "^0.34.2"
"@opentelemetry/instrumentation-http": "^0.40.0"
"@opentelemetry/sdk-trace-base": "^1.14.0"
"@opentelemetry/sdk-trace-node": "^1.14.0"
```
This is from https://github.com/baselime/lambda-node-opentelemetry
I am looking into a fix; if it's helpful, I can post anything that I find here.
Howdy,
Any update on this? Would be good to at least get some triage info from the maintainers if possible...
Would upgrading to AWS SDK v3 fix this?
Thank you !!
I ran into this issue during our migration to OpenTelemetry as well; the problem is with instrumentation-http, not the AWS SDK instrumentation specifically.

The `_outgoingRequestFunction` function here injects tracing headers into `optionsParsed.headers`, but that property holds a reference to the same object that the instrumented code passed in the `options` argument to Node.js's `request` or `get` function, so the injected headers leak back into the caller's internal state. This can cause a wide variety of unexpected behavior.
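To make the aliasing concrete, here is a minimal standalone sketch (with hypothetical function names, not the actual instrumentation code) of how injecting into a shared headers reference mutates the caller's object:

```javascript
// Minimal sketch (hypothetical names, not the real instrumentation) of how
// injecting trace headers into a shared reference mutates the caller's object.
function injectTraceHeaders(options) {
  // Simulates propagation.inject() writing into options.headers in place.
  options.headers.traceparent = '00-abc-def-01';
}

const callerHeaders = { Foo: 'Bar' };
injectTraceHeaders({ headers: callerHeaders });

// The caller's own object now carries the injected header.
console.log(callerHeaders); // { Foo: 'Bar', traceparent: '00-abc-def-01' }
```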
Regarding the specific issue that we ran into with the AWS SDK making calls to S3: we have a bucket located in us-east-2, and when we try to interact with it, AWS sends back a redirect to a different API endpoint. The v2 AWS SDK then retries its request against this other endpoint, and in the process it needs to update the signature on the request's HTTP headers. The injected `traceparent` header has leaked into the object that the AWS SDK uses internally to track the request headers, which causes the SDK to include the `traceparent` header in the retry's request signature. The OpenTelemetry instrumentation then injects a different `traceparent` on this second request, and that breaks the request signature.
The solution is simple enough: replace the value of `optionsParsed.headers` with a clone of the object it points to. We locally hacked in a patch to this effect, and it made the problem go away in our environment. Unfortunately it may take a while before I am able to contribute a proper upstream fix, but if anybody wants to work this description up into a PR against opentelemetry-js in the meantime, please feel free.
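A standalone sketch of that clone-before-inject approach (hypothetical names, not the actual patch):

```javascript
// Sketch of the proposed fix (hypothetical names): clone the headers object
// before injecting, so the caller's object is left untouched.
function injectTraceHeadersSafe(options) {
  // Clone first, as the suggested fix does.
  options.headers = Object.assign({}, options.headers);
  options.headers.traceparent = '00-abc-def-01';
}

const callerHeaders = { Foo: 'Bar' };
const opts = { headers: callerHeaders };
injectTraceHeadersSafe(opts);

console.log('traceparent' in callerHeaders); // false
console.log('traceparent' in opts.headers); // true
```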
@ten-ans hey, I also followed up on this bug on the AWS end. I was surprised that it only affects the v2 SDK and not v3.

I found that the v2 SDK caches the signature it adds and does not recompute it on retries.

I raised an issue there and got a simpler workaround:

```js
AWS.Signers.V4.prototype.unsignableHeaders.push('traceparent');
```

which I have tested and confirmed works.

Hope this helps
https://github.com/aws/aws-sdk-js/issues/4472#issuecomment-1660786070
Thanks @ten-ans and @Ankcorn for your thorough investigation of this issue!
@Ankcorn from reading through the issue you opened in the AWS SDK repo, it seems that they confirmed it is an issue with the AWS SDK v2 (and does not persist in v3). Is the suggested workaround a change that can be added in an application or an instrumentation change?
Hey @carolabadeer,
Thanks for taking a look at this :) The PR to add traceparent (https://github.com/aws/aws-sdk-js/pull/4473) has not been looked at yet.

I also don't know if it totally solves the problem; AFAIK Lambda runtimes would still pull in an old and incompatible version of the aws-sdk, which would cause some confusion for new OTel users.
We might be able to fix this by adding

```js
AWS.Signers.V4.prototype.unsignableHeaders.push('traceparent');
```

to the AWS SDK instrumentation.
Thoughts?
Hey @mhausenblas, is there someone from the AWS SDK team we can coordinate with? @martinkuba is going to spend some time here too.
The issue -- as @ten-ans pointed out above (https://github.com/open-telemetry/opentelemetry-js-contrib/issues/1609#issuecomment-1671657793) -- is that the HttpInstrumentation modifies the `headers` object passed into `http.request()` (or `http.get()`) in place to add the trace-context header(s):

```js
http.get(someUrl, { headers: { Foo: 'Bar' } }, (res) => { ... });
```
Often, calling code won't notice this change. However, the aws-sdk code that retries keeps a reference to that headers object, so this is what happens:

1. `theHeaders = { Foo: 'Bar' }`. AWS SigV4 request signing includes that `Foo` header, and any others in there, to create a signature for the request.
2. OTel injects a traceparent: `theHeaders = { Foo: 'Bar', traceparent: 'a-traceparent-for-this-first-request' }`. The request is sent. It fails for some reason, but is retryable.
3. `theHeaders` still has the old traceparent value, so the AWS SDK now includes both the `Foo` and `traceparent` headers in its request signing. This would normally be fine; however, the order is: the aws-sdk uses the old traceparent value for the request signature, then OTel changes the traceparent to a new value for this next request, then the request is sent. The server notices the signature mismatch.

This can be seen in this example:
```js
const { NodeSDK, tracing } = require('@opentelemetry/sdk-node');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');

const sdk = new NodeSDK({
  serviceName: 'make-http-req',
  spanProcessor: new tracing.SimpleSpanProcessor(new tracing.ConsoleSpanExporter()),
  instrumentations: [
    new HttpInstrumentation(),
  ],
});
sdk.start();

const http = require('http');

const myHeaders = {
  Foo: 'Bar',
};
http.get('http://localhost:8200/', { headers: myHeaders }, (res) => {
  console.log('response: %s %s', res.statusCode, res.headers);
  res.resume();
  res.on('end', () => {
    console.log('response end');
  });
});
console.log('http.get() called: was myHeaders mutated? %s', myHeaders);
```
console.log('http.get() called: was myHeaders mutated? %s', myHeaders);
Running that shows that `myHeaders` is mutated:
```
% node make-http-req.js
http.get() called: was myHeaders mutated? {
  Foo: 'Bar',
  traceparent: '00-35f4aeff468b2067ae1d5dba82c63ed4-0afcfcdb409c2d13-01'
}
response: 200 {
  date: 'Sat, 25 Nov 2023 00:17:39 GMT',
  connection: 'close',
  'transfer-encoding': 'chunked'
}
{
  traceId: '35f4aeff468b2067ae1d5dba82c63ed4',
  parentId: undefined,
  traceState: undefined,
  name: 'GET',
  id: '0afcfcdb409c2d13',
  kind: 2,
  timestamp: 1700871459100000,
  duration: 11929.505,
  attributes: {
    'http.url': 'http://localhost:8200/',
    'http.method': 'GET',
    'http.target': '/',
    'net.peer.name': 'localhost',
    'http.host': 'localhost:8200',
    'net.peer.ip': '127.0.0.1',
    'net.peer.port': 8200,
    'http.status_code': 200,
    'http.status_text': 'OK',
    'http.flavor': '1.1',
    'net.transport': 'ip_tcp'
  },
  status: { code: 0 },
  events: [],
  links: []
}
response end
```
I got lucky here, because I hit this before in Elastic's APM agent. A longer write-up is here: https://github.com/elastic/apm-agent-nodejs/issues/2134
It makes sense why -- as noted in https://github.com/open-telemetry/opentelemetry-js/issues/3922 -- https://github.com/open-telemetry/opentelemetry-js/commit/49787433b66a17a1788a20f3a7edda3aa2580890 changed the behaviour: the header normalization before that commit was indirectly creating a clone of the headers object.
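A sketch of that incidental cloning (a hypothetical normalization function, for illustration only): rebuilding the headers key-by-key yields a fresh object, so a later in-place inject never touches the caller's original.

```javascript
// Hypothetical sketch, in the spirit of the old behaviour: lower-casing
// header names into a fresh object incidentally clones it, shielding the
// caller's object from the later in-place traceparent injection.
function normalizeHeaders(headers) {
  const normalized = {};
  for (const key of Object.keys(headers)) {
    normalized[key.toLowerCase()] = headers[key];
  }
  return normalized;
}

const callerHeaders = { Foo: 'Bar' };
const optionsHeaders = normalizeHeaders(callerHeaders); // a new object
optionsHeaders.traceparent = '00-abc-def-01'; // inject into the clone only

console.log('traceparent' in callerHeaders); // false
```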
An alternative to https://github.com/open-telemetry/opentelemetry-js-contrib/pull/1813 would be something like this -- again, as @ten-ans suggested above.
```diff
diff --git a/experimental/packages/opentelemetry-instrumentation-http/src/http.ts b/experimental/packages/opentelemetry-instrumentation-http/src/http.ts
index 9422bbc9..780f5c32 100644
--- a/experimental/packages/opentelemetry-instrumentation-http/src/http.ts
+++ b/experimental/packages/opentelemetry-instrumentation-http/src/http.ts
@@ -672,6 +672,10 @@ export class HttpInstrumentation extends InstrumentationBase<Http> {
     if (!optionsParsed.headers) {
       optionsParsed.headers = {};
+    } else {
+      // Make a copy of the headers object to avoid mutating an object the
+      // caller might have a reference to.
+      optionsParsed.headers = Object.assign({}, optionsParsed.headers);
     }
     propagation.inject(requestContext, optionsParsed.headers);
```
@martinkuba Thoughts? Should I make a PR for that?
> Should I make a PR for that?
I've added a PR for the alternative fix here: https://github.com/open-telemetry/opentelemetry-js/pull/4346
To be clear, this should be fixed with the next release of `@opentelemetry/instrumentation-http` (any release after 0.45.1).
What version of OpenTelemetry are you using?
What version of Node are you using?
v16.20.0
What did you do?
When using the recommended Node.js auto-instrumentation setup found here: https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/#setup, with a global S3 configuration, the auto-instrumentation causes an error to be thrown when AWS_REGION is set to a region other than us-east-1 or aws-global. The requests succeed when instrumentation.js is not required.
Reproduction here: https://github.com/zdelagrange/otel-nodejs-aws-bug
What did you expect to see?
I expected the S3 functionality not to change when using the auto-instrumentation.
What did you see instead?
S3 functionality changed when using the auto-instrumentation, causing a 403/failure to be returned, whereas the same request succeeds without the auto-instrumentation:
Additional context
Reproduction here: https://github.com/zdelagrange/otel-nodejs-aws-bug