open-telemetry / opentelemetry-js-contrib

OpenTelemetry instrumentation for JavaScript modules
https://opentelemetry.io
Apache License 2.0

AWS S3 behavior changes when using the auto instrumentation #1609

Closed: zdelagrange closed this issue 9 months ago

zdelagrange commented 1 year ago

What version of OpenTelemetry are you using?

"@opentelemetry/auto-instrumentations-node": "^0.37.1",
"@opentelemetry/sdk-node": "^0.40.0",

What version of Node are you using?

v16.20.0

What did you do?

When using the recommended Node.js auto-instrumentation setup described here: https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/#setup, with a global S3 configuration, the auto-instrumentation causes an error to be thrown when AWS_REGION is set to a region other than us-east-1 or aws-global. The same requests succeed when instrumentation.js is not required.

Reproduction here: https://github.com/zdelagrange/otel-nodejs-aws-bug
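
For reference, the setup from that guide roughly amounts to an instrumentation.js like the following (a sketch based on the linked page; the console exporter here is just illustrative):

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { ConsoleSpanExporter } = require('@opentelemetry/sdk-trace-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

// Register all auto-instrumentations (including aws-sdk and http) and start the SDK.
const sdk = new NodeSDK({
  traceExporter: new ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();

The application is then started with something like node --require ./instrumentation.js app.js.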

What did you expect to see?

I expected S3 functionality not to change when using the auto-instrumentation.

What did you see instead?

S3 functionality changed when using the auto-instrumentation. A request that succeeds without it instead fails with a 403:

SignatureDoesNotMatch: The request signature we calculated does not match the signature you provided. Check your key and signing method.
    at Request.extractError (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/services/s3.js:711:35)
    at Request.callListeners (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/request.js:686:14)
    at Request.transition (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/request.js:688:12)
    at Request.callListeners (/Users/zdelagrange/src/otel-nodejs-aws-bug/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'SignatureDoesNotMatch',
  region: null,
  time: 2023-07-21T14:49:13.821Z,
  requestId: 'xxx',
  extendedRequestId: 'xxx',
  cfId: undefined,
  statusCode: 403,
  retryable: false,
  retryDelay: 175.4145547667371
}

Additional context

Reproduction here: https://github.com/zdelagrange/otel-nodejs-aws-bug

zdelagrange commented 1 year ago

Tagging component owners @carolabadeer @blumamir

dyladan commented 1 year ago

I think this is a P1 bug. I'll assign that priority for now and the code owner can update it if required.

Ankcorn commented 1 year ago

Hey, we ran into this too; it's not isolated to S3. It seems to affect requests that are retried, too.

Kinesis without retry: https://sandbox.baselime.io/baselime/sandbox/datasets/otel/traces/1d5a0a575ff7e392a8e7f9d47e20c912

Kinesis with retry: https://sandbox.baselime.io/baselime/sandbox/datasets/otel/traces/2c297889cf56f6d97287ed230852416b

This is using these versions:

"@opentelemetry/api": "^1.4.1"
"@opentelemetry/exporter-trace-otlp-http": "^0.40.0"
"@opentelemetry/instrumentation": "^0.40.0"
"@opentelemetry/instrumentation-aws-lambda": "^0.35.2"
"@opentelemetry/instrumentation-aws-sdk": "^0.34.2"
"@opentelemetry/instrumentation-http": "^0.40.0"
"@opentelemetry/sdk-trace-base": "^1.14.0"
"@opentelemetry/sdk-trace-node": "^1.14.0"

From this https://github.com/baselime/lambda-node-opentelemetry

I am looking into a fix, and if it's helpful I can post anything that I find here.

zdelagrange commented 1 year ago

Howdy,

Any update on this? It would be good to at least get some triage info from the maintainers if possible...

Would upgrading to AWS SDK v3 fix this?

Thank you !!

ten-ans commented 1 year ago

I ran into this issue during our migration to OpenTelemetry as well. The problem is with instrumentation-http, not with the AWS SDK instrumentation specifically.

https://github.com/open-telemetry/opentelemetry-js/blob/a4213183b03664cdf29e622932ea4823a9b8de50/experimental/packages/opentelemetry-instrumentation-http/src/http.ts#L587

The _outgoingRequestFunction function here injects tracing headers into optionsParsed.headers, but that property holds a reference to the same object that the instrumented code passed in the options to Node.js's request or get function, so the injected headers leak back out into the caller's internal state. This can cause a wide variety of unexpected behavior.

Regarding the specific issue that we ran into with the AWS SDK making calls to S3: we have a bucket located in us-east-2, and when we try to interact with it, AWS sends back a redirect to a different API endpoint. The v2 AWS SDK then retries its request against this other endpoint, and in the process it needs to update the signature on the request's HTTP headers. An injected traceparent header leaks into the object that the AWS SDK uses internally to track the request headers, which causes the AWS SDK to include the traceparent header in the retry's request signature. The OpenTelemetry instrumentation then injects a different traceparent on this second request, and that breaks the request signature.

The solution is simple enough: replace the value of optionsParsed.headers with a clone of the object that it points to. We locally hacked in a patch to this effect and it made the problem go away in our environment. Unfortunately it may take a while before I am able to contribute a proper upstream fix for this issue, but if anybody wants to work that description up into a PR against opentelemetry-js in the meantime, please feel free.
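
For illustration, a minimal sketch of that defensive copy (simplified names, not the actual http.ts code):

const { propagation } = require('@opentelemetry/api');

// Simplified sketch: copy the caller's headers before injecting trace context,
// so the injected traceparent never leaks back into the caller's own object.
function injectIntoCopiedHeaders(requestContext, optionsParsed) {
  optionsParsed.headers = Object.assign({}, optionsParsed.headers);
  propagation.inject(requestContext, optionsParsed.headers);
  return optionsParsed;
}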

Ankcorn commented 1 year ago

@ten-ans hey, I also followed up on this bug on the AWS end. I was surprised that it only affects the v2 SDK and not v3.

I found that the v2 SDK caches the signature it adds and does not recompute it on retries.

I raised an issue (linked below) and got a simpler workaround:

AWS.Signers.V4.prototype.unsignableHeaders.push('traceparent');

I have tested this and confirmed that it works.
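
For anyone applying this in an application, a minimal sketch of where the workaround goes (assumes the aws-sdk v2 package; the bucket and key names are placeholders):

const AWS = require('aws-sdk');

// Tell the v2 SigV4 signer not to include traceparent in the request signature,
// so a retry whose traceparent is rewritten by the HTTP instrumentation still
// carries a valid signature.
AWS.Signers.V4.prototype.unsignableHeaders.push('traceparent');

const s3 = new AWS.S3();
s3.getObject({ Bucket: 'example-bucket', Key: 'example-key' }, (err, data) => {
  if (err) console.error('getObject failed:', err);
  else console.log('getObject returned %d bytes', data.Body.length);
});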

Hope this helps

https://github.com/aws/aws-sdk-js/issues/4472#issuecomment-1660786070

carolabadeer commented 1 year ago

Thanks @ten-ans and @Ankcorn for your thorough investigation of this issue!

@Ankcorn from reading through the issue you opened in the AWS SDK repo, it seems they confirmed this is an issue with the AWS SDK v2 (and does not occur in v3). Is the suggested workaround a change that should be made in the application, or a change to the instrumentation?

Ankcorn commented 1 year ago

Hey @carolabadeer,

Thanks for taking a look at this :) https://github.com/aws/aws-sdk-js/pull/4473 The PR to add traceparent has not been looked at yet.

I also don't know if it totally solves the problem. AFAIK, wouldn't Lambda runtimes still pull in an old and incompatible version of the aws-sdk? This would cause some confusion for new OTel users.

We might be able to fix this by adding the

AWS.Signers.V4.prototype.unsignableHeaders.push('traceparent'); 

to the AWS SDK instrumentation.
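
Purely as a sketch of what that might look like inside the instrumentation (hypothetical helper, not actual @opentelemetry/instrumentation-aws-sdk code):

// Hypothetical helper the aws-sdk instrumentation could run when it patches the
// v2 SDK module; the guards are there because this reaches into SDK internals.
function markTraceparentUnsignable(awsSdkModule) {
  const signerProto =
    awsSdkModule &&
    awsSdkModule.Signers &&
    awsSdkModule.Signers.V4 &&
    awsSdkModule.Signers.V4.prototype;
  if (
    signerProto &&
    Array.isArray(signerProto.unsignableHeaders) &&
    !signerProto.unsignableHeaders.includes('traceparent')
  ) {
    signerProto.unsignableHeaders.push('traceparent');
  }
}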

Thoughts?

cartersocha commented 11 months ago

Hey @mhausenblas is there someone from the AWS SDK team we can coordinate with? @martinkuba is going to spend some time here too

trentm commented 10 months ago

The issue -- as @ten-ans pointed out above (https://github.com/open-telemetry/opentelemetry-js-contrib/issues/1609#issuecomment-1671657793) -- is that the HttpInstrumentation modifies the headers object passed into http.request() (or http.get()) in-place to add the trace-context header(s):

http.get(someUrl, { headers: { Foo: 'Bar' } }, (res) => { ... });

Often, calling code won't notice this change. However, the aws-sdk keeps a reference to that headers object across retries, so this is what happens:

  1. The first HTTP request is made with, for example, theHeaders = { Foo: 'Bar' }. AWSv4 request signing will include that "Foo" header, and any others in there, to create a signature for the request.
  2. The OTel HTTP instrumentation changes theHeaders to { Foo: 'Bar', traceparent: 'a-traceparent-for-this-first-request' }. The request is sent. It fails for some reason, but is retryable.
  3. The AWS SDK then makes whatever request changes it needs for the retry. theHeaders still has the old traceparent value, so the AWS SDK now includes both the Foo and traceparent headers in its request signing. That would normally be fine; however, the order of operations is: the aws-sdk uses the old traceparent value for the request signature, then OTel changes the traceparent to a new value for this next request, then the request is sent. The server notices the signature mismatch.

This can be seen in this example:

const { NodeSDK, tracing } = require('@opentelemetry/sdk-node');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const sdk = new NodeSDK({
  serviceName: 'make-http-req',
  spanProcessor: new tracing.SimpleSpanProcessor(new tracing.ConsoleSpanExporter()),
  instrumentations: [
    new HttpInstrumentation(),
  ]
})
sdk.start();

const http = require('http');
const myHeaders = {
  Foo: 'Bar'
}
http.get('http://localhost:8200/', { headers: myHeaders }, (res) => {
  console.log('response: %s %s', res.statusCode, res.headers);
  res.resume();
  res.on('end', () => {
    console.log('response end');
  });
});
console.log('http.get() called: was myHeaders mutated? %s', myHeaders);

Running that shows that myHeaders is mutated:

% node make-http-req.js
http.get() called: was myHeaders mutated? {
  Foo: 'Bar',
  traceparent: '00-35f4aeff468b2067ae1d5dba82c63ed4-0afcfcdb409c2d13-01'
}
response: 200 {
  date: 'Sat, 25 Nov 2023 00:17:39 GMT',
  connection: 'close',
  'transfer-encoding': 'chunked'
}
{
  traceId: '35f4aeff468b2067ae1d5dba82c63ed4',
  parentId: undefined,
  traceState: undefined,
  name: 'GET',
  id: '0afcfcdb409c2d13',
  kind: 2,
  timestamp: 1700871459100000,
  duration: 11929.505,
  attributes: {
    'http.url': 'http://localhost:8200/',
    'http.method': 'GET',
    'http.target': '/',
    'net.peer.name': 'localhost',
    'http.host': 'localhost:8200',
    'net.peer.ip': '127.0.0.1',
    'net.peer.port': 8200,
    'http.status_code': 200,
    'http.status_text': 'OK',
    'http.flavor': '1.1',
    'net.transport': 'ip_tcp'
  },
  status: { code: 0 },
  events: [],
  links: []
}
response end

I got lucky here, because I hit this before in Elastic's APM agent. A longer write-up is here: https://github.com/elastic/apm-agent-nodejs/issues/2134

It makes sense why -- as noted in https://github.com/open-telemetry/opentelemetry-js/issues/3922 -- https://github.com/open-telemetry/opentelemetry-js/commit/49787433b66a17a1788a20f3a7edda3aa2580890 changed the behaviour: the header normalization from before that commit was indirectly creating a clone of the headers object.

Possible fix

An alternative to https://github.com/open-telemetry/opentelemetry-js-contrib/pull/1813 would be something like this -- again, as @ten-ans suggested above.

diff --git a/experimental/packages/opentelemetry-instrumentation-http/src/http.ts b/experimental/packages/opentelemetry-instrumentation-http/src/http.ts
index 9422bbc9..780f5c32 100644
--- a/experimental/packages/opentelemetry-instrumentation-http/src/http.ts
+++ b/experimental/packages/opentelemetry-instrumentation-http/src/http.ts
@@ -672,6 +672,10 @@ export class HttpInstrumentation extends InstrumentationBase<Http> {

       if (!optionsParsed.headers) {
         optionsParsed.headers = {};
+      } else {
+        // Make a copy of the headers object to avoid mutating an object the
+        // caller might have a reference to.
+        optionsParsed.headers = Object.assign({}, optionsParsed.headers);
       }
       propagation.inject(requestContext, optionsParsed.headers);

@martinkuba Thoughts? Should I make a PR for that?

trentm commented 9 months ago

Should I make a PR for that?

I've added a PR for the alternative fix here: https://github.com/open-telemetry/opentelemetry-js/pull/4346

trentm commented 9 months ago

To be clear, this should be fixed with the next release of @opentelemetry/instrumentation-http (any release after 0.45.1).