open-telemetry / opentelemetry-js-contrib

OpenTelemetry instrumentation for JavaScript modules
https://opentelemetry.io
Apache License 2.0

I'm expecting node aws lambda to export spans properly #647

Closed longility closed 3 years ago

longility commented 3 years ago

What version of OpenTelemetry are you using?

ih-otel-lambda-layer@1.0.0 /Users/longility/Repos/platform/ih-otel-lambda-layer
├── @opentelemetry/api@1.0.2
├── @opentelemetry/auto-instrumentations-node@0.24.0
├── @opentelemetry/exporter-collector-grpc@0.24.0
├── @opentelemetry/exporter-collector-proto@0.24.0
├── @opentelemetry/id-generator-aws-xray@0.24.0
├── @opentelemetry/instrumentation-aws-lambda@0.24.0
├── @opentelemetry/sdk-node@0.24.0
├── @tsconfig/node14@1.0.1
├── opentelemetry-instrumentation-aws-sdk@0.24.1
└── typescript@4.3.5

What version of Node are you using?

v14.17.4

What did you do?

Created a lambda layer, starting from this: https://github.com/open-telemetry/opentelemetry-js#instantiate-tracing

import * as opentelemetry from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { CollectorTraceExporter } from "@opentelemetry/exporter-collector-proto";
import { AwsInstrumentation } from "opentelemetry-instrumentation-aws-sdk";
import { AWSXRayIdGenerator } from "@opentelemetry/id-generator-aws-xray";
import { BatchSpanProcessor } from "@opentelemetry/tracing";
import { AwsLambdaInstrumentation } from "@opentelemetry/instrumentation-aws-lambda";
import { diag, DiagConsoleLogger } from "@opentelemetry/api";
import { getEnv } from "@opentelemetry/core";

// https://github.com/open-telemetry/opentelemetry-js-api/issues/102
diag.setLogger(new DiagConsoleLogger(), getEnv().OTEL_LOG_LEVEL);

const traceExporter = new CollectorTraceExporter();

const sdk = new opentelemetry.NodeSDK({
  traceExporter,
  instrumentations: [
    new AwsLambdaInstrumentation(),
    new AwsInstrumentation(),
    getNodeAutoInstrumentations(),
  ],
});

// https://aws-otel.github.io/docs/getting-started/js-sdk/trace-manual-instr#configure-global-tracer-for-aws-x-ray
sdk.configureTracerProvider(
  {
    idGenerator: new AWSXRayIdGenerator(),
  },
  new BatchSpanProcessor(traceExporter)
);

// initialize the SDK and register with the OpenTelemetry API
// this enables the API to record telemetry
sdk
  .start()
  .then(() => console.log("Tracing initialized"))
  .catch((error) => console.log("Error initializing tracing", error));

// gracefully shut down the SDK on process exit
process.on("SIGTERM", () => {
  sdk
    .shutdown()
    .then(() => console.log("Tracing terminated"))
    .catch((error) => console.log("Error terminating tracing", error))
    .finally(() => process.exit(0));
});

What did you expect to see?

I was expecting to see statusCode: 200 to indicate that the spans were exported correctly. I saw that in the debug log when I was running this as a server, but not as a Lambda function. Also, I am not seeing any associated traces. I am testing against the same collector.

What did you see instead?

Here is the actual log

@opentelemetry/api: Registered a global for diag v1.0.2.
--
Loading instrumentation for @opentelemetry/instrumentation-dns
Loading instrumentation for @opentelemetry/instrumentation-express
Loading instrumentation for @opentelemetry/instrumentation-graphql
Loading instrumentation for @opentelemetry/instrumentation-grpc
Loading instrumentation for @opentelemetry/instrumentation-http
Loading instrumentation for @opentelemetry/instrumentation-ioredis
Loading instrumentation for @opentelemetry/instrumentation-koa
Loading instrumentation for @opentelemetry/instrumentation-mongodb
Loading instrumentation for @opentelemetry/instrumentation-mysql
Loading instrumentation for @opentelemetry/instrumentation-pg
Loading instrumentation for @opentelemetry/instrumentation-redis
2021-08-27T21:10:45.953Z    undefined   DEBUG   @opentelemetry/instrumentation-http Applying patch for http@14.17.4
2021-08-27T21:10:45.968Z    undefined   DEBUG   @opentelemetry/instrumentation-http Applying patch for https@14.17.4
2021-08-27T21:10:46.611Z    undefined   DEBUG   Applying patch for express@4.17.1
2021-08-27T21:10:46.919Z    undefined   DEBUG   aws-sdk instrumentation: applying patch to aws-sdk v3 constructStack
2021-08-27T21:10:46.937Z    undefined   DEBUG   aws-sdk instrumentation: applying patch to aws-sdk v3 client send
2021-08-27T21:10:47.457Z    undefined   DEBUG   Applying patch for lambda handler
2021-08-27T21:10:47.457Z    undefined   DEBUG   patch handler function
2021-08-27T21:10:47.462Z    undefined   DEBUG   EnvDetector found resource. Resource {  attributes: {    'service.namespace': 'platform',    'service.name': 'playground-profile',    'service.version': '7c21b2955db42c81020bc11a693629cda5d95396'  }}
2021-08-27T21:10:47.462Z    undefined   DEBUG   ProcessDetector found resource. Resource {  attributes: {    'process.pid': 9,    'process.executable.name': '/var/lang/bin/node',    'process.command': '/var/runtime/index.js',    'process.command_line': '/var/lang/bin/node /var/runtime/index.js'  }}
2021-08-27T21:10:47.464Z    undefined   DEBUG   AwsEc2Detector failed: EC2 metadata api request timed out.
2021-08-27T21:10:47.465Z    undefined   DEBUG   GcpDetector failed: GCP Metadata unavailable.
2021-08-27T21:10:47.465Z    undefined   DEBUG   GcpDetector found resource. Resource { attributes: {} }
2021-08-27T21:10:47.465Z    undefined   TRACE   {  'service.name': 'playground-profile',  'service.namespace': 'platform',  'service.version': '7c21b2955db42c81020bc11a693629cda5d95396'}
2021-08-27T21:10:47.465Z    undefined   TRACE   {  'process.command': '/var/runtime/index.js',  'process.command_line': '/var/lang/bin/node /var/runtime/index.js',  'process.executable.name': '/var/lang/bin/node',  'process.pid': 9}
2021-08-27T21:10:47.467Z    undefined   DEBUG   @opentelemetry/api: Registered a global for trace v1.0.2.
2021-08-27T21:10:47.469Z    undefined   DEBUG   @opentelemetry/api: Registered a global for context v1.0.2.
2021-08-27T21:10:47.469Z    undefined   DEBUG   @opentelemetry/api: Registered a global for propagation v1.0.2.
2021-08-27T21:10:47.470Z    undefined   INFO    Tracing initialized
START RequestId: 08e980e0-95d4-4786-a349-244b666ede74 Version: $LATEST
END RequestId: 08e980e0-95d4-4786-a349-244b666ede74
REPORT RequestId: 08e980e0-95d4-4786-a349-244b666ede74  Duration: 397.18 ms Billed Duration: 398 ms Memory Size: 128 MB Max Memory Used: 128 MB Init Duration: 2099.94 ms   XRAY TraceId: 1-612954d5-516183ac74c9ce0079d6a7b1   SegmentId: 7b92173a4d75fd9e Sampled: true
START RequestId: 038f80ac-612a-4605-bed7-215f046c5881 Version: $LATEST
2021-08-27T21:10:49.470Z    038f80ac-612a-4605-bed7-215f046c5881    DEBUG   @opentelemetry/instrumentation-http %s instrumentation outgoingRequest https
2021-08-27T21:10:49.529Z    038f80ac-612a-4605-bed7-215f046c5881    DEBUG   @opentelemetry/instrumentation-http http.ClientRequest return request
2021-08-27T21:10:49.810Z    038f80ac-612a-4605-bed7-215f046c5881    DEBUG   @opentelemetry/instrumentation-http outgoingRequest on response()
2021-08-27T21:10:49.850Z    038f80ac-612a-4605-bed7-215f046c5881    DEBUG   @opentelemetry/instrumentation-http outgoingRequest on end()
END RequestId: 038f80ac-612a-4605-bed7-215f046c5881
REPORT RequestId: 038f80ac-612a-4605-bed7-215f046c5881  Duration: 2067.00 ms    Billed Duration: 2067 ms    Memory Size: 128 MB Max Memory Used: 128 MB XRAY TraceId: 1-612954d8-28807c1012c024c619439d9b   SegmentId: 3ef94af35d3ab4f9 Sampled: true

Additional context

Any ideas how to debug or understand what the issue is on why the spans are not exporting?

I was led here from: https://github.com/open-telemetry/opentelemetry-js/issues/1295#issuecomment-907476081 /cc @willarmiros @anuraaga

anuraaga commented 3 years ago

Hi @longility - how did you build the lambda layer? You didn't mention it so I wonder if you didn't check out opentelemetry-lambda.

https://github.com/open-telemetry/opentelemetry-lambda

Also, we have a customization to export to X-Ray here, if it provides more pointers:

https://github.com/aws-observability/aws-otel-lambda

longility commented 3 years ago

Hey @anuraaga thanks for getting back and commenting

how did you build the lambda layer?

In short, the lambda layer I created is built by using this as reference: https://github.com/open-telemetry/opentelemetry-lambda/tree/main/nodejs/packages/layer

I went through many attempts to get to where I am. My latest attempt has the lambda layer export to a container with a base image of amazon/aws-otel-collector, which in turn exports to X-Ray and Honeycomb. I am essentially decoupling the collector from the lambda function/layer to verify it and get it working in isolation. However, the spans are NOT being exported. Hence my conclusion on the issue.

I tested my collector by deploying the container into an ecs cluster and ran a local graphql server against that remote collector, and that works as expected. I see the expected spans such as graphQL and aws sdk spans from dynamodb.

Another reason I think the problem is spans not being exported, rather than the connection to the remote collector, is that when I use a bad OTEL endpoint it logs errors (I have logging enabled and can see that locally). In the Lambda I am not getting the error logs or the success logs. I also logged the endpoint to double-check that the Lambda is using the same endpoint I used locally, and it was the same.

AWS ADOT node lambda layer attempt

Followed this: https://aws-otel.github.io/docs/getting-started/lambda/lambda-js

Latest conclusion: I did not see AWS SDK or graphQL spans, so I think the spans are not being exported through the OTel instrumentation and OTel collector; the traces I do see in X-Ray are probably exported by the built-in X-Ray instrumentation, not OTel. That is why I decided to run my latest experiment. Basically, I think the ADOT node lambda layer is not working because the spans are not being exported, which is the issue I'm theorizing about.

Alternative attempt

I was wondering if x-ray is the issue so I made an attempt to export to honeycomb following this: https://aws-otel.github.io/docs/getting-started/lambda#custom-configuration-for-the-adot-collector-on-lambda

However, in my latest attempt I decoupled the collector to point to both Honeycomb and X-Ray, and they work as I expect, as mentioned earlier. So I have the collector side working, which leads me to conclude the problem is in the Lambda part of exporting spans.

My next debugging and exploring

On Monday, I think I will copy in the AWS Lambda instrumentation so I can add more logging and see what paths it takes, to understand how that instrumentation works. I'm new to all this, so I'm slowly having to unravel things to learn and debug. I am under the impression that the AWS Lambda instrumentation is responsible for force flushing the spans before the function terminates; that is why I'm going down this path to see where it ends up. Additionally, I may try short-circuiting a force flush to see if the spans reach the collector, to further identify what is and is not working.

@anuraaga do you have any other pointers? Not sure if you tried recently to test out ADOT node lambda layer to have good confidence that it should be working.

anuraaga commented 3 years ago

In short, the lambda layer I created is built by using this as reference

If possible we need the long, not the short ;) So it sounds like you are building a layer from scratch, without the collector. And you are using the same exec wrapper technique? What changes have you made to the code?

Is the collector running externally on ECS and the lambda exports to that?

AWS lambda instrumentation has an expected responsibility to force flush the spans before the function terminates.

The instrumentation force flushes so user code doesn't have to call it.

Not sure if you tried recently to test out ADOT node lambda layer to have good confidence that it should be working.

We run canary tests periodically which verify that spans for the AWS SDK are sent to X-Ray, so that is working for the public lambda layers. However, Honeycomb isn't yet supported on the public layer, which might be why you didn't see spans there. But if you saw problems with the X-Ray export, can you share the relevant parts of the lambda function definition?

longility commented 3 years ago

Yes, I’m using the same wrapper technique.

I think I found one of the issues. The BasicTracerProvider that gets created and the one the instrumentation depends on come from mismatched package copies, so the instanceof check is false and it does not force flush.

It is related to this refactor: https://github.com/open-telemetry/opentelemetry-js/pull/2340#issuecomment-887313662

I'm not sure what the fix should be, due to coupling that may not be straightforward for me, such as the coupling to the dependency versions the node lambda layer references.
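To illustrate the kind of check I mean (a rough sketch, not the actual instrumentation code): the force flush only happens when the provider passes an instanceof check against the BasicTracerProvider class the instrumentation itself imported, and that fails when the two come from different package copies.

import { BasicTracerProvider } from "@opentelemetry/tracing";

// false whenever the provider was built from a different copy of the package
function canForceFlush(tracerProvider: unknown): boolean {
  return tracerProvider instanceof BasicTracerProvider;
}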

anuraaga commented 3 years ago

@longility Are you calling the OpenTelemetry API in your function, or just need it auto-instrumented? I think if you just use the same versions as the lambda layer's package.json (notably using 0.24.0 of opentelemetry/node) it should work ok. If you use the API in your function, you'd need to make sure it's 1.0.1 to match the layer.

https://github.com/open-telemetry/opentelemetry-lambda/blob/main/nodejs/packages/layer/package.json#L29

longility commented 3 years ago

Yes, I’m using the API. I’ll double check that tomorrow as I try to get things working. Thanks!

I’m still new and challenged by so many packages! The debugging exercise is forcing me to learn.

longility commented 3 years ago

I'm going to take back my conclusion about a version mismatch. I think what I have in code is fine. I got confused jumping between what is latest in GitHub and what is on my local machine.

willarmiros commented 3 years ago

@longility one other question - are you enabling the AWS X-Ray propagator as described here: https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/propagators/opentelemetry-propagator-aws-xray/README.md?

longility commented 3 years ago

@longility one other question - are you enabling the AWS X-Ray propagator as described here: https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/propagators/opentelemetry-propagator-aws-xray/README.md?

I am not. I just double-checked, as I thought that was added by the AWS Lambda instrumentation. I was wrong; it just has a reference, but it doesn't enable it. Thanks!
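For reference, enabling it with the NodeSDK setup I posted at the top could look roughly like this (a sketch, assuming the @opentelemetry/propagator-aws-xray package and that NodeSDK accepts a textMapPropagator option; I haven't verified this exact snippet):

import * as opentelemetry from "@opentelemetry/sdk-node";
import { AWSXRayPropagator } from "@opentelemetry/propagator-aws-xray";

const sdk = new opentelemetry.NodeSDK({
  // ...traceExporter and instrumentations as configured earlier
  textMapPropagator: new AWSXRayPropagator(),
});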

longility commented 3 years ago

@anuraaga I found the issue; with a code change, I was able to get it working. It has to do with this line: https://github.com/open-telemetry/opentelemetry-js-contrib/blob/ee0debf9a1ba3889b480a36a8b5c7ce04d8f35aa/plugins/node/opentelemetry-instrumentation-aws-lambda/src/instrumentation.ts#L270

What I determined, at least for my case, is that setTracerProvider is passing in a ProxyTracerProvider.

So I created a helper method to get the BasicTracerProvider, like so:

private _getBasicTracerProvider() {
  // assumes imports at the top of the file, e.g.
  //   import { ProxyTracerProvider } from "@opentelemetry/api";
  //   import { BasicTracerProvider } from "@opentelemetry/tracing";
  let realTracerProvider = this._tracerProvider;
  if (this._tracerProvider instanceof ProxyTracerProvider) {
    // unwrap the proxy to get the real delegate provider
    realTracerProvider = this._tracerProvider.getDelegate();
  }

  return realTracerProvider instanceof BasicTracerProvider
    ? realTracerProvider
    : undefined;
}

Then the caller looks like this:

const basicTracerProvider = this._getBasicTracerProvider();
if (basicTracerProvider) {
  basicTracerProvider
    .getActiveSpanProcessor()
    .forceFlush()
    .then(
      () => callback(),
      () => callback()
    );
} else {
  callback();
}

If this makes sense, I can put in a PR for it.

anuraaga commented 3 years ago

Thanks for the dig @longility! @dyladan do you have any suggestion on how to invoke forceFlush from the instrumentation? Is there a better pattern than having to check instanceof for both BasicTracerProvider and ProxyTracerProvider?

Flarna commented 3 years ago

I think we should add forceFlush to the API and, as a result, avoid such unwanted dependencies between instrumentation and SDK. But I'm not sure if this is allowed by the spec.

Even with the above fix you could easily run into problems if someone implements their own TracerProvider that supports getActiveSpanProcessor() but is not a BasicTracerProvider.

longility commented 3 years ago

Should this problem be occurring for a subset of use cases?

I didn't customize much with regard to the tracer provider, and it uses the ProxyTracerProvider, which seems to be a common approach. However, I may be wrong. So what I'm wondering is: what is different? I am using an upgraded version. I haven't tried investigating further yet.

dyladan commented 3 years ago

Thanks for the dig @longility! @dyladan do you have any suggestion on how to invoke forceFlush from the instrumentation? Is there a better pattern than having to check instanceof for both BasicTracerProvider and ProxyTracerProvider?

No. I tried to add a force flush to the spec a while ago but it didn't get enough support.

longility commented 3 years ago

From my quick investigation with limited knowledge, it seems to have been a ProxyTracerProvider for at least the past year, if not from the beginning.

With that said, I recommend getting the fix in even if it isn't ideal. It seems critical. I'm not sure why it hasn't come up before; I don't have enough context and history to piece this together.

dyladan commented 3 years ago

Thanks for the dig @longility! @dyladan do you have any suggestion on how to invoke forceFlush from the instrumentation? Is there a better pattern than having to check instanceof for both BasicTracerProvider and ProxyTracerProvider?

I have 2 suggestions which would not require spec changes:

  1. Instead of instanceof you can just check for typeof === 'function' on the important methods
  2. Add a flush function configuration option, or a tracerprovider configuration option to the constructor

I know both of these aren't great solutions. I would far prefer to have a forceFlush function in the API, but that would require a spec change.

longility commented 3 years ago

@dyladan

Instead of instanceof you can just check for typeof === 'function' on the important methods

If the BasicTracerProvider has .getActiveSpanProcessor() and the ProxyTracerProvider does not, I'm not sure how that would work unless I'm mistaken on what you meant.

Add a flush function configuration option, or a tracerprovider configuration option to the constructor

Can you elaborate on this as I was not able to follow? I'm not sure whose constructor you are referring to. AwsLambdaInstrumentation's constructor, or something else?

dyladan commented 3 years ago

@dyladan

Instead of instanceof you can just check for typeof === 'function' on the important methods

If the BasicTracerProvider has .getActiveSpanProcessor() and the ProxyTracerProvider does not, I'm not sure how that would work unless I'm mistaken on what you meant.

I was suggesting you do something very similar to the code you linked above, but instead of using instanceof checks, which sometimes fail for unexpected reasons, you can check for functions. But you may decide you like instanceof checks better since they're more explicit.

private _getBasicTracerProvider() {
  let realTracerProvider = this._tracerProvider;
  if (typeof this._tracerProvider.getDelegate === 'function') {
    realTracerProvider = this._tracerProvider.getDelegate();
  }

  if (typeof realTracerProvider.getActiveSpanProcessor === 'function') {
    return realTracerProvider
  }

  return undefined;
}

Add a flush function configuration option, or a tracerprovider configuration option to the constructor

Can you elaborate on this as I was not able to follow? I'm not sure whose constructor you are referring to. AwsLambdaInstrumentation's constructor, or something else?

I was suggesting the AWS instrumentation config options be extended to have a flush option; when it is configured, you can pass a function which has the tracer provider in closure.

Something along the lines of this:

const tp = new BasicTracerProvider(opts);

tp.addSpanProcessor(spanProcessor);

const awsInstr = new AwsLambdaInstrumentation({
    // "flush" would be the proposed new config option, not an existing field
    flush: async () => {
        await tp.getActiveSpanProcessor().forceFlush();
    },
});

dyladan commented 3 years ago

Honestly, I would much rather have a force flush in the API as I've already said above.

Flarna commented 3 years ago

We could avoid binding instrumentations to the SDK by adding the above flush code to @opentelemetry/instrumentation. This would at least avoid re-implementing the above hacky stuff in several instrumentations.

longility commented 3 years ago

We could avoid the binding of instrumentations to SDK by adding above flush code in @opentelemetry/instrumentation. This would at least avoid to re-implement above hacky stuff in several instrumentations.

While I like the approach if there are many scenarios and we want to make it explicit and DRY, I'm wondering how many scenarios actually need a force flush. Force flush in itself seems not ideal, and it just so happens that Lambda may be the only special case that I can see. So unless there is more need, hide the nastiness in the instrumentation that needs it; in this case, the AWS Lambda instrumentation.

Flarna commented 3 years ago

I guess all FaaS environments would need this, not just AWS Lambda. And maybe browsers, in case the host PC goes into suspend, but in that case it's most likely not an instrumentation that handles it.

longility commented 3 years ago

@Flarna Thanks for adding details to my gap knowledge.

If that is the case and if I understand you correctly, then it would look something like this...

https://github.com/open-telemetry/opentelemetry-js/blob/main/packages/opentelemetry-instrumentation/src/types.ts

export interface Instrumentation {
  // ..
  // removed for brevity

  forceFlushSpans(): Promise<void>;
}

longility commented 3 years ago

As for implementation detail, I'm thinking we should force flush if available. If not available, then diag.error and resolve the promise, not reject. Something like that?

One can spin their wheels wondering why the spans are not exported, like I did.
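Something like this rough sketch (names are illustrative, not actual instrumentation code):

import { diag } from "@opentelemetry/api";

// force flush when the provider supports it; otherwise log and resolve so the
// Lambda callback is never blocked by a rejected promise
async function flushIfAvailable(tracerProvider: any): Promise<void> {
  const provider =
    typeof tracerProvider?.getDelegate === "function"
      ? tracerProvider.getDelegate()
      : tracerProvider;

  if (typeof provider?.getActiveSpanProcessor === "function") {
    try {
      await provider.getActiveSpanProcessor().forceFlush();
    } catch (e) {
      diag.error("Failed to force flush spans", e);
    }
  } else {
    diag.error("Tracer provider does not support force flush; spans may be dropped");
  }
}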

willarmiros commented 3 years ago

@longility have you taken a crack at a PR here? If not, I might try to implement @dyladan's simpler suggestion to just do type checking on the ProxyTracerProvider and extract the real TracerProvider if needed. While I see that the forceFlush issue might apply to other instrumentations later, I tend to agree with @longility that we can deal with it internally in the AWS Lambda instrumentation for now before adding an edge-case field to the Instrumentation API. We can always revert it and go with the long-term fix via an API change later.

I've also confirmed that this bug is what's causing tests to fail in #660, so I'd like to get a fix out ASAP and will probably make that PR tomorrow.

longility commented 3 years ago

@willarmiros I didn't know the direction so I didn't create a PR. I just copied in the code and made it work for us. I can send a quick PR to get things started.

tekkeon commented 2 years ago

I just read over this whole conversation because I am currently facing a similar issue: trying to instrument an Apollo GraphQL Server Lambda function and getting no exported spans. Based on what I've read here it seems like this issue should have been resolved, and I made sure I'm using the most up-to-date packages. I detailed what I've tried, with a link to a repo containing a demo, in this issue on Apollo Server's repo (which I realized after the fact is more likely an issue here rather than there): https://github.com/apollographql/apollo-server/issues/5923

Please let me know if I am doing something incorrectly! I can also open a new issue if it seems like this has a different root cause, just seemed pretty heavily related to this issue so figured I'd comment here first.

longility commented 2 years ago

@mkossoris I got your exact scenario working. What do you see in your lambda logs? You may need more logging for me to help.

tekkeon commented 2 years ago

@longility Interesting, I went ahead and re-deployed the repo to a new AWS account just to make sure I was starting from a clean slate. I am still not seeing the trace data in the logs.

Here is some of the output I see when I run the local server:

...
{
  traceId: '184db6230d94288669a438f8e2d7630d',
  parentId: '5f4eb030569e0c15',
  name: 'middleware - <anonymous>',
  id: '7d6d2d427b4aa566',
  kind: 0,
  timestamp: 1638219365725886,
  duration: 27,
  attributes: {
    'http.route': '/',
    'express.name': '<anonymous>',
    'express.type': 'middleware'
  },
  status: { code: 0 },
  events: []
}
{
  traceId: '184db6230d94288669a438f8e2d7630d',
  parentId: undefined,
  name: 'POST /graphql',
  id: '5f4eb030569e0c15',
  kind: 1,
  timestamp: 1638219365720265,
  duration: 12641,
  attributes: {
    'http.url': 'http://localhost:3000/graphql',
    'http.host': 'localhost:3000',
    'net.host.name': 'localhost',
    'http.method': 'POST',
    'http.target': '/graphql',
    'http.request_content_length_uncompressed': 1811,
    'http.flavor': '1.1',
    'net.transport': 'ip_tcp',
    'net.host.ip': '::1',
    'net.host.port': 3000,
    'net.peer.ip': '::1',
    'net.peer.port': 56183,
    'http.status_code': 200,
    'http.status_text': 'OK',
    'http.route': ''
  },
  status: { code: 1 },
  events: []
}

As you can see, the trace data shows up properly when running it locally. And here is the output I see in the CloudWatch logs when running it on a deployed Lambda:

2021-11-29T12:53:18.533-08:00 | START RequestId: 67ead200-83ad-4745-8e57-b6e4b34ef276 Version: $LATEST
2021-11-29T12:53:18.544-08:00 | END RequestId: 67ead200-83ad-4745-8e57-b6e4b34ef276
2021-11-29T12:53:18.544-08:00 | REPORT RequestId: 67ead200-83ad-4745-8e57-b6e4b34ef276 Duration: 7.06 ms Billed Duration: 8 ms Memory Size: 1500 MB Max Memory Used: 108 MB

So none of the trace data seems to be exporting when running within the Lambda. What logs are you seeing when running it on Lambda?

longility commented 2 years ago

Can you enable otel logging so we have a better idea?

import { diag, DiagConsoleLogger } from "@opentelemetry/api";
import { getEnv } from "@opentelemetry/core";

diag.setLogger(new DiagConsoleLogger(), getEnv().OTEL_LOG_LEVEL);

env var to enable

OTEL_LOG_LEVEL=all

longility commented 2 years ago

Actually, I forgot: we created our own lambda layer and did not rely on what AWS provided. Not sure if you went that route.

willarmiros commented 2 years ago

@mkossoris do you have active tracing enabled on your Lambda? If not, have you manually added X-Ray permissions to your function role? Also, are you setting a different endpoint for the OTel collector via env var?

tekkeon commented 2 years ago

@longility Here is the result of the adding the otel logging:

2021-11-29T16:36:32.859-08:00 | 2021-11-30T00:36:32.858Z undefined DEBUG @opentelemetry/api: Registered a global for diag v1.0.3.
  | 2021-11-29T16:36:32.864-08:00 | 2021-11-30T00:36:32.863Z undefined DEBUG @opentelemetry/api: Registered a global for trace v1.0.3.
  | 2021-11-29T16:36:32.864-08:00 | 2021-11-30T00:36:32.864Z undefined DEBUG @opentelemetry/api: Registered a global for context v1.0.3.
  | 2021-11-29T16:36:32.864-08:00 | 2021-11-30T00:36:32.864Z undefined DEBUG @opentelemetry/api: Registered a global for propagation v1.0.3.
  | 2021-11-29T16:36:32.870-08:00 | 2021-11-30T00:36:32.869Z undefined DEBUG @opentelemetry/instrumentation-http Applying patch for http@14.18.1
  | 2021-11-29T16:36:32.903-08:00 | 2021-11-30T00:36:32.903Z undefined DEBUG @opentelemetry/instrumentation-http Applying patch for https@14.18.1
  | 2021-11-29T16:36:33.011-08:00 | START RequestId: 66d6596b-56ae-4187-82eb-bf7d6860cc57 Version: $LATEST
  | 2021-11-29T16:36:33.027-08:00 | END RequestId: 66d6596b-56ae-4187-82eb-bf7d6860cc57
  | 2021-11-29T16:36:33.027-08:00 | REPORT RequestId: 66d6596b-56ae-4187-82eb-bf7d6860cc57 Duration: 15.51 ms Billed Duration: 16 ms Memory Size: 1500 MB Max Memory Used: 94 MB Init Duration: 493.64

The debug logs are only displayed on startup - further requests only show the START, END, REPORT, and Billed Duration logs from Lambda.

@willarmiros and @longility For this particular test, I'm trying to remove variables by just registering the instrumentation myself in the Lambda function and using the ConsoleExporter - not the CollectorExporter and not using the Lambda extension. I have also tried it with the Lambda extension and had no luck with that.
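Roughly, that setup is along these lines (a simplified sketch using the 0.2x-era package names from this thread; the full code is in my demo repo):

import { NodeTracerProvider } from "@opentelemetry/node";
import { SimpleSpanProcessor, ConsoleSpanExporter } from "@opentelemetry/tracing";
import { registerInstrumentations } from "@opentelemetry/instrumentation";
import { GraphQLInstrumentation } from "@opentelemetry/instrumentation-graphql";

const provider = new NodeTracerProvider();
// print finished spans to the console instead of sending them to a collector
provider.addSpanProcessor(new SimpleSpanProcessor(new ConsoleSpanExporter()));
provider.register();

registerInstrumentations({
  tracerProvider: provider,
  instrumentations: [new GraphQLInstrumentation()],
});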

tekkeon commented 2 years ago

Here are some logs from running it just with the Lambda Layer:

  | 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.842Z info builder/receivers_builder.go:68 Receiver is starting... {"kind": "receiver", "name": "otlp"}
  | 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.843Z info otlpreceiver/otlp.go:68 Starting GRPC server on endpoint 0.0.0.0:4317 {"kind": "receiver", "name": "otlp"}
  | 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.843Z info otlpreceiver/otlp.go:86 Starting HTTP server on endpoint 0.0.0.0:4318 {"kind": "receiver", "name": "otlp"}
  | 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.843Z info otlpreceiver/otlp.go:141 Setting up a second HTTP listener on legacy endpoint 0.0.0.0:55681 {"kind": "receiver", "name": "otlp"}
  | 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.843Z info otlpreceiver/otlp.go:86 Starting HTTP server on endpoint 0.0.0.0:55681 {"kind": "receiver", "name": "otlp"}
  | 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.843Z info builder/receivers_builder.go:73 Receiver started. {"kind": "receiver", "name": "otlp"}
  | 2021-11-29T16:46:02.845-08:00 | 2021-11-30T00:46:02.843Z info service/telemetry.go:92 Setting up own telemetry...
  | 2021-11-29T16:46:02.845-08:00 | 2021-11-30T00:46:02.844Z info service/telemetry.go:116 Serving Prometheus metrics {"address": ":8888", "level": "basic", "service.instance.id": "d40872f8-9119-4557-900e-986eb5c125f8", "service.version": "latest"}
  | 2021-11-29T16:46:02.845-08:00 | 2021-11-30T00:46:02.844Z info service/collector.go:230 Starting otelcol... {"Version": "v0.1.0", "NumCPU": 2}
  | 2021-11-29T16:46:02.848-08:00 | 2021-11-30T00:46:02.844Z info service/collector.go:132 Everything is ready. Begin running and processing data.
  | 2021-11-29T16:46:02.848-08:00 | 2021/11/30 00:46:02 Registered extension ID: "e1e6e8a0-c361-4cad-a4a6-d9040d111dd3"
  | 2021-11-29T16:46:02.848-08:00 | 2021/11/30 00:46:02 [collector] Register response: {
  | 2021-11-29T16:46:02.848-08:00 | "functionName": "<FUNCTION_NAME>",
  | 2021-11-29T16:46:02.848-08:00 | "functionVersion": "$LATEST",
  | 2021-11-29T16:46:02.848-08:00 | "handler": "index.handler"
  | 2021-11-29T16:46:02.848-08:00 | }
  | 2021-11-29T16:46:03.297-08:00 | 2021/11/30 00:46:02 [collector] Waiting for event...
  | 2021-11-29T16:46:03.597-08:00 | Registering OpenTelemetry
  | 2021-11-29T16:46:03.597-08:00 | EXTENSION Name: collector State: Ready Events: [INVOKE,SHUTDOWN]
  | 2021-11-29T16:46:03.599-08:00 | 2021/11/30 00:46:03 [collector] Received event: {
  | 2021-11-29T16:46:03.599-08:00 | "eventType": "INVOKE",
  | 2021-11-29T16:46:03.599-08:00 | "deadlineMs": 1638233166598,
  | 2021-11-29T16:46:03.599-08:00 | "requestId": "3bfedfe8-5f5f-45a6-849d-71b540a42896",
  | 2021-11-29T16:46:03.599-08:00 | "invokedFunctionArn": "<FUNCTION_ARN>",
  | 2021-11-29T16:46:03.599-08:00 | "tracing": {
  | 2021-11-29T16:46:03.599-08:00 | "type": "X-Amzn-Trace-Id",
  | 2021-11-29T16:46:03.599-08:00 | "value": "Root=1-61a5744a-6644bc3f2571b9bb356781b7;Parent=5e8721674fab873d;Sampled=1"
  | 2021-11-29T16:46:03.599-08:00 | }
  | 2021-11-29T16:46:03.599-08:00 | }
  | 2021-11-29T16:46:03.599-08:00 | 2021/11/30 00:46:03 [collector] Waiting for event...
  | 2021-11-29T16:46:03.886-08:00 | END RequestId: 3bfedfe8-5f5f-45a6-849d-71b540a42896
  | 2021-11-29T16:46:03.886-08:00 | REPORT RequestId: 3bfedfe8-5f5f-45a6-849d-71b540a42896 Duration: 287.92 ms Billed Duration: 288 ms Memory Size: 1500 MB Max Memory Used: 148 MB Init Duration: 941.74 ms XRAY TraceId: 1-61a5744a-6644bc3f2571b9bb356781b7 SegmentId: 5e8721674fab873d Sampled: true
  | 2021-11-29T16:46:03.927-08:00 | START RequestId: bc1f6210-f4f9-4744-9034-4fc2797e08df Version: $LATEST
  | 2021-11-29T16:46:03.930-08:00 | 2021/11/30 00:46:03 [collector] Received event: {
  | 2021-11-29T16:46:03.930-08:00 | "eventType": "INVOKE",
  | 2021-11-29T16:46:03.930-08:00 | "deadlineMs": 1638233166929,
  | 2021-11-29T16:46:03.930-08:00 | "requestId": "bc1f6210-f4f9-4744-9034-4fc2797e08df",
  | 2021-11-29T16:46:03.930-08:00 | "invokedFunctionArn": "<FUNCTION_ARN>",
  | 2021-11-29T16:46:03.930-08:00 | "tracing": {
  | 2021-11-29T16:46:03.930-08:00 | "type": "X-Amzn-Trace-Id",
  | 2021-11-29T16:46:03.930-08:00 | "value": "Root=1-61a5744b-724089b277f6b495264d64d9;Parent=7c56ed2b60cefdf2;Sampled=1"
  | 2021-11-29T16:46:03.930-08:00 | }

With those logs, we can see that the Lambda layer is clearly running, but it is only collecting traces about the Lambda invocation and not the GraphQL spans. And this trace data does show in XRay so it isn't a permissions issue between the Lambda function and XRay.

NathanielRN commented 2 years ago

Hi @mkossoris

I'm trying to remove variables by just registering the instrumentation myself in the Lambda function and using the ConsoleExporter

I think this is a great way to debug! I've personally found it useful to confirm that you can see the traces in the Lambda logs first before you expect them to arrive at the Collector and then to the AWS X-Ray console.

Could you share the code you use to initialize OTel JS? Particularly, do you register the instrumentation for GraphQL? I can see from your debug logs above that your OTel JS seems to be instrumenting the http and https packages:

  | 2021-11-29T16:36:32.870-08:00 | 2021-11-30T00:36:32.869Z undefined DEBUG @opentelemetry/instrumentation-http Applying patch for http@14.18.1
  | 2021-11-29T16:36:32.903-08:00 | 2021-11-30T00:36:32.903Z undefined DEBUG @opentelemetry/instrumentation-http Applying patch for https@14.18.1

Maybe you could try making a request with the http library and seeing if you still do not get any traces? Something like

http.get('http://aws.amazon.com', () => {});

Also, not sure if you are aware, but we have documentation for using the published ADOT Lambda for NodeJS Lambda Layer (i.e. arn:aws:lambda:<region>:901920570463:layer:aws-otel-nodejs-ver-1-0-0:1) which shows how to set up auto instrumentation, if you wanted to compare that against your configuration.

tekkeon commented 2 years ago

Hey @NathanielRN! I've created a repo with the code I've been using here: https://github.com/mkossoris/apollo-server-lambda-otel-issue. It does register the GraphQL instrumentation here: https://github.com/mkossoris/apollo-server-lambda-otel-issue/blob/master/src/open-telemetry.ts#L20.

I'll definitely try adding the http call just to see what happens. As for the ADOT Lambda configuration I tried, you can see the CDK code for that here: https://github.com/mkossoris/apollo-server-lambda-otel-issue/blob/master/cdk/lib/cdk-stack.ts#L10

longility commented 2 years ago

@mkossoris I couldn't make too much sense of the log.

However, looking at your code, know that opentelemetry-js does monkey-patching based on actual package names, so if you use esbuild through the NodejsFunction construct, it will minify the bundle and not be able to instrument properly.

Secondly, I actually have not gotten the OTel bootstrapping to work within the lambda function, only in the lambda layer extension.

Thirdly, tracing needs to be enabled, as mentioned by @willarmiros, even if you aren't using X-Ray.

tekkeon commented 2 years ago

@longility I was just coming back here to say I found out it was because of the NodejsFunction construct, after trying it with a standard Lambda function instead... facepalm.

That being said, the extension still did not work entirely by itself as far as pushing traces to X-Ray goes. I was, however, able to get it to work by importing the open-telemetry file from my example and replacing the ConsoleExporter with the CollectorTraceExporter, which pushes the traces to the Lambda extension's collector port, and that did work.

Progress at least!

tekkeon commented 2 years ago

So then that leaves me with an important question: is there no way for us to bundle our code and use OpenTelemetry's instrumentations? With Lambda size limits and how large the node_modules dir can be, bundling is going to be pretty essential for us.

tekkeon commented 2 years ago

I see there's this issue that discusses OpenTelemetry's issues with esbuild: https://github.com/aws-observability/aws-otel-lambda/issues/99

At the end, the creator of the ticket says:

I will close the ticket unless you think of a different workaround.

Which sort of implies there was some workaround? Or is there no known workaround for using esbuild or webpack with OTel instrumentations at this time?

The only thing I can think of off the top of my head is to mark the OTel instrumentations and their related packages (e.g. @opentelemetry/instrumentation-graphql and graphql) as externals and upload the external packages unbundled in node_modules or in a layer. Haven't tested if this works yet though.
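Something like this with esbuild's JS API (untested, and the wildcard external patterns are just a guess at what would be needed):

import { build } from "esbuild";

build({
  entryPoints: ["src/server.ts"],
  bundle: true,
  platform: "node",
  outfile: "dist/index.js",
  // keep OTel and the packages it monkey-patches out of the bundle so they
  // are loaded from node_modules (or a layer) with their real names intact
  external: ["@opentelemetry/*", "graphql"],
}).catch(() => process.exit(1));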

longility commented 2 years ago

As far as I know, there is no workaround for the minify problem with the OTel JS monkey-patching at this time. You would have to take the node_modules size hit.

Since you are dealing with GraphQL and Lambda, I assume you would compose with other lambda functions via Apollo Federation. Are there concerns for you about latency for the end user? I'm asking to think long term about whether Lambda is the right solution, as we have been asking these types of questions on our end.

tekkeon commented 2 years ago

@longility Yep that's correct, we're intending to follow a similar design to what IMDb is using, which they posted an article about here: https://aws.amazon.com/blogs/compute/building-federated-graphql-on-aws-lambda/. It uses a Lambda to host the gateway and one Lambda per subgraph.

tekkeon commented 2 years ago

@longility So this is a pretty hacky initial attempt, but it surprisingly works lol:

    const oTelLambdaLayer = lambda.LayerVersion.fromLayerVersionArn(
      this,
      "oTelLambdaLayer",
      "arn:aws:lambda:us-west-2:901920570463:layer:aws-otel-nodejs-ver-1-0-0:1"
    );

    const server = new NodejsFunction(this, "ServerFunction", {
      entry: "../src/server.ts",
      handler: "handler",
      memorySize: 1500,
      environment: {
        OTEL_LOG_LEVEL: "all",
        AWS_LAMBDA_EXEC_WRAPPER: "/opt/otel-handler",
      },
      layers: [oTelLambdaLayer],
      tracing: lambda.Tracing.ACTIVE,
      bundling: {
        commandHooks: {
          afterBundling(inputDir, outputDir) {
            // Manually copy OpenTelemetry packages and OTel-instrumented packages to output dir
            return [
              `cp ../build/open-telemetry.js ${outputDir}`,
              `mkdir ${outputDir}/node_modules`,
              `cp -r ../node_modules/@opentelemetry ${outputDir}/node_modules/@opentelemetry`,
              `cp -r ../node_modules/http ${outputDir}/node_modules/`,
              `cp -r ../node_modules/graphql ${outputDir}/node_modules/`
            ]
          },
          beforeBundling() {
            return []
          },
          beforeInstall() {
            return[]
          }
        },
        keepNames: true,
        // Mark OpenTelemetry package and OTel-instrumented packages as external so they aren't bundled
        externalModules: [
          '@opentelemetry/instrumentation-http',
          '@opentelemetry/instrumentation-express',
          '@opentelemetry/instrumentation',
          '@opentelemetry/instrumentation-graphql',
          '@opentelemetry/instrumentation-aws-lambda',
          '@opentelemetry/api',
          '@opentelemetry/core',
          '@opentelemetry/resources',
          '@opentelemetry/tracing',
          '@opentelemetry/node',
          '@opentelemetry/exporter-collector',
          'http',
          'graphql'
        ]
      },
    });

Here's the resulting trace in X-Ray:

[screenshot: the resulting trace in X-Ray]

And the resulting Lambda code asset is 6.9MB vs 13MB if I uploaded the entire thing, and the difference would likely get quite a bit larger as more dependencies are added. So it might be a little hacky, but it might not be too terrible of a hack to use...

longility commented 2 years ago

Glad you're making progress.

tekkeon commented 2 years ago

Just to update, I believe the layer actually did start working once I removed esbuild. The CloudWatch logs were showing the Lambda invocations from the Collector so I was under the impression it would show the other trace data as well. But in fact the trace data does appear to be showing up in XRay.

mattfysh commented 2 years ago

Hey all, I found a couple of solutions to the esbuild issues:

  1. Immutable (cannot redefine 'handler') - make sure your entry file is CommonJS and use module.exports = { handler } (see the entry-file sketch at the end of this comment)
  2. The second issue is a little tricky, with the exporters expecting to find protobuf files in node_modules

Here is how I solved it, within the build script:

esbuild src/index.js --bundle --minify \
  --platform=node \
  --outfile=$OUTPUT_DIR/index.js

esbuild src/tracing.js --bundle --minify \
  --platform=node \
  --outfile=$OUTPUT_DIR/tracing/build/index.js

cp -r node_modules/@opentelemetry/exporter-trace-otlp-grpc/build/protos \
  $OUTPUT_DIR/tracing/
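For the first point, the entry file can be as small as this (a sketch; the "./app" path and handler name are illustrative):

// CommonJS entry so the Lambda wrapper can redefine the exported handler
const { handler } = require("./app");

module.exports = { handler };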