Closed: longility closed this issue 3 years ago
Hi @longility - how did you build the lambda layer? You didn't mention it so I wonder if you didn't check out opentelemetry-lambda.
https://github.com/open-telemetry/opentelemetry-lambda
Also, we have a customization to export to X-Ray here, if it provides more pointers.
Hey @anuraaga thanks for getting back and commenting
how did you build the lambda layer?
In short, the lambda layer I created is built by using this as reference: https://github.com/open-telemetry/opentelemetry-lambda/tree/main/nodejs/packages/layer
I went through many attempts to get to where I am. My latest attempt is to have the lambda layer export to a container with a base image of amazon/aws-otel-collector that exports to X-Ray and Honeycomb. I am essentially decoupling the collector from the lambda function/layer so I can verify each piece in isolation. However, the spans are NOT being exported; hence my conclusion on the issue.
I tested my collector by deploying the container into an ecs cluster and ran a local graphql server against that remote collector, and that works as expected. I see the expected spans such as graphQL and aws sdk spans from dynamodb.
Another reason I think the problem is the spans not being exported, rather than the connection to the remote collector: when I configure a bad OTEL endpoint, errors are logged (I enabled logging and saw this locally). In the lambda I receive neither the error logs nor the success logs. I also logged the endpoint to double check that the lambda is using the same endpoint I used locally, and it was the same.
Followed this: https://aws-otel.github.io/docs/getting-started/lambda/lambda-js
Latest conclusion: I did not see AWS SDK or graphQL spans, so I think the spans are not being exported through the OTel instrumentation and OTel collector; the traces I do see in X-Ray are probably exported through some built-in X-Ray instrumentation, not OTel. That is why I decided to run my latest experiment. Basically, I'm thinking the ADOT node lambda layer is not working because the spans are not being exported, per the issue I'm theorizing.
I was wondering if x-ray is the issue so I made an attempt to export to honeycomb following this: https://aws-otel.github.io/docs/getting-started/lambda#custom-configuration-for-the-adot-collector-on-lambda
However, in my latest attempt I decoupled the collector and pointed it at both Honeycomb and X-Ray, and both work as I expect, as mentioned earlier. So the collector side is working, which leads me to conclude the problem is on the lambda side of exporting spans.
On Monday, I think I will copy in the AWS lambda instrumentation so I can add more logging and see which paths it takes, to understand how that instrumentation works. I'm new to all this, so I'm slowly having to unravel things to learn and debug. I am under the impression that the AWS lambda instrumentation is responsible for force flushing the spans before the function terminates. That is why I'm going down this path to see where it ends up. Additionally, I may attempt to short circuit and force flush myself, to see if the spans reach the collector and further identify what is and isn't working.
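That short-circuit idea could be sketched as a wrapper. Note this is only an illustration: the names withFlush and FlushingProvider are hypothetical, and the getActiveSpanProcessor() shape matches the 0.24-era BasicTracerProvider.

```typescript
// Minimal provider shape we rely on (matches BasicTracerProvider in 0.24).
interface FlushingProvider {
  getActiveSpanProcessor(): { forceFlush(): Promise<void> };
}

// Wrap a handler so pending spans are flushed before the lambda
// environment freezes, mimicking what the instrumentation should do.
function withFlush<E, R>(
  provider: FlushingProvider,
  realHandler: (event: E) => Promise<R>
): (event: E) => Promise<R> {
  return async (event: E) => {
    try {
      return await realHandler(event);
    } finally {
      // Flush even if the handler threw, then re-raise the result.
      await provider.getActiveSpanProcessor().forceFlush();
    }
  };
}
```

If spans show up at the collector with a wrapper like this but not without it, that would point squarely at the instrumentation's flush path.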
@anuraaga do you have any other pointers? Not sure if you tried recently to test out ADOT node lambda layer to have good confidence that it should be working.
In short, the lambda layer I created is built by using this as reference
If possible we need the long, not the short ;) So it sounds like you are building a layer from scratch, without the collector. And you are using the same exec wrapper technique? What changes have you made to the code?
Is the collector running externally on ECS and the lambda exports to that?
AWS lambda instrumentation has an expected responsibility to force flush the spans before the function terminates.
The instrumentation force flushes so user code doesn't have to call it.
Not sure if you tried recently to test out ADOT node lambda layer to have good confidence that it should be working.
We run canary tests periodically which verify that spans for aws SDK are sent to xray. So that is working for the public lambda layers. However honeycomb isn't yet supported on the public layer which might be why you didn't see spans there? But if you saw problems with the x-ray export can you share the relevant parts of the lambda function definition?
Yes, I’m using the same wrapper technique.
I think I found one of the issues. The BasicTracerProvider that is created and the copy of the dependency inside the instrumentation have a package mismatch, so when we check instanceof, it is false and the spans are not force flushed.
It is related to this refactor: https://github.com/open-telemetry/opentelemetry-js/pull/2340#issuecomment-887313662
I'm not sure what the fix should be, due to coupling that may not be straightforward for me, such as the coupling to the dependency versions the node lambda layer is referencing.
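The mismatch is easy to reproduce in isolation. The sketch below uses stand-in classes (not the real OTel ones) to show why instanceof fails when two copies of the "same" package are loaded:

```typescript
// Stand-ins for two separately-resolved copies of @opentelemetry/tracing:
// the one bundled with the instrumentation, and the one the layer/function uses.
class BasicTracerProviderFromInstrumentation {}
class BasicTracerProviderFromLayer {}

const provider = new BasicTracerProviderFromLayer();

// The instrumentation checks against *its* copy of the class, so the
// check fails even though the object is functionally the same type.
console.log(provider instanceof BasicTracerProviderFromInstrumentation); // false
console.log(provider instanceof BasicTracerProviderFromLayer); // true
```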
@longility Are you calling the OpenTelemetry API in your function, or just need it auto-instrumented? I think if you just use the same versions as the lambda layer's package.json (notably using 0.24.0 of opentelemetry/node) it should work ok. If you use the API in your function, you'd need to make sure it's 1.0.1 to match the layer.
Yes, I’m using the API. I’ll double check that tomorrow as I try to get things working. Thanks!
I’m still new and challenged by so many packages! The debugging exercise is forcing me to learn.
I'm going to take back my conclusion about a version mismatch. I think what I have in code is fine; I got confused jumping between what is latest on GitHub and what is on my local machine.
@longility one other question - are you enabling the AWS X-Ray propagator as described here: https://github.com/open-telemetry/opentelemetry-js-contrib/blob/main/propagators/opentelemetry-propagator-aws-xray/README.md?
I am not. I just double checked as I thought that was added by the aws lambda instrumentation. I was wrong. It just has a reference but it doesn't enable it. Thanks!
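For reference, enabling it looks roughly like this (a sketch based on that README; the rest of the provider setup is omitted):

```typescript
import { NodeTracerProvider } from '@opentelemetry/node';
import { AWSXRayPropagator } from '@opentelemetry/propagator-aws-xray';

const provider = new NodeTracerProvider();

// Register the X-Ray propagator globally so trace context is carried in
// the X-Amzn-Trace-Id header instead of the default W3C traceparent.
provider.register({
  propagator: new AWSXRayPropagator(),
});
```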
@anuraaga I found the issue; with a code change, I was able to get it working. It has to do with this line: https://github.com/open-telemetry/opentelemetry-js-contrib/blob/ee0debf9a1ba3889b480a36a8b5c7ce04d8f35aa/plugins/node/opentelemetry-instrumentation-aws-lambda/src/instrumentation.ts#L270
What I determined, at least for my case, is that setTracerProvider is being passed a ProxyTracerProvider, so I created a helper method to get the BasicTracerProvider, like so:
private _getBasicTracerProvider() {
  let realTracerProvider = this._tracerProvider;
  if (this._tracerProvider instanceof ProxyTracerProvider) {
    realTracerProvider = this._tracerProvider.getDelegate();
  }
  return realTracerProvider instanceof BasicTracerProvider
    ? realTracerProvider
    : undefined;
}
Then the caller looks like this:
const basicTracerProvider = this._getBasicTracerProvider();
if (basicTracerProvider) {
  basicTracerProvider
    .getActiveSpanProcessor()
    .forceFlush()
    .then(
      () => callback(),
      () => callback()
    );
} else {
  callback();
}
If this makes sense, I can put in a PR for it.
Thanks for the dig @longility! @dyladan do you have any suggestion on how to invoke forceFlush from the instrumentation? Is there a better pattern than having to check instanceof for both BasicTracerProvider and ProxyTracerProvider?
I think we should add forceFlush to the API and, as a result, avoid such unwanted dependencies between instrumentation and SDK. But I'm not sure if this is allowed by the spec.
Even with the above fix you could easily end up with problems if someone implements their own TracerProvider that supports getActiveSpanProcessor() without being a BasicTracerProvider.
Should this problem only be occurring for a subset of use cases? I didn't customize much with regards to the tracer provider, and it uses the ProxyTracerProvider, which seems to be a common approach. However, I may be wrong. So what I'm wondering is: what is different? I am using an upgraded version. I haven't investigated further yet.
Thanks for the dig @longility! @dyladan do you have any suggestion on how to invoke forceFlush from the instrumentation? Is there a better pattern than having to check instanceof for both BasicTracerProvider and ProxyTracerProvider?
No. I tried to add a force flush to the spec a while ago but it didn't get enough support.
From my quick investigation with limited knowledge, it seems to have been ProxyTracerProvider for at least the past year, if not from the beginning.
With that said, I'd probably recommend getting the fix in even if it isn't ideal. It seems critical. Not sure why it hasn't come up before; I don't have enough context and history to piece this together.
Thanks for the dig @longility! @dyladan do you have any suggestion on how to invoke forceFlush from the instrumentation? Is there a better pattern than having to check instanceof for both BasicTracerProvider and ProxyTracerProvider?
I have 2 suggestions which would not require spec changes:
1. Instead of instanceof you can just check for typeof === 'function' on the important methods
2. Add a flush function configuration option, or a tracerprovider configuration option to the constructor
I know both of these aren't great solutions. I would far prefer to have a forceFlush function in the API, but that would require a spec change.
@dyladan
Instead of instanceof you can just check for typeof === 'function' on the important methods
If the BasicTracerProvider has .getActiveSpanProcessor() and the ProxyTracerProvider does not, I'm not sure how that would work, unless I'm mistaken on what you meant.
Add a flush function configuration option, or a tracerprovider configuration option to the constructor
Can you elaborate on this? I was not able to follow; I'm not sure whose constructor you are referring to. AwsLambdaInstrumentation's constructor, or something else?
@dyladan
Instead of instanceof you can just check for typeof === 'function' on the important methods
If the BasicTracerProvider has .getActiveSpanProcessor() and the ProxyTracerProvider does not, I'm not sure how that would work unless I'm mistaken on what you meant.
I was suggesting you do something very similar to the code you linked above, but instead of using instanceof checks, which sometimes fail for unexpected reasons, you can check for functions. But you may decide you like instanceof checks better since they're more explicit.
private _getBasicTracerProvider() {
  let realTracerProvider = this._tracerProvider;
  if (typeof this._tracerProvider.getDelegate === 'function') {
    realTracerProvider = this._tracerProvider.getDelegate();
  }
  if (typeof realTracerProvider.getActiveSpanProcessor === 'function') {
    return realTracerProvider;
  }
  return undefined;
}
Add a flush function configuration option, or a tracerprovider configuration option to the constructor
Can you elaborate on this? I was not able to follow; I'm not sure whose constructor you are referring to. AwsLambdaInstrumentation's constructor, or something else?
I was suggesting the aws instrumentation config options be extended to have a flush option, then when it is configured you can pass a function which has the tracerprovider in closure.
Something along the lines of this:
const tp = new BasicTracerProvider(opts);
tp.addSpanProcessor(spanProcessor);

const awsInstr = new AwsLambdaInstrumentation({
  flush: async () => {
    await tp.getActiveSpanProcessor().forceFlush();
  },
});
Honestly, I would much rather have a force flush in the API as I've already said above.
We could avoid the binding of instrumentations to the SDK by adding the above flush code in @opentelemetry/instrumentation. This would at least avoid re-implementing the above hacky stuff in several instrumentations.
We could avoid the binding of instrumentations to the SDK by adding the above flush code in @opentelemetry/instrumentation. This would at least avoid re-implementing the above hacky stuff in several instrumentations.
While I like that approach if there are many scenarios and we want to make it explicit and DRY, I'm wondering how many scenarios actually need a force flush. Force flush in itself seems not ideal, and lambda may be the only special case that I can see. So unless there is more need, hide the nastiness in the instrumentation that needs it; in this case, the AWS lambda instrumentation.
I guess all FaaS environments would need this, not just AWS Lambda. And maybe browsers, in case the host PC goes into suspend, but in that case most likely no instrumentation handles it.
@Flarna Thanks for filling in my knowledge gaps.
If that is the case and if I understand you correctly, then it would look something like this...
export interface Instrumentation {
  // ..
  // removed for brevity
  forceFlushSpans(): Promise<void>;
}
As for implementation detail, I'm thinking we should force flush if available; if not available, then diag.error and resolve the promise, not reject. Something like that?
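A minimal sketch of that behavior (forceFlushSpans is the proposed, not-yet-existing hook; console.error stands in for diag.error so the snippet is self-contained):

```typescript
// Minimal shape of a provider that may be able to flush. This is a
// hypothetical sketch of the proposal, not an existing OpenTelemetry API.
interface MaybeFlushableProvider {
  getActiveSpanProcessor?: () => { forceFlush(): Promise<void> };
}

function forceFlushSpans(provider: MaybeFlushableProvider): Promise<void> {
  const processor = provider.getActiveSpanProcessor?.();
  if (!processor) {
    // Log and resolve (never reject) so instrumentation can't break the
    // wrapped handler just because flushing is unavailable.
    console.error('forceFlushSpans: tracer provider cannot flush');
    return Promise.resolve();
  }
  // Swallow flush errors for the same reason.
  return processor.forceFlush().catch(() => undefined);
}
```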
One can spin their wheels wondering why the spans are not exported, like I did.
@longility have you taken a crack at a PR here? If not, I might try to implement @dyladan's simpler suggestion to just do type checking on the ProxyTracerProvider and extract the real TracerProvider if needed. While I see that the forceFlush issue might apply to other instrumentations later, I tend to agree with @longility that we can deal with it internally in the AWS Lambda instrumentation for now, before adding an edge-case field to the Instrumentation API. We can always revert it and go with the long-term fix via an API change later.
I've also confirmed that this bug is what's causing tests to fail in #660, so I'd like to get a fix out ASAP and will probably make that PR tomorrow.
@willarmiros I didn't know the direction so I didn't create a PR. I just copied in the code and made it work for us. I can send a quick PR to get things started.
I just read over this whole conversation because I am currently facing a similar issue: trying to instrument an Apollo GraphQL Server Lambda function and getting no exported spans. Based on what I've read here it seems like this issue should have been resolved, and I made sure I'm using the most up-to-date packages. I detailed what I've tried, with a link to a repo containing a demo, in this issue under Apollo Server's repo (which I realized after the fact is more likely an issue here rather than there): https://github.com/apollographql/apollo-server/issues/5923
Please let me know if I am doing something incorrectly! I can also open a new issue if it seems like this has a different root cause, just seemed pretty heavily related to this issue so figured I'd comment here first.
@mkossoris I got your exact scenario working. What do you see in your lambda logs? You may need more logging for me to help.
@longility Interesting, I went ahead and re-deployed the repo to a new AWS account just to make sure I was starting from a clean slate. I am still not seeing the trace data in the logs.
Here is some of the output I see when I run the local server:
...
{
  traceId: '184db6230d94288669a438f8e2d7630d',
  parentId: '5f4eb030569e0c15',
  name: 'middleware - <anonymous>',
  id: '7d6d2d427b4aa566',
  kind: 0,
  timestamp: 1638219365725886,
  duration: 27,
  attributes: {
    'http.route': '/',
    'express.name': '<anonymous>',
    'express.type': 'middleware'
  },
  status: { code: 0 },
  events: []
}
{
  traceId: '184db6230d94288669a438f8e2d7630d',
  parentId: undefined,
  name: 'POST /graphql',
  id: '5f4eb030569e0c15',
  kind: 1,
  timestamp: 1638219365720265,
  duration: 12641,
  attributes: {
    'http.url': 'http://localhost:3000/graphql',
    'http.host': 'localhost:3000',
    'net.host.name': 'localhost',
    'http.method': 'POST',
    'http.target': '/graphql',
    'http.request_content_length_uncompressed': 1811,
    'http.flavor': '1.1',
    'net.transport': 'ip_tcp',
    'net.host.ip': '::1',
    'net.host.port': 3000,
    'net.peer.ip': '::1',
    'net.peer.port': 56183,
    'http.status_code': 200,
    'http.status_text': 'OK',
    'http.route': ''
  },
  status: { code: 1 },
  events: []
}
As you can see, the trace data shows up properly when running it locally. And here is the output I see in the CloudWatch logs when running it on a deployed Lambda:
2021-11-29T12:53:18.533-08:00 | START RequestId: 67ead200-83ad-4745-8e57-b6e4b34ef276 Version: $LATEST
2021-11-29T12:53:18.544-08:00 | END RequestId: 67ead200-83ad-4745-8e57-b6e4b34ef276
2021-11-29T12:53:18.544-08:00 | REPORT RequestId: 67ead200-83ad-4745-8e57-b6e4b34ef276 Duration: 7.06 ms Billed Duration: 8 ms Memory Size: 1500 MB Max Memory Used: 108 MB
So none of the trace data seems to be exporting when running within the Lambda. What logs are you seeing when running it on Lambda?
Can you enable otel logging so we have a better idea?
import { diag, DiagConsoleLogger } from "@opentelemetry/api";
import { getEnv } from "@opentelemetry/core";
diag.setLogger(new DiagConsoleLogger(), getEnv().OTEL_LOG_LEVEL);
And set this env var to enable it:
OTEL_LOG_LEVEL=all
Actually, I forgot: we created our own lambda layer and did not rely on what AWS provided. Not sure if you went that route.
@mkossoris do you have active tracing enabled on your Lambda? If not, have you manually added X-Ray permissions to your function role? Also, are you setting a different endpoint for the OTel collector via env var?
@longility Here is the result of the adding the otel logging:
2021-11-29T16:36:32.859-08:00 | 2021-11-30T00:36:32.858Z undefined DEBUG @opentelemetry/api: Registered a global for diag v1.0.3.
| 2021-11-29T16:36:32.864-08:00 | 2021-11-30T00:36:32.863Z undefined DEBUG @opentelemetry/api: Registered a global for trace v1.0.3.
| 2021-11-29T16:36:32.864-08:00 | 2021-11-30T00:36:32.864Z undefined DEBUG @opentelemetry/api: Registered a global for context v1.0.3.
| 2021-11-29T16:36:32.864-08:00 | 2021-11-30T00:36:32.864Z undefined DEBUG @opentelemetry/api: Registered a global for propagation v1.0.3.
| 2021-11-29T16:36:32.870-08:00 | 2021-11-30T00:36:32.869Z undefined DEBUG @opentelemetry/instrumentation-http Applying patch for http@14.18.1
| 2021-11-29T16:36:32.903-08:00 | 2021-11-30T00:36:32.903Z undefined DEBUG @opentelemetry/instrumentation-http Applying patch for https@14.18.1
| 2021-11-29T16:36:33.011-08:00 | START RequestId: 66d6596b-56ae-4187-82eb-bf7d6860cc57 Version: $LATEST
| 2021-11-29T16:36:33.027-08:00 | END RequestId: 66d6596b-56ae-4187-82eb-bf7d6860cc57
| 2021-11-29T16:36:33.027-08:00 | REPORT RequestId: 66d6596b-56ae-4187-82eb-bf7d6860cc57 Duration: 15.51 ms Billed Duration: 16 ms Memory Size: 1500 MB Max Memory Used: 94 MB Init Duration: 493.64
The debug logs are only displayed on startup - further requests only show the START, END, REPORT, and Billed Duration logs from Lambda.
@willarmiros and @longility For this particular test, I'm trying to remove variables by just registering the instrumentation myself in the Lambda function and using the ConsoleExporter - not the CollectorExporter and not the Lambda extension. I have also tried it with the Lambda extension and had no luck with that.
Here are some logs from running it just with the Lambda Layer:
| 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.842Z info builder/receivers_builder.go:68 Receiver is starting... {"kind": "receiver", "name": "otlp"}
| 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.843Z info otlpreceiver/otlp.go:68 Starting GRPC server on endpoint 0.0.0.0:4317 {"kind": "receiver", "name": "otlp"}
| 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.843Z info otlpreceiver/otlp.go:86 Starting HTTP server on endpoint 0.0.0.0:4318 {"kind": "receiver", "name": "otlp"}
| 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.843Z info otlpreceiver/otlp.go:141 Setting up a second HTTP listener on legacy endpoint 0.0.0.0:55681 {"kind": "receiver", "name": "otlp"}
| 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.843Z info otlpreceiver/otlp.go:86 Starting HTTP server on endpoint 0.0.0.0:55681 {"kind": "receiver", "name": "otlp"}
| 2021-11-29T16:46:02.843-08:00 | 2021-11-30T00:46:02.843Z info builder/receivers_builder.go:73 Receiver started. {"kind": "receiver", "name": "otlp"}
| 2021-11-29T16:46:02.845-08:00 | 2021-11-30T00:46:02.843Z info service/telemetry.go:92 Setting up own telemetry...
| 2021-11-29T16:46:02.845-08:00 | 2021-11-30T00:46:02.844Z info service/telemetry.go:116 Serving Prometheus metrics {"address": ":8888", "level": "basic", "service.instance.id": "d40872f8-9119-4557-900e-986eb5c125f8", "service.version": "latest"}
| 2021-11-29T16:46:02.845-08:00 | 2021-11-30T00:46:02.844Z info service/collector.go:230 Starting otelcol... {"Version": "v0.1.0", "NumCPU": 2}
| 2021-11-29T16:46:02.848-08:00 | 2021-11-30T00:46:02.844Z info service/collector.go:132 Everything is ready. Begin running and processing data.
| 2021-11-29T16:46:02.848-08:00 | 2021/11/30 00:46:02 Registered extension ID: "e1e6e8a0-c361-4cad-a4a6-d9040d111dd3"
| 2021-11-29T16:46:02.848-08:00 | 2021/11/30 00:46:02 [collector] Register response: {
| 2021-11-29T16:46:02.848-08:00 | "functionName": "<FUNCTION_NAME>",
| 2021-11-29T16:46:02.848-08:00 | "functionVersion": "$LATEST",
| 2021-11-29T16:46:02.848-08:00 | "handler": "index.handler"
| 2021-11-29T16:46:02.848-08:00 | }
| 2021-11-29T16:46:03.297-08:00 | 2021/11/30 00:46:02 [collector] Waiting for event...
| 2021-11-29T16:46:03.597-08:00 | Registering OpenTelemetry
| 2021-11-29T16:46:03.597-08:00 | EXTENSION Name: collector State: Ready Events: [INVOKE,SHUTDOWN]
| 2021-11-29T16:46:03.599-08:00 | 2021/11/30 00:46:03 [collector] Received event: {
| 2021-11-29T16:46:03.599-08:00 | "eventType": "INVOKE",
| 2021-11-29T16:46:03.599-08:00 | "deadlineMs": 1638233166598,
| 2021-11-29T16:46:03.599-08:00 | "requestId": "3bfedfe8-5f5f-45a6-849d-71b540a42896",
| 2021-11-29T16:46:03.599-08:00 | "invokedFunctionArn": "<FUNCTION_ARN>",
| 2021-11-29T16:46:03.599-08:00 | "tracing": {
| 2021-11-29T16:46:03.599-08:00 | "type": "X-Amzn-Trace-Id",
| 2021-11-29T16:46:03.599-08:00 | "value": "Root=1-61a5744a-6644bc3f2571b9bb356781b7;Parent=5e8721674fab873d;Sampled=1"
| 2021-11-29T16:46:03.599-08:00 | }
| 2021-11-29T16:46:03.599-08:00 | }
| 2021-11-29T16:46:03.599-08:00 | 2021/11/30 00:46:03 [collector] Waiting for event...
| 2021-11-29T16:46:03.886-08:00 | END RequestId: 3bfedfe8-5f5f-45a6-849d-71b540a42896
| 2021-11-29T16:46:03.886-08:00 | REPORT RequestId: 3bfedfe8-5f5f-45a6-849d-71b540a42896 Duration: 287.92 ms Billed Duration: 288 ms Memory Size: 1500 MB Max Memory Used: 148 MB Init Duration: 941.74 ms XRAY TraceId: 1-61a5744a-6644bc3f2571b9bb356781b7 SegmentId: 5e8721674fab873d Sampled: true
| 2021-11-29T16:46:03.927-08:00 | START RequestId: bc1f6210-f4f9-4744-9034-4fc2797e08df Version: $LATEST
| 2021-11-29T16:46:03.930-08:00 | 2021/11/30 00:46:03 [collector] Received event: {
| 2021-11-29T16:46:03.930-08:00 | "eventType": "INVOKE",
| 2021-11-29T16:46:03.930-08:00 | "deadlineMs": 1638233166929,
| 2021-11-29T16:46:03.930-08:00 | "requestId": "bc1f6210-f4f9-4744-9034-4fc2797e08df",
| 2021-11-29T16:46:03.930-08:00 | "invokedFunctionArn": "<FUNCTION_ARN>",
| 2021-11-29T16:46:03.930-08:00 | "tracing": {
| 2021-11-29T16:46:03.930-08:00 | "type": "X-Amzn-Trace-Id",
| 2021-11-29T16:46:03.930-08:00 | "value": "Root=1-61a5744b-724089b277f6b495264d64d9;Parent=7c56ed2b60cefdf2;Sampled=1"
| 2021-11-29T16:46:03.930-08:00 | }
With those logs, we can see that the Lambda layer is clearly running, but it is only collecting traces about the Lambda invocation and not the GraphQL spans. And this trace data does show up in X-Ray, so it isn't a permissions issue between the Lambda function and X-Ray.
Hi @mkossoris
I'm trying to remove variables by just registering the instrumentation myself in the Lambda function and using the ConsoleExporter
I think this is a great way to debug! I've personally found it useful to confirm that you can see the traces in the Lambda logs first before you expect them to arrive at the Collector and then to the AWS X-Ray console.
Could you share the code you use to initialize OTel JS? Particularly, do you register the instrumentation for GraphQL? I can see from your debug logs above that your OTel JS setup is instrumenting the http and https packages:
| 2021-11-29T16:36:32.870-08:00 | 2021-11-30T00:36:32.869Z undefined DEBUG @opentelemetry/instrumentation-http Applying patch for http@14.18.1
| 2021-11-29T16:36:32.903-08:00 | 2021-11-30T00:36:32.903Z undefined DEBUG @opentelemetry/instrumentation-http Applying patch for https@14.18.1
Maybe you could try making a request with the http library and seeing if you still do not get any traces? Something like:
http.get('http://aws.amazon.com', () => {});
Also, not sure if you are aware, but we have documentation for using the published ADOT NodeJS Lambda Layer (i.e. arn:aws:lambda:<region>:901920570463:layer:aws-otel-nodejs-ver-1-0-0:1) which shows how to set up auto instrumentation, if you wanted to compare that against your configuration?
Hey @NathanielRN! I've created a repo with the code I've been using here: https://github.com/mkossoris/apollo-server-lambda-otel-issue. It does register the GraphQL instrumentation here: https://github.com/mkossoris/apollo-server-lambda-otel-issue/blob/master/src/open-telemetry.ts#L20.
I'll definitely try adding the http call just to see what happens. As for the ADOT Lambda configuration I tried, you can see the CDK code for that here: https://github.com/mkossoris/apollo-server-lambda-otel-issue/blob/master/cdk/lib/cdk-stack.ts#L10
@mkossoris I couldn't make too much sense of the log.
However, looking at your code: know that opentelemetry-js does monkey patching based on actual package names, so if you use esbuild through the NodejsFunction construct, it will minify and bundle the code and OTel will not be able to instrument it properly.
Secondly, I actually have not gotten the OTel bootstrapping to work within the lambda fn itself, only via the lambda layer extension.
Thirdly, tracing needs to be enabled as mentioned by @willarmiros even if you aren't using x-ray.
@longility I was just coming back here to say I found out it was because of using NodejsFunction construct after trying it with standard Lambda function instead... facepalm.
That being said, the extension still did not work entirely by itself as far as pushing traces to X-Ray goes. I was, however, able to get it to work by importing the open-telemetry file from my example and replacing the ConsoleExporter with the CollectorTraceExporter, which pushes the traces to the Lambda extension's collector port, and that did actually work.
Progress at least!
So then that leaves me with an important question: is there no way for us to be able to bundle our code and use OpenTelemetry's instrumentations? With Lambda size limits and how large node_modules dir can be, bundling is going to be pretty essential for us.
I see there's this issue that discusses OpenTelemetry's issues with esbuild: https://github.com/aws-observability/aws-otel-lambda/issues/99
At the end, the creator of the ticket says:
I will close the ticket unless you think of a different workaround.
Which sort of implies there was some workaround at all? Or is there no known workaround for using esbuild or webpack with OTel instrumentations at this time?
The only thing I can think of off the top of my head is to mark the OTel instrumentations and their related packages (e.g. @opentelemetry/instrumentation-graphql and graphql) as externals, and upload the external packages unbundled in node_modules or in a layer. I haven't tested whether this works yet, though.
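That externals idea can be sketched with esbuild's build API directly (an untested sketch; the entry point and package list are illustrative):

```typescript
import { build } from 'esbuild';

// Bundle the handler, but leave OTel and the packages it monkey-patches
// as real node_modules so require() still resolves them by package name.
build({
  entryPoints: ['src/server.ts'],
  bundle: true,
  platform: 'node',
  target: 'node14',
  keepNames: true, // minification must not mangle the names OTel patches
  external: ['@opentelemetry/*', 'graphql', 'express'],
  outfile: 'dist/index.js',
}).catch(() => process.exit(1));
```

The externalized packages would then need to be shipped alongside the bundle, e.g. copied into node_modules or provided by a layer.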
As far as I know, there is no workaround for the minify problem with the OTel JS monkey patching at this time. You would have to take the node_modules size hit.
Since you are dealing with GraphQL and lambda, I assume you will compose with other lambda functions via Apollo federation. Are there concerns for you about latency for the end user? I'm asking so you can think long term about whether lambda is the right solution, as we have been asking these types of questions on our end.
@longility Yep that's correct, we're intending to follow a similar design to what IMDb is using, which they posted an article about here: https://aws.amazon.com/blogs/compute/building-federated-graphql-on-aws-lambda/. It uses a Lambda to host the gateway and one Lambda per subgraph.
@longility So this is a pretty hacky initial attempt, but it surprisingly works lol:
const oTelLambdaLayer = lambda.LayerVersion.fromLayerVersionArn(
  this,
  "oTelLambdaLayer",
  "arn:aws:lambda:us-west-2:901920570463:layer:aws-otel-nodejs-ver-1-0-0:1"
);

const server = new NodejsFunction(this, "ServerFunction", {
  entry: "../src/server.ts",
  handler: "handler",
  memorySize: 1500,
  environment: {
    OTEL_LOG_LEVEL: "all",
    AWS_LAMBDA_EXEC_WRAPPER: "/opt/otel-handler",
  },
  layers: [oTelLambdaLayer],
  tracing: lambda.Tracing.ACTIVE,
  bundling: {
    commandHooks: {
      afterBundling(inputDir, outputDir) {
        // Manually copy OpenTelemetry packages and OTel-instrumented packages to output dir
        return [
          `cp ../build/open-telemetry.js ${outputDir}`,
          `mkdir ${outputDir}/node_modules`,
          `cp -r ../node_modules/@opentelemetry ${outputDir}/node_modules/@opentelemetry`,
          `cp -r ../node_modules/http ${outputDir}/node_modules/`,
          `cp -r ../node_modules/graphql ${outputDir}/node_modules/`
        ];
      },
      beforeBundling() {
        return [];
      },
      beforeInstall() {
        return [];
      }
    },
    keepNames: true,
    // Mark OpenTelemetry package and OTel-instrumented packages as external so they aren't bundled
    externalModules: [
      '@opentelemetry/instrumentation-http',
      '@opentelemetry/instrumentation-express',
      '@opentelemetry/instrumentation',
      '@opentelemetry/instrumentation-graphql',
      '@opentelemetry/instrumentation-aws-lambda',
      '@opentelemetry/api',
      '@opentelemetry/core',
      '@opentelemetry/resources',
      '@opentelemetry/tracing',
      '@opentelemetry/node',
      '@opentelemetry/exporter-collector',
      'http',
      'graphql'
    ]
  },
});
Here's the resulting trace in X-Ray:
And the resulting Lambda code asset is 6.9MB vs 13MB if I uploaded the entire thing, and the difference would likely get quite a bit larger as more dependencies are added. So it might be a little hacky, but it might not be too terrible of a hack to use...
Glad you're making progress.
Just to update, I believe the layer actually did start working once I removed esbuild. The CloudWatch logs were showing the Lambda invocations from the Collector so I was under the impression it would show the other trace data as well. But in fact the trace data does appear to be showing up in XRay.
Hey all, I found a couple of solutions to the esbuild issues. One part was exporting the handler via:
module.exports = { handler }
Here is how I solved it, within the build script:
esbuild src/index.js --bundle --minify \
--platform=node \
--outfile=$OUTPUT_DIR/index.js
esbuild src/tracing.js --bundle --minify \
--platform=node \
--outfile=$OUTPUT_DIR/tracing/build/index.js
cp -r node_modules/@opentelemetry/exporter-trace-otlp-grpc/build/protos \
$OUTPUT_DIR/tracing/
What version of OpenTelemetry are you using?
What version of Node are you using?
v14.17.4
What did you do?
Create a lambda layer where I started from this: https://github.com/open-telemetry/opentelemetry-js#instantiate-tracing
What did you expect to see?
I was expecting to see statusCode: 200 to indicate that it got exported correctly. I saw this in the debug log when I was running a server, not a lambda function. Also, I am not seeing any associated traces. I am testing against the same collector.

What did you see instead?
Here is the actual log
Additional context
Any ideas how to debug or understand what the issue is on why the spans are not exporting?
I was led here from: https://github.com/open-telemetry/opentelemetry-js/issues/1295#issuecomment-907476081 /cc @willarmiros @anuraaga