open-telemetry / opentelemetry-js

OpenTelemetry JavaScript Client
https://opentelemetry.io
Apache License 2.0
2.65k stars 765 forks source link

Trace Context Lost by Electron JS IPC Call #3814

Open sernst opened 1 year ago

sernst commented 1 year ago

What happened?

Steps to Reproduce

I've created a minimal example that demonstrates the issue here: https://github.com/sernst/electron-otel-issue

The issue is that the trace context is not being preserved across an Electron IPC call. In the linked example application, an onLoad span is created when the index.html page loads. Console output for the span looks like this:

{
    "traceId": "f78a9bdbdb090c69f30ca6f99860202e",
    "name": "1. onLoad",
    "id": "da41bcc241419700",
    "kind": 0,
    "timestamp": 1684330163630000,
    "duration": 4000,
    "attributes": {},
    "status": {
        "code": 0
    },
    "events": [],
    "links": []
}

Within the onLoad span, two IPC calls are made to the Electron main process. The first IPC call, foo, shares the trace ID with its onLoad parent span like this:

{
    "traceId": "f78a9bdbdb090c69f30ca6f99860202e",
    "parentId": "da41bcc241419700",
    "name": "2. runFoo",
    "id": "a1422c247b61b860",
    "kind": 0,
    "timestamp": 1684330163631000,
    "duration": 1700,
    "attributes": {},
    "status": {
        "code": 0
    },
    "events": [],
    "links": []
}

However, the second IPC call, bar, does not preserve the trace ID. Nor is the parent ID the parent span that called it:

{
    "traceId": "d5d0465e78aa50eb933a4e4756efdbc5",
    "name": "3. runBar",
    "id": "6de85efb1b8770eb",
    "kind": 0,
    "timestamp": 1684330163634000,
    "duration": 400,
    "attributes": {},
    "status": {
        "code": 0
    },
    "events": [],
    "links": []
}

Expected Result

I would expect that trace context would be preserved across IPC calls to the Electron backend.

Actual Result

Any IPC call resets the tracing context and spans that should be within a trace are orphaned.

OpenTelemetry Setup Code

https://github.com/sernst/electron-otel-issue/blob/main/index.js

package.json

https://github.com/sernst/electron-otel-issue/blob/main/package.json

Relevant log output

No response

dyladan commented 1 year ago

I think this would require a separate instrumentation specifically for electron IPC. This isn't a bug so I'm going to close it as wont-fix. Please make an issue requesting a new instrumentation in contrib if this is something you would like to see.

t2t2 commented 1 year ago

Checking the code in index.js I noticed you're using async/await. This is most likely a limitation in zone.js context maanger: https://github.com/open-telemetry/opentelemetry-js/issues/2655#issuecomment-987738203

sernst commented 1 year ago

@dyladan Would you be willing to clarify why it's not a bug and would require separate instrumentation? I'm using the API to instrument directly, which surprises me that there's no capability in the core modules to custom implement something workable. Naively, I would assume that any custom implementation would be built upon the API, which would suggest that there's a way to work around this in the API - even if it's an ugly one - and that any contrib instrumentation would be similarly built upon that. Is it the case that a different ContextManager implementation would be needed, and such managers fall out of scope for the core API?

In either case, any guidance on where I would start looking toward contributing such custom instrumentation myself?

sernst commented 1 year ago

@t2t2 Is the suggestion to avoid async/await and the problem would be solved? In more complex cases, I've been using async/await without observing that issue, but perhaps there's a nuance I'm missing in regard to how it would manifest this way.

dyladan commented 1 year ago

Sorry I just assumed you were using the autoinstrumentation since most of our users are. You are correct that there should be a way to work around it with the API. Most likely you need to use the propagation api. Particularly you will need to inject context at the caller then extract it at the receiver.

dyladan commented 1 year ago

I still don't think this is a bug but I'll reopen it. I think this can either be solved by manually propagating as mentioned above or by creating a new instrumentation which automatically propagates.

sernst commented 1 year ago

@dyladan Appreciate the response and clarification. I omitted the manual context propagation in the minimal example I included, but I am using the propagation API to send context through the IPC call and enable distributed tracing. That part is working as I'd hoped in that I do successfully get distributed tracing for the first IPC call.

However, the context is reset once the IPC call completes. Something like this:

// 1. Inject context into a carrier for manual IPC propagation.
const carrier = {};
propagation.inject(context.active(), carrier);
console.log("BEFORE:", carrier);
// BEFORE: { traceparent: "00-abcdef...

// 2. The carrier is extracted in the IPC receiver and used to enable distributed tracing successfully.
const response = await makeSomeIpcCall(carrier);

// 3. Example of how repeating the injection now leads to an empty context. traceparent and tracestate have been reset.
const afterCarrier = {};
propagation.inject(context.active(), afterCarrier);
console.log("AFTER:", afterCarrier);
// AFTER: {}

I still have the injected carrier data in scope that was sent with the IPC after it has returned. However, I've found no way to extract that into the active context utilized by the subsequent spans called within the original context without passing that carrier data all the way up the call stack to extract the context again at the top, which gets out of hand quickly.

Is there some way to extract the carrier after the IPC call such that the context becomes active for subsequent spans up the call stack that I'm missing? Or some other mechanism? It does seem like that was intentionally not possible because it would be potentially corrupting. At the same time, it's clearly possible in some fashion because some mechanism is responsible for the reset across the IPC call. But since the context is reset within an ongoing contextual tracing action, I though this was a bug more than a question because I could not find a manual, public API workaround to support this case. Thoughts?

t2t2 commented 1 year ago

Is the suggestion to avoid async/await and the problem would be solved? In more complex cases, I've been using async/await without observing that issue, but perhaps there's a nuance I'm missing in regard to how it would manifest this way.

So it's not an issue if the code is transpiled to convert async-await into... usually it gets transpiled into generators which zone.js seems to support better

If your build stack in the real project is the same, then good news is that esbuild (which vite uses internally) has supported option which allows specifying which language features are supported:

image

image

sernst commented 1 year ago

@t2t2 Thanks for that information. It's certainly unexpected to see async/await not supported, and the link earlier that clarifies the reason why is helpful to understand.

sernst commented 1 year ago

For anyone that ends up here in the future, I've been able to make this work using additional zone.js plugins. The example used in raising this issue has been updated with a working solution for reference:

https://github.com/sernst/electron-otel-issue

My remaining question, or more topic for clarification/discussion I suppose at this point is: is there an opportunity to document the the use of zones more to help those not on a standard happy-path OTel implementation here. Or is that already documented somewhere that I missed prior to creating this issue? I had to dig into zone.js to understand what was going on and ultimately solve my issue. Give the complexity of zone.js, it seems to me like exposing that information in some form would be beneficial for anyone working on custom telemetry. Unless there's a deliberate decision to exclude that complexity. I'd certainly be happy to contribute to the docs if someone from the project thought that would be beneficial and pointed me in the right direction. Thoughts?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

sernst commented 1 year ago

Will reengage on the associated PR soon.