open-telemetry / opentelemetry-java

OpenTelemetry Java SDK
https://opentelemetry.io
Apache License 2.0
1.96k stars 806 forks source link

OTel Context is getting lost in GraphQL manual instrumentation #6583

Open govi20 opened 1 month ago

govi20 commented 1 month ago

I have an async GraphQL resolver which I am using along with a GraphQLdata loader. The GraphQL service performs manual instrumentation. The problem I am facing here is that OpenTelemetry context is getting lost in the load method. This is happening even before offloading the task on the context-wrapped executor.

Steps to reproduce

  1. Clone: https://github.com/govi20/dgs-otel
  2. Build and run the project.
  3. Access localhost:9090/graphiql
  4. execute the following GraphQL query and see the logs where I've printed Context in DepartmentDataLoader
query emploees {
  employees {
    id
    name
    department {
      id
      name
    }
  }
}

What did you expect to see? Otel Span and context should be available in the DepartmentDataLoader's load() method as there is no thread switch in between. the span's end method is not called so I assume the span is not closed either

What did you see instead? OTel Context is not getting propagated.

What version and what artifacts are you using? Version: 1.38.0, I use custom implementation of SpanExporter.

Environment This is not environment specific issue, it's reproducible on MacOS as well as CentOS.

Additional context I've reported this issue on GraphQL DGS Framework that I use: https://github.com/Netflix/dgs-framework/issues/1928 The Netflix DGS Folks have recommended me to check with tracing framework team.

govi20 commented 1 month ago

Let me know if a sample project with minimal reproducible code is required.

jack-berg commented 1 month ago

Let me know if a sample project with minimal reproducible code is required.

That would be very useful 🙂

govi20 commented 1 month ago

@jack-berg here is a sample project with a reproducible example https://github.com/govi20/dgs-otel

This is where the OTel Context gets lost => DepartmentDataloader

and the data loaders get called from here => EmployeeDataFetcher and there is no thread switch in between. The thread pool that I have configured executor that wraps the task using Context.taskWrapping API.

I've added steps to reproduce in the bug report.

jkwatson commented 1 month ago

It looks like dgs is based on graphql-java, which has library instrumentation (https://github.com/open-telemetry/opentelemetry-java-instrumentation/tree/main/instrumentation/graphql-java/graphql-java-20.0/library). I wonder if you just plug in that instrumentation library if a bunch of these issues would go away. Worth a shot, at least, to see how it does!

govi20 commented 1 month ago

@jkwatson does this library perform auto instrumentation? I need to rely on manual instrumentation logic

jkwatson commented 1 month ago

The "library" instrumentation doesn't require the javaagent. You can just plug it in programmatically. You might need to figure out how to hook it into dgs, but it doesn't need the agent.

jkwatson commented 1 month ago

Given this documentation, I would guess it'll work just fine: https://netflix.github.io/dgs/advanced/instrumentation/

govi20 commented 1 month ago

@jkwatson That's exactly what I am doing in my code, manually instrumenting using SimpleInstrumention implementation

jkwatson commented 1 month ago

@jkwatson That's exactly what I am doing in my code, manually instrumenting using SimpleInstrumention implementation

I recommend trying the library instrumentation then, and raising issues in the instrumentation repo if it has holes.

govi20 commented 1 month ago

@jkwatson Unfortunately, I can't use the library because it lacks the ability to instrument GraphQL data resolvers.

But still, I've tried it out, and the issue can indeed be reproducible with this library. I plan to report this issue to the instrumentation repo. However, I'm unsure if the problem lies within the core libraries or in the instrumentation.

jkwatson commented 1 month ago

The issue won't be with the core libraries. It's definitely an issue with making the instrumentation correctly propagate the context where it needs to go.