open-telemetry / opentelemetry-java-instrumentation

OpenTelemetry auto-instrumentation and instrumentation libraries for Java
https://opentelemetry.io
Apache License 2.0

Spans submitted to the same trace ID when using Flux.retryWhen #8432

Closed livmackintosh closed 1 year ago

livmackintosh commented 1 year ago

Describe the bug

Spans are submitted to the same trace ID. The only thing that has changed is upgrading from 1.19.0 (Honeycomb 1.4.0) to 1.25.0 (Honeycomb 1.5.0).

When I first reported this bug I thought that enabling the add-baggage flag -Dotel.instrumentation.logback-mdc.add-baggage=true was causing the problem; however, I have since bisected the change, and simply moving to 1.25.0 of the agent is enough to cause it.
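
For context, the Flux.retryWhen pattern from the title looks roughly like the following. This is a minimal sketch, not our actual code: fetchFromUpstream and the backoff policy are illustrative placeholders.

    import reactor.core.publisher.Flux
    import reactor.util.retry.Retry
    import java.time.Duration

    // Hypothetical upstream call; stands in for whatever reactive source is being retried.
    fun fetchFromUpstream(): Flux<String> =
        Flux.error(IllegalStateException("upstream unavailable"))

    // The retryWhen pattern in question: each retry re-subscribes to the source,
    // so every attempt runs through the reactive instrumentation again.
    fun pollWithRetry(): Flux<String> =
        fetchFromUpstream()
            .retryWhen(Retry.backoff(3, Duration.ofSeconds(1)))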

Steps to reproduce

What did you expect to see? A normal number of spans per trace within a root span

What did you see instead? Hundreds of thousands of spans and no root trace detected

What version are you using? v1.25.0

Environment
Compiler: Eclipse Temurin JDK 17
OS: Ubuntu 22.04 / Alpine 3.17

Additional context We are using the Honeycomb distribution (v1.5.0) which is a thin wrapper around the OTel agent providing easy configuration for their cloud platform.

We are using auto-instrumentation

We are using -Dotel.instrumentation.kafka.experimental-span-attributes=true
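
For reference, a rough sketch of how we wire up the agent and pass this flag (Gradle Kotlin DSL); the agent jar path below is a placeholder rather than our exact setup:

    // Illustrative only: attach the agent and pass the flag via default JVM args.
    application {
        applicationDefaultJvmArgs = listOf(
            "-javaagent:honeycomb-opentelemetry-javaagent.jar", // placeholder path
            "-Dotel.instrumentation.kafka.experimental-span-attributes=true"
        )
    }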

livmackintosh commented 1 year ago

I have refined the issue a little and will continue investigating and improving the bug report. cc: @laurit

livmackintosh commented 1 year ago

Bisected our changes and can confirm that the upgrade from 1.19.0 to 1.25.0 alone was the culprit.

livmackintosh commented 1 year ago

Now, I'm not sure whether there are breaking changes that we are not aware of... that's the current line of investigation, to try to narrow down the problem further.

livmackintosh commented 1 year ago

Many, if not all, of the spammed spans come from the ReadOperationHandler class within spring-boot-actuator.

mateuszrzeszutek commented 1 year ago

Can you say a bit more about the trace itself? Is it leaking HTTP server spans, or some background job ones? What related frameworks/libraries are you using?

livmackintosh commented 1 year ago

We're using the Spring framework as our HTTP gateway, but behind that we have Micronaut microservices. The span leaks seem to be occurring at many points in the stack, primarily within Spring: at the service level, but also further down in ReadOperationHandler.

Let me see if I can provide a minimal viable example and perhaps a list of the libraries we are using.

livmackintosh commented 1 year ago

Dependencies within Spring Boot gateway

    implementation(group = "org.springframework.boot", name = "spring-boot-starter-webflux")
    implementation(group = "org.springframework.boot", name = "spring-boot-starter-actuator")
    implementation(group = "org.springframework.boot", name = "spring-boot-starter-validation")
    implementation(group = "org.jetbrains.kotlin", name = "kotlin-reflect")
    implementation(group = "org.jetbrains.kotlin", name = "kotlin-stdlib-jdk8")
    implementation(group = "org.jetbrains.kotlinx", name = "kotlinx-coroutines-reactor", version = coroutinesVersion)
    implementation(platform("io.arrow-kt:arrow-stack:$arrowVersion"))
    implementation("io.arrow-kt:arrow-core")
    implementation("io.arrow-kt:arrow-fx-stm")
    implementation("io.arrow-kt:arrow-fx-coroutines")
    implementation(group = "com.fasterxml.jackson.module", name = "jackson-module-kotlin", version = jacksonVersion)
    implementation("io.honeycomb:honeycomb-opentelemetry-sdk:$honeycombVersion")
    compileOnly("io.honeycomb:honeycomb-opentelemetry-javaagent:$honeycombVersion")
    implementation("io.grpc:grpc-netty-shaded:$grpcVersion")
    implementation("com.google.protobuf:protobuf-java-util:$protoVersion")
    implementation(group = "io.fabric8", name = "kubernetes-client", version = k8sClientVersion) // <- k8s client for using grpc endpoints watching
    runtimeOnly(group = "net.logstash.logback", name = "logstash-logback-encoder", version = logstashLogbackEncoder)
    testImplementation(group = "org.springframework.boot", name = "spring-boot-starter-test")
    testImplementation(group = "org.hamcrest", name = "hamcrest-all", version = hamcrestVersion)
    testImplementation(group = "org.mockito", name = "mockito-core", version = mockitoVersion)
    testImplementation(group = "org.mockito.kotlin", name = "mockito-kotlin", version = mockitoKotlinVersion)
    testImplementation(group = "org.mockito", name = "mockito-inline", version = mockitoInlineVersion)
    testImplementation(group = "io.cucumber", name = "cucumber-java", version = cucumberVersion)
    testImplementation(group = "io.cucumber", name = "cucumber-junit", version = cucumberVersion)
    testImplementation(group = "io.cucumber", name = "cucumber-spring", version = cucumberVersion)
    testRuntimeOnly(group = "org.junit.vintage", name = "junit-vintage-engine", version = junitVersion.toString())

Versions:

    val micronautVersion by extra("3.4.4")
    val protoVersion by extra("3.21.12")
    val grpcVersion by extra("1.50.2")
    val grpcKotlinVersion by extra("1.3.0")
    val reactorVersion by extra("3.4.18")
    val honeycombVersion by extra("1.4.0")
    val arrowVersion by extra("1.1.2")
    val junitVersion by extra("5.9.1")
    val testContainersVersion by extra("1.17.5")
    val coroutineTestVersion by extra("1.6.4")

The full dependency tree is rather large so I will refrain from posting it for now

livmackintosh commented 1 year ago

Potentially a dupe of interest https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/8125

laurit commented 1 year ago

@livmackintosh If you believe that what you are seeing is a bug and would like it fixed, then you need to either provide a minimal application that reproduces your issue or describe it in sufficient detail to allow others to reproduce it. Thus far the only details included are that you are seeing a lot of spans in one trace and that you are using Spring, which is not enough to even guess where the issue might lie.

livmackintosh commented 1 year ago

@laurit Completely understand. I'm providing as much information as I can given time constraints and my knowledge of our stack. I started last month at a new company, so there are lots of moving parts, and I'm not familiar with the OTel internals, which are rather complicated, or at least not simple. Bear with me and I will follow up with more information and potentially a PR ;)

livmackintosh commented 1 year ago

Probably also related: https://github.com/open-telemetry/opentelemetry-java-instrumentation/issues/8454

livmackintosh commented 1 year ago

I'm going to test against 09b63d22944c86d8511658f0730cbbc7087ed0b2 as well as its parent commit. If that commit introduces the problem, I will close this issue and consider #8125 to supersede it. The reporter of that issue kindly added a repro, so it may be more helpful. I can dig further into the root cause and either make a fix or test any related PRs.

NB: We're also using reactive streams.

livmackintosh commented 1 year ago

Closing as a duplicate of #8125.