open-telemetry / opentelemetry-go-instrumentation

OpenTelemetry Auto Instrumentation using eBPF
https://opentelemetry.io
Apache License 2.0
530 stars 81 forks source link

net/http context propagation appears flaky #220

Open damemi opened 1 year ago

damemi commented 1 year ago

The test app fixtures are showing a weird flakiness where parentSpanId is sometimes set (redacted to xxxxx), and sometimes empty.

First noticed in https://github.com/open-telemetry/opentelemetry-go-instrumentation/pull/213#issuecomment-1623937567. In this case, the PR did not have parentSpanId set, but the Github workflow output did.

After this, in that same PR, a commit was pushed to add parentSpanId. The Github workflow then failed again, trying to remove the parentSpanId value.

The parentSpanId field was first set in https://github.com/open-telemetry/opentelemetry-go-instrumentation/pull/91 for gorillamux and nethttp fixtures (https://github.com/open-telemetry/opentelemetry-go-instrumentation/pull/91/commits/56cbc7d9e41df12cfadc7f418267fbf5e0cc8b18).

Possible causes:

It could also be a combination of both, or something else entirely.

There is an open PR https://github.com/open-telemetry/opentelemetry-go-instrumentation/pull/118 which improves context propagation across goroutines. Since the sample apps all call themselves, the client+server spans are made under different goroutines of the same process. Maybe this has something to do with it?

Possibly relevant: open issue about http context propagation https://github.com/open-telemetry/opentelemetry-go-instrumentation/issues/79

oxeye-gal commented 1 year ago

Hi @damemi @MikeGoldsmith,

We looked into this issue a bit and found out the context propagation works fine until the point of the gorilla instrumentation. Meaning, the outputted traces from the gorilla test app appear to be broken down to two traces:

  1. Consists of two spans from net/http client and server
  2. Consists of single span from gorilla instrumentation

While looking into the gorilla instrumentation we couldn't see any attempt to propagate the existing context as opposed to other instrumentations that do that such as gRPC and net/http which holds an additional field for this called "psc" - parent span context.

Can you please elaborate on that?

Thanks!

damemi commented 1 year ago

@oxeye-gal thanks for looking into it, the fact that gorilla has its own span on top of the net/http is weird to me too. I believe it uses net/http under the hood, so maybe the ebpf is instrumenting functions from both libraries.

That's something that we should figure out, but it might be separate from what I'm describing in the original comment. We have seen this parentSpanId propagation fail for the net/http test too (https://github.com/open-telemetry/opentelemetry-go-instrumentation/actions/runs/5478807240/jobs/9980402767#step:18:28), so there may be an issue with the workflow or something not right in the propagation.

oxeye-gal commented 1 year ago

Hello @damemi, thank you for bringing this issue to our attention. We have investigated it further and were able to replicate it on our end. It appears that the root cause of the issue is related to how Go handles stack space. Upon entering a called function, Go will check if there is enough space for the goroutine to run through the function. If there isn’t enough space, Go will allocate additional space using the runtime.morestack_noctxt function and then jump to the first instruction of the called function. In our case, this triggers the eBPF uprobe once again.

   0x00000000005f1ad7 <+4215>:  call   0x464a40 <runtime.morestack_noctxt>
   0x00000000005f1adc <+4220>:  mov    0x8(%rsp),%rax
   0x00000000005f1ae1 <+4225>:  mov    0x10(%rsp),%rbx
   0x00000000005f1ae6 <+4230>:  jmp    0x5f0a60 <net/http.(*Client).do>

It’s important to note that the behavior of Go could impact other instrumentations, particularly those involving client instrumentation that modify the Go structures.

varshneydevansh commented 1 year ago

I did look for some good first issue but couldn't find any, if there are any issues where I can contribute to?