Open damemi opened 1 year ago
Hi @damemi @MikeGoldsmith,
We looked into this issue a bit and found out the context propagation works fine until the point of the gorilla instrumentation. Meaning, the outputted traces from the gorilla test app appear to be broken down to two traces:
While looking into the gorilla instrumentation we couldn't see any attempt to propagate the existing context as opposed to other instrumentations that do that such as gRPC and net/http which holds an additional field for this called "psc" - parent span context.
Can you please elaborate on that?
Thanks!
@oxeye-gal thanks for looking into it, the fact that gorilla has its own span on top of the net/http is weird to me too. I believe it uses net/http under the hood, so maybe the ebpf is instrumenting functions from both libraries.
That's something that we should figure out, but it might be separate from what I'm describing in the original comment. We have seen this parentSpanId
propagation fail for the net/http test too (https://github.com/open-telemetry/opentelemetry-go-instrumentation/actions/runs/5478807240/jobs/9980402767#step:18:28), so there may be an issue with the workflow or something not right in the propagation.
Hello @damemi, thank you for bringing this issue to our attention. We have investigated it further and were able to replicate it on our end.
It appears that the root cause of the issue is related to how Go handles stack space. Upon entering a called function, Go will check if there is enough space for the goroutine to run through the function. If there isn’t enough space, Go will allocate additional space using the runtime.morestack_noctxt
function and then jump to the first instruction of the called function. In our case, this triggers the eBPF uprobe once again.
0x00000000005f1ad7 <+4215>: call 0x464a40 <runtime.morestack_noctxt>
0x00000000005f1adc <+4220>: mov 0x8(%rsp),%rax
0x00000000005f1ae1 <+4225>: mov 0x10(%rsp),%rbx
0x00000000005f1ae6 <+4230>: jmp 0x5f0a60 <net/http.(*Client).do>
It’s important to note that the behavior of Go could impact other instrumentations, particularly those involving client instrumentation that modify the Go structures.
I did look for some good first issue but couldn't find any, if there are any issues where I can contribute to?
The test app fixtures are showing a weird flakiness where
parentSpanId
is sometimes set (redacted toxxxxx
), and sometimes empty.First noticed in https://github.com/open-telemetry/opentelemetry-go-instrumentation/pull/213#issuecomment-1623937567. In this case, the PR did not have
parentSpanId
set, but the Github workflow output did.After this, in that same PR, a commit was pushed to add
parentSpanId
. The Github workflow then failed again, trying to remove theparentSpanId
value.The
parentSpanId
field was first set in https://github.com/open-telemetry/opentelemetry-go-instrumentation/pull/91 forgorillamux
andnethttp
fixtures (https://github.com/open-telemetry/opentelemetry-go-instrumentation/pull/91/commits/56cbc7d9e41df12cfadc7f418267fbf5e0cc8b18).Possible causes:
It could also be a combination of both, or something else entirely.
There is an open PR https://github.com/open-telemetry/opentelemetry-go-instrumentation/pull/118 which improves context propagation across goroutines. Since the sample apps all call themselves, the client+server spans are made under different goroutines of the same process. Maybe this has something to do with it?
Possibly relevant: open issue about http context propagation https://github.com/open-telemetry/opentelemetry-go-instrumentation/issues/79