mariomac closed this 1 year ago
@grcevski I see the out-of-order problem. Let me rethink the current implementation to see how we can minimize the chance.
Anyway, I'm not 100% sure that we are currently safe against out-of-order processing, as eBPF programs can be executed on different CPUs.
Ah, good point! So I was thinking, in the meantime, perhaps we need to expose the BPF map with the ongoing HTTP and gRPC requests to the Go space. When it comes time to create the nested span, we look up the map to see whether we should first create the parent span or simply continue with context.TODO. When the parent span's event comes through, we check our in-memory cache to see if someone created it before; if they did, we pick it up and end it.
If you think this might work, I think we are good to go with this change; we can expose the map later. It would mean that the trace span creator code is able to look up the ongoing goroutine ids: an HTTP client call to, say, an external service would be able to check if there's a gRPC server span with the same goroutine id and do the right thing.
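The lookup logic described above could be sketched roughly like this. This is a hedged illustration only: the type and function names (SpanCache, OnNestedSpan, OnParentSpan) are hypothetical, not Beyla's actual API, and it models just the "create the parent if missing, otherwise pick it up" decision keyed by goroutine id.

```go
package main

import (
	"fmt"
	"sync"
)

// ParentSpan is a hypothetical stand-in for an ongoing server span.
type ParentSpan struct {
	GoroutineID    uint64
	StartTimeNs    uint64
	CreatedByChild bool // true if the nested (client) span materialized it first
}

// SpanCache is the in-memory cache discussed above, keyed by goroutine id.
type SpanCache struct {
	mu    sync.Mutex
	spans map[uint64]*ParentSpan
}

func NewSpanCache() *SpanCache {
	return &SpanCache{spans: make(map[uint64]*ParentSpan)}
}

// OnNestedSpan runs when a client (nested) span event arrives. If no
// parent exists yet for this goroutine, it creates a placeholder so the
// parent's event can pick it up later; otherwise it reuses the entry.
func (c *SpanCache) OnNestedSpan(goid, startNs uint64) *ParentSpan {
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.spans[goid]; ok {
		return p
	}
	p := &ParentSpan{GoroutineID: goid, StartTimeNs: startNs, CreatedByChild: true}
	c.spans[goid] = p
	return p
}

// OnParentSpan runs when the server (parent) span event arrives. If a
// child already created the entry, pick it up, fix its start time with
// the authoritative value, and end it (here: remove it from the cache).
func (c *SpanCache) OnParentSpan(goid, startNs uint64) *ParentSpan {
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.spans[goid]; ok && p.CreatedByChild {
		p.StartTimeNs = startNs
		delete(c.spans, goid)
		return p
	}
	p := &ParentSpan{GoroutineID: goid, StartTimeNs: startNs}
	c.spans[goid] = p
	return p
}

func main() {
	c := NewSpanCache()
	child := c.OnNestedSpan(42, 100) // client event arrives first
	parent := c.OnParentSpan(42, 90) // parent event arrives later
	fmt.Println(child == parent, parent.StartTimeNs) // true 90
}
```

Both orderings resolve to the same entry: whichever event arrives first creates it, and the later one picks it up.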
@grcevski you've got a good point. The ongoing requests map can be exposed easily to Go. We just need to make sure that, in the case we need to pick the data up from the Go userspace, nobody else removes it (e.g. when the ongoing request has ended).
Here's an example of what I do now.
It's inefficient because I send two events for the server span: one when we start tracking and one on return. But in theory all we need is a goroutine id -> start time mapping when the client span is processed.
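The minimal state described above could be as small as this. A hedged sketch, assuming hypothetical names (startTimes, recordStart, parentStart): the only thing recorded at request start is the goroutine id mapped to the start timestamp, and it is read back when a client span on the same goroutine is processed.

```go
package main

import "fmt"

// startTimes maps goroutine id -> server request start timestamp (ns).
type startTimes map[uint64]uint64

// recordStart would be fed from the first eBPF event of the request.
func (s startTimes) recordStart(goid, ts uint64) { s[goid] = ts }

// parentStart is consulted when a client span on the same goroutine is
// processed; ok == false means there is no ongoing server request.
func (s startTimes) parentStart(goid uint64) (uint64, bool) {
	ts, ok := s[goid]
	return ts, ok
}

func main() {
	s := startTimes{}
	s.recordStart(7, 1234)
	if ts, ok := s.parentStart(7); ok {
		fmt.Println(ts) // 1234
	}
}
```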
Merging #84 (905efe3) into main (7ad75d7) will decrease coverage by 7.64%. The diff coverage is 1.21%.
@@ Coverage Diff @@
## main #84 +/- ##
==========================================
- Coverage 46.58% 38.94% -7.64%
==========================================
Files 15 20 +5
Lines 1172 1330 +158
==========================================
- Hits 546 518 -28
- Misses 577 767 +190
+ Partials 49 45 -4
Flag | Coverage Δ | |
---|---|---|
unittests | 38.94% <1.21%> (-7.64%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
Impacted Files | Coverage Δ | |
---|---|---|
pkg/ebpf/common/bpf_bpf.go | 0.00% <0.00%> (ø) | |
pkg/ebpf/common/ringbuf.go | 0.00% <0.00%> (ø) | |
pkg/ebpf/grpc/bpf_debug_bpfel_x86.go | 0.00% <0.00%> (ø) | |
pkg/ebpf/grpc/grpc.go | 0.00% <0.00%> (ø) | |
pkg/ebpf/nethttp/bpf_debug_bpfel_x86.go | 0.00% <ø> (ø) | |
pkg/ebpf/nethttp/nethttp.go | 0.00% <0.00%> (ø) | |
pkg/ebpf/tracer.go | 0.00% <0.00%> (ø) | |
pkg/export/otel/metrics.go | 78.33% <ø> (ø) | |
pkg/goexec/instructions.go | 0.00% <0.00%> (ø) | |
pkg/goexec/offsets.go | 0.00% <0.00%> (ø) | |
... and 3 more | |
OK, cool, that's what I thought. So the only remaining problem is when we are processing the events and the ring buffers are behind, so the map is already cleaned up. I wonder, though: maybe all we need is to send more data with the client span, the parent ID and start time; perhaps the start time can even be adjusted later.
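That idea could look roughly like this. A hedged sketch only, all names (clientSpanEvent, synthesizeParent, confirm) are hypothetical: the client-span event itself carries the parent's goroutine id and start time, so the parent span can still be created even if the BPF map entry was already cleaned up, and its start time is adjusted later if the parent's own event eventually arrives.

```go
package main

import "fmt"

// clientSpanEvent is a hypothetical event payload extended with the
// parent's identity and start time, as suggested above.
type clientSpanEvent struct {
	GoroutineID     uint64
	ParentGoroutine uint64 // goroutine id of the enclosing server request
	ParentStartNs   uint64 // parent start time, carried in the event
}

// span is a minimal stand-in for a trace span under construction.
type span struct {
	GoroutineID uint64
	StartNs     uint64
	Provisional bool // true until the parent's own event confirms it
}

// synthesizeParent builds the parent span from the data carried in the
// client event, so no map lookup is needed at this point.
func synthesizeParent(ev clientSpanEvent) *span {
	return &span{GoroutineID: ev.ParentGoroutine, StartNs: ev.ParentStartNs, Provisional: true}
}

// confirm adjusts the provisional start time with the authoritative one
// once (and if) the parent's own event is processed.
func (s *span) confirm(startNs uint64) {
	s.StartNs = startNs
	s.Provisional = false
}

func main() {
	ev := clientSpanEvent{GoroutineID: 9, ParentGoroutine: 3, ParentStartNs: 200}
	p := synthesizeParent(ev)
	fmt.Println(p.StartNs, p.Provisional) // 200 true
	p.confirm(190)
	fmt.Println(p.StartNs, p.Provisional) // 190 false
}
```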
I think I'm OK with this, we'll find a solution through some of these ideas.
Thanks Mario!
This will allow us to start adding new code in separate modules (e.g. PR #85). This code works, but we will still need some improvements, which I'd suggest adding in future PRs so we minimize merge conflicts with other tasks going on in parallel.
To be done in following PRs:
- Add tests for the tracer.go file, which is now more testable as we have removed the eBPF code from it