grafana / beyla

eBPF-based autoinstrumentation of web applications and network metrics
https://grafana.com/oss/beyla-ebpf/
Apache License 2.0
1.21k stars 77 forks source link

Splitting eBPF sampler code #84

Closed mariomac closed 1 year ago

mariomac commented 1 year ago

This will allow start adding new code in separate modules (e.g. PR #85). This code works but we will still need some improvements, which I'd suggest to add in future PRs so we minimize merge conflicts with other tasks going in parallel.

To be done in following PRs:

mariomac commented 1 year ago

@grcevski I see the out-of order problem. Let me rethink the current implementation to see how we can minimize the chance.

Anyway I'm 100% not sure if we are currently safe against out-of-order processing, as eBPF programs can be executed in different CPUs

grcevski commented 1 year ago

@grcevski I see the out-of order problem. Let me rethink the current implementation to see how we can minimize the chance.

Anyway I'm 100% not sure if we are currently safe against out-of-order processing, as eBPF programs can be executed in different CPUs

Ah good point! So I was thinking in the meantime, perhaps we need to expose the BPF map to the go space with the ongoing http and grpc requests. When it comes time to create the nested span, we look up the maps to see if we should first create the parent span or should we simply continue with context.TODO. The parent span when it's event comes through, we look up to see in our in memory cache if someone made it before and if they did, we'll pick that up and end it.

If you think this might work I think we are good to go with this change. We can expose this later. It would mean that the trace span creator code is able to look up the ongoing goroutine ids. An HTTP client call to say an external service, will be able to check if there's gRPC server span with the same goroutine id and do the right thing.

mariomac commented 1 year ago

@grcevski you got a good point. The ongoing requests map can be exposed easily to Go. We just need to make sure that, in the case we need to pick the data up from the Go userspace, nobody elses remove it (e.g. when the ongoing request has ended)

grcevski commented 1 year ago

Here's an example of what I do now.

https://github.com/grafana/ebpf-autoinstrument/pull/85/files#diff-e524f5b7f4b1cda88b689b542fa19e82ab4b408307f76225385b7c87ecb1f850R228

It's inefficient because I send two events for the server span, when we start tracking and on return. But in theory all we need is the goroutine id -> start time, when the client span is processed.

codecov-commenter commented 1 year ago

Codecov Report

Merging #84 (905efe3) into main (7ad75d7) will decrease coverage by 7.64%. The diff coverage is 1.21%.

@@            Coverage Diff             @@
##             main      #84      +/-   ##
==========================================
- Coverage   46.58%   38.94%   -7.64%     
==========================================
  Files          15       20       +5     
  Lines        1172     1330     +158     
==========================================
- Hits          546      518      -28     
- Misses        577      767     +190     
+ Partials       49       45       -4     
Flag Coverage Δ
unittests 38.94% <1.21%> (-7.64%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/ebpf/common/bpf_bpf.go 0.00% <0.00%> (ø)
pkg/ebpf/common/ringbuf.go 0.00% <0.00%> (ø)
pkg/ebpf/grpc/bpf_debug_bpfel_x86.go 0.00% <0.00%> (ø)
pkg/ebpf/grpc/grpc.go 0.00% <0.00%> (ø)
pkg/ebpf/nethttp/bpf_debug_bpfel_x86.go 0.00% <ø> (ø)
pkg/ebpf/nethttp/nethttp.go 0.00% <0.00%> (ø)
pkg/ebpf/tracer.go 0.00% <0.00%> (ø)
pkg/export/otel/metrics.go 78.33% <ø> (ø)
pkg/goexec/instructions.go 0.00% <0.00%> (ø)
pkg/goexec/offsets.go 0.00% <0.00%> (ø)
... and 3 more

... and 1 file with indirect coverage changes

grcevski commented 1 year ago

@grcevski you got a good point. The ongoing requests map can be exposed easily to Go. We just need to make sure that, in the case we need to pick the data up from the Go userspace, nobody elses remove it (e.g. when the ongoing request has ended)

OK, cool, that's what I thought. So the only remaining problem is processing the events on the buffer and the ring buffers are behind, so the map is already cleaned up. I wonder though, maybe all we need is to send more data for the client span, parent ID and start time, perhaps the start time can even be adjusted later.

I think I'm OK with this, we'll find a solution through some of these ideas.

Thanks Mario!