grafana / beyla

eBPF-based autoinstrumentation of web applications and network metrics
https://grafana.com/oss/beyla-ebpf/
Apache License 2.0
1.23k stars 78 forks source link

Avoid that processes not matched by discovery are instrumented #376

Closed mariomac closed 7 months ago

mariomac commented 7 months ago

Once Beyla instrumented an executable file, all the instances of that executable were instrumented, even if the user only selected one given process (e.g. by port).

This can be an importan issue if run multiple services in e.g. in python or node but only want to instrument one of them, and not all the services run by the python or node executable.

This PR makes Beyla to account the PIDs of the processes that match the discovery selection criteria, and filters the traces from these processes that are not in that group.

grcevski commented 7 months ago

Sounds great, just mark it when it's ready for review :)!

codecov-commenter commented 7 months ago

Codecov Report

Attention: 49 lines in your changes are missing coverage. Please review.

Comparison is base (2c2ea40) 80.84% compared to head (656d857) 80.56%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #376 +/- ## ========================================== - Coverage 80.84% 80.56% -0.28% ========================================== Files 53 55 +2 Lines 4120 4318 +198 ========================================== + Hits 3331 3479 +148 - Misses 609 655 +46 - Partials 180 184 +4 ``` | [Flag](https://app.codecov.io/gh/grafana/beyla/pull/376/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana) | Coverage Δ | | |---|---|---| | [integration-test](https://app.codecov.io/gh/grafana/beyla/pull/376/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana) | `69.47% <80.29%> (+0.20%)` | :arrow_up: | | [unittests](https://app.codecov.io/gh/grafana/beyla/pull/376/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana) | `42.82% <25.54%> (-0.45%)` | :arrow_down: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana) | Coverage Δ | | |---|---|---| | [pkg/internal/discover/finder.go](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana#diff-cGtnL2ludGVybmFsL2Rpc2NvdmVyL2ZpbmRlci5nbw==) | `91.42% <100.00%> (ø)` | | | [pkg/internal/discover/typer.go](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana#diff-cGtnL2ludGVybmFsL2Rpc2NvdmVyL3R5cGVyLmdv) | `89.62% <100.00%> (+0.63%)` | :arrow_up: | | [pkg/internal/discover/watcher.go](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana#diff-cGtnL2ludGVybmFsL2Rpc2NvdmVyL3dhdGNoZXIuZ28=) | `90.90% <100.00%> (-3.04%)` | :arrow_down: | | [pkg/internal/ebpf/common/ringbuf.go](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana#diff-cGtnL2ludGVybmFsL2VicGYvY29tbW9uL3JpbmdidWYuZ28=) | `85.05% <100.00%> (ø)` | | | [pkg/internal/ebpf/common/spanner.go](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana#diff-cGtnL2ludGVybmFsL2VicGYvY29tbW9uL3NwYW5uZXIuZ28=) | `78.04% <100.00%> (+1.42%)` | :arrow_up: | | [pkg/internal/ebpf/httpfltr/httpfltr\_transform.go](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana#diff-cGtnL2ludGVybmFsL2VicGYvaHR0cGZsdHIvaHR0cGZsdHJfdHJhbnNmb3JtLmdv) | `100.00% <100.00%> (ø)` | | | [pkg/internal/request/span.go](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana#diff-cGtnL2ludGVybmFsL3JlcXVlc3Qvc3Bhbi5nbw==) | `100.00% <ø> (ø)` | | | [pkg/internal/ebpf/goruntime/goruntime.go](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana#diff-cGtnL2ludGVybmFsL2VicGYvZ29ydW50aW1lL2dvcnVudGltZS5nbw==) | `95.91% <87.50%> (-4.09%)` | :arrow_down: | | [pkg/internal/ebpf/gosql/gosql.go](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana#diff-cGtnL2ludGVybmFsL2VicGYvZ29zcWwvZ29zcWwuZ28=) | `91.30% <87.50%> (-2.82%)` | :arrow_down: | | [pkg/internal/ebpf/grpc/grpc.go](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana#diff-cGtnL2ludGVybmFsL2VicGYvZ3JwYy9ncnBjLmdv) | `97.36% <88.23%> (-2.64%)` | :arrow_down: | | ... and [5 more](https://app.codecov.io/gh/grafana/beyla/pull/376?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=grafana) | |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mariomac commented 7 months ago

@grcevski logging the received traces at the client side, I see that neither the Host PID nor the User PID aren't recognized, so that's why the traces are filtered out.

It's strange because the reported PIDs are always 13 (on my local docker) and there isn't any process with that ID. Maybe we aren't setting them correctly or we are pointing to a wrong memory zone in the BPF side?

grcevski commented 7 months ago

@grcevski logging the received traces at the client side, I see that neither the Host PID nor the User PID aren't recognized, so that's why the traces are filtered out.

That's interesting! The eBPF side let it go now, which means we matched it correctly based on the namespaced PID on the eBPF side, produced the event, but the user space is filtering it out. Likely the pids we collect at event generation are not right.

I also noticed one more issue at the time we run the process discovery loop. We record the active port numbers, but with client programs they will grow to large arrays because it seems we are capturing the ephemeral ports too. See how the list grows in the output:

msg="new process watching events" component=discover.Watcher interval=500ms events="[{Type:0 Obj:{pid:1 openPorts:[33348 33400 33332 33318 33388 33312 33372 33356 33324 40604 40608 33338]}}]"
grcevski commented 7 months ago

OK, I reproduced it with some extra debugging information. For me it seems the node client is spawning new processes to handle the outgoing request:

msg="Found event" component=httpfltr.Tracer "Host PID"=1265 "NS Pid"=1265 "NS ID"=4026531836
msg="Found event" component=httpfltr.Tracer "Host PID"=1276 "NS Pid"=1276 "NS ID"=4026531836
...

It seems our bpf code which is matching based on parent pid is what works.

grcevski commented 7 months ago

Ah there's more to it, the bfp side sees it as the correct single process as it should be, it's always that host id. I think we are reading the host pid wrong in the task_pid function.

node-3418467 [008] d..31 248011.792338: bpf_trace_printk: Sending client buffer GET / HTTP/1.1
mariomac commented 7 months ago

@grcevski

I also noticed one more issue at the time we run the process discovery loop. We record the active port numbers, but with client programs they will grow to large arrays because it seems we are capturing the ephemeral ports too. See how the list grows in the output:

Yeah, this will be handled in the next PR, when we handle the removal of processes after they stop existing. Also the closed ports will be removed.