open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

hostmetrics featuring ebpf - resource efficient scraping #32446

Open cforce opened 7 months ago

cforce commented 7 months ago

Component(s)

receiver/hostmetrics

Is your feature request related to a problem? Please describe.

Create a more hardware-resource-efficient alternative for getting the host metrics (kernel/process) via eBPF. If you aren't familiar with eBPF, you can read more about it on ebpf.io, but in short: eBPF allows us to execute sandboxed programs that extend the Linux kernel without having to change it. We can use eBPF to attach to a tracepoint event when a specific system call is made by a process.

Describe the solution you'd like

A: Run a native (C++) program externally (as a daemon?) and let it send to a receiver, e.g. https://github.com/Netflix/bpftop/, or

B: Integrate eBPF scraping into Go (might need a target-platform-dependent build), e.g. by running "eBPF trace scripts" as config, see

Our new sensor uses Inspektor Gadget as its instrumentation layer - allowing us to collect events at the Kernel space and analyze them to provide security insights on workloads running in Kubernetes (insights include those from the host as well as at the container level).

Collect metrics via BPF traces and package them as OTLP metric messages.

Additional context

github-actions[bot] commented 7 months ago

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

kernelpanic77 commented 6 months ago

@dmitryax @braydonk any thoughts here? This sounds like a pretty cool enhancement.

braydonk commented 6 months ago

This sounds like a cool idea, but I don't think it should be in this receiver or in contrib for a few reasons:

So I don't think this should be added to this receiver, but it's not a bad idea. This could instead be a receiver that is published independently that people can include in their own collector builds with the OpenTelemetry Collector Builder. Even better if that receiver could implement the Process Semantic Conventions that are nearing stabilization.

kernelpanic77 commented 6 months ago

@braydonk Yes, it makes sense. We should not start any subprocesses from the collector repository, but I would be willing to contribute to a custom receiver specifically for eBPF. I think we can use this thread to discuss that.

braydonk commented 6 months ago

@kernelpanic77 Some sort of eBPF receiver could be very cool. As long as:

Then that sort of receiver could work well in contrib.

You can find full guidance for introducing new components here: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/CONTRIBUTING.md#adding-new-components

Something that the docs above don't explicitly point out is that you can satisfy those 4 criteria for implementing a component in your own repo and build that into a collector yourself to experiment with implementations, which would help with the process stated in that document of how to add new components to contrib.

I'm not an authority on any of this, so there may be other restrictions I'm not mentioning, or even previous discussions of a component like this that I am not privy to. It may be a good idea to attend the Collector Working Group Meeting on Wednesday at 16:00 UTC. You can find the Zoom link in the OpenTelemetry Calendar. You are welcome to join and add to the agenda.

cforce commented 6 months ago

To implement the Start method for an eBPFReceiver in the OpenTelemetry Collector, we would need to load and attach eBPF programs. What about using the Cilium eBPF library to load eBPF bytecode (or compile it from C), and then attach uprobes and uretprobes to the desired functions?

Once the probes are attached, you'll need to collect the data they generate. This usually involves reading data from a BPF map or receiving events from the kernel.
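As a rough illustration of the reading step, here is a minimal sketch that decodes raw bytes, as they might arrive from a ring buffer or a map value, into Go structs. The `syscallEvent` layout and the `decodeEvents` helper are hypothetical names made up for this sketch; a real eBPF program defines its own record layout, and the bytes would come from an eBPF library rather than a plain slice. Only the Go standard library is used.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"io"
)

// syscallEvent is a hypothetical fixed-size record that an eBPF program
// might emit. The field layout is an assumption made for this sketch.
type syscallEvent struct {
	PID       uint32 // process ID that made the call
	SyscallNr uint32 // syscall number
	Count     uint64 // calls observed since the last read
}

// decodeEvents parses back-to-back little-endian syscallEvent records
// (16 bytes each) from raw.
func decodeEvents(raw []byte) ([]syscallEvent, error) {
	r := bytes.NewReader(raw)
	var out []syscallEvent
	for {
		var ev syscallEvent
		err := binary.Read(r, binary.LittleEndian, &ev)
		if errors.Is(err, io.EOF) {
			return out, nil // input ended exactly on a record boundary
		}
		if err != nil {
			// A truncated trailing record surfaces as io.ErrUnexpectedEOF.
			return nil, err
		}
		out = append(out, ev)
	}
}
```

Rejecting a truncated record instead of silently dropping it is a design choice for the sketch; a receiver might prefer to log and continue.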

Then transform the collected data into a format that OpenTelemetry understands and send it to the collector.
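The transform step could look roughly like the sketch below. It is not the Collector's API: `procSample` and `dataPoint` are hypothetical stand-ins (a real receiver would build metrics with the Collector's pdata types), and the metric name `process.cpu.time` with unit `s` is used on the assumption that the receiver follows the Process Semantic Conventions mentioned earlier in this thread.

```go
package main

import "sort"

// procSample is a hypothetical per-process reading taken from a BPF map.
type procSample struct {
	PID      uint32
	CPUNanos uint64 // cumulative on-CPU time in nanoseconds
}

// dataPoint is a simplified stand-in for an OTLP number data point.
type dataPoint struct {
	Name         string
	Unit         string
	PID          int64 // would be the process.pid attribute in a real metric
	TimeUnixNano uint64
	Value        float64
}

// toDataPoints sums CPU time per PID (a PID may appear several times,
// for example once per CPU) and emits one data point per PID in seconds.
func toDataPoints(samples []procSample, nowUnixNano uint64) []dataPoint {
	totals := make(map[uint32]uint64, len(samples))
	for _, s := range samples {
		totals[s.PID] += s.CPUNanos
	}

	// Sort PIDs so the output order is deterministic.
	pids := make([]uint32, 0, len(totals))
	for pid := range totals {
		pids = append(pids, pid)
	}
	sort.Slice(pids, func(i, j int) bool { return pids[i] < pids[j] })

	points := make([]dataPoint, 0, len(pids))
	for _, pid := range pids {
		points = append(points, dataPoint{
			Name:         "process.cpu.time",
			Unit:         "s",
			PID:          int64(pid),
			TimeUnixNano: nowUnixNano,
			Value:        float64(totals[pid]) / 1e9, // nanoseconds to seconds
		})
	}
	return points
}
```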

To avoid starting subprocesses for loading and compiling eBPF programs, can we embed the eBPF bytecode directly into the Go application to ensure the eBPF bytecode is part of the Go binary and can be loaded directly without external dependencies or subprocesses?
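One way to picture the embedding idea: in a real build the object bytes would presumably come from a `//go:embed` directive pointing at a precompiled object file (the file name `probe.o` below is hypothetical). The sketch takes the bytes as a plain parameter so it compiles without that file, and only shows a cheap sanity check that an embedded blob looks like a 64-bit eBPF ELF object before it is handed to a loader. The `checkBPFObject` helper is an assumption for illustration, not part of any library.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// In a real receiver the bytes would likely be provided like this
// (hypothetical file name, requires importing the embed package):
//
//	//go:embed probe.o
//	var probeObject []byte

const (
	elf64HeaderSize = 64  // size of an ELF64 file header in bytes
	elfClass64      = 2   // EI_CLASS value for 64-bit objects
	elfDataLSB      = 1   // EI_DATA value for little-endian objects
	elfDataMSB      = 2   // EI_DATA value for big-endian objects
	emBPF           = 247 // e_machine value assigned to Linux BPF
)

// checkBPFObject verifies the ELF magic, the 64-bit class, and that
// e_machine (bytes 18..19 of the header) identifies a BPF object.
func checkBPFObject(obj []byte) error {
	if len(obj) < elf64HeaderSize {
		return errors.New("object is shorter than an ELF64 header")
	}
	if string(obj[:4]) != "\x7fELF" {
		return errors.New("missing ELF magic")
	}
	if obj[4] != elfClass64 {
		return errors.New("not a 64-bit ELF object")
	}
	var order binary.ByteOrder
	switch obj[5] {
	case elfDataLSB:
		order = binary.LittleEndian
	case elfDataMSB:
		order = binary.BigEndian
	default:
		return errors.New("unknown ELF data encoding")
	}
	if order.Uint16(obj[18:20]) != emBPF {
		return errors.New("e_machine is not BPF")
	}
	return nil
}
```

With the bytes compiled into the binary, loading needs neither a compiler on the host nor a subprocess, which is the property asked about above.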

cforce commented 5 months ago

"The continuous profiling agent, that Elastic is donating, is based on eBPF and by that a whole system, always-on solution that observes code and third-party libraries, kernel operations, and other code you don’t own. It eliminates the need for code instrumentation (run-time/bytecode), recompilation, or service restarts with low overhead, low CPU (~1%), and memory usage in production environments." https://opentelemetry.io/blog/2024/elastic-contributes-continuous-profiling-agent/

How does this profiling with eBPF finally integrate with the collector? Don't you have very similar challenges to those mentioned above, so the solutions are there already?

braydonk commented 5 months ago

I don't know much about the agent or any particular plans to integrate it. If I had to guess, it's most plausible that this agent won't specifically integrate with the Collector, but would rather support the Collector as it would any OTLP destination (once the Collector allows OTel Profiles as a signal). There may be other plans I'm not aware of, but that is what would make the most sense to me. And in that scenario, the restrictions we talked about that make eBPF tricky for the OTel Collector wouldn't apply.

cforce commented 5 months ago

See https://github.com/elastic/otel-profiling-agent/issues/12

cforce commented 4 months ago

Related CPU optimization: https://github.com/shirou/gopsutil/pull/361

cforce commented 3 months ago

Beyla as an agent integration could be the way forward for better kernel-level, resource-optimized, zero-code instrumentation: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/34321

cforce commented 2 months ago

related - profiling overhead - https://github.com/golang/go/issues/57175

github-actions[bot] commented 5 days ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

cforce commented 1 day ago

related https://github.com/open-telemetry/community/issues/2406