open-telemetry / opentelemetry-go-instrumentation

OpenTelemetry Auto Instrumentation using eBPF
https://opentelemetry.io
Apache License 2.0

Reduce pointers usage in TargetDetails #640

Open RonFed opened 5 months ago

RonFed commented 5 months ago

TargetDetails seems to use pointers without a good reason, and this might be related to the out-of-memory error seen in #619. When analyzing large binaries, the slices and maps used can grow large, and removing the pointers might be an improvement.
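For illustration, a minimal sketch of the idea (the field and type names below are assumptions, not the actual TargetDetails definition): a slice of values keeps its elements in one contiguous backing array, while a slice of pointers adds a separate heap allocation per element and more work for the GC.

```go
package target

// FuncDetails is a hypothetical stand-in for the per-function
// analysis results stored on TargetDetails.
type FuncDetails struct {
	Name   string
	Offset uint64
}

// Pointer-heavy variant: every element is its own heap allocation
// that the GC has to track individually.
type targetDetailsPtr struct {
	Functions []*FuncDetails
	Libraries map[string]*string
}

// Value variant: the slice elements live in one contiguous backing
// array and map values are stored inline in the map buckets.
type targetDetailsVal struct {
	Functions []FuncDetails
	Libraries map[string]string
}
```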

iblancasa commented 1 week ago

Hi @RonFed. I'm trying to instrument a big executable running in Kubernetes, and the opentelemetry-go-instrumentation container gets OOMKilled every time. I increased the resource limits significantly, but it still isn't enough. I think I'm facing this issue (I have been doing some debugging).

I would like to contribute, but I need some hints. Can you shed some light on this? Thanks!

RonFed commented 1 week ago

Hey @iblancasa, thank you for your interest. Approximately what memory limit did you see exceeded? This is an interesting topic, and I'd start by profiling memory usage (pprof) in a local setup to find the root cause. The TargetDetails struct looked like a good candidate for the problem, but I didn't get a chance to confirm that. Another place that might be relevant is the structfield package, which stores an offset mapping of the structs relevant to instrumentation.
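For reference, a minimal sketch of how such profiling could be wired into a local build (this is not part of the repo; the port is arbitrary): expose the net/http/pprof handlers next to the normal work and inspect the heap with go tool pprof.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers
)

func main() {
	// Serve pprof on a local port alongside the normal work.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... start the instrumentation as usual ...
	select {}
}
```

A heap profile can then be captured while the process is analyzing the target binary, e.g. with `go tool pprof http://localhost:6060/debug/pprof/heap`.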

iblancasa commented 1 week ago

Approximately what memory limit did you see exceeded?

I was trying to do some experiments with the OpenTelemetry Operator to auto-instrument an OpenTelemetry Collector. So... I added a 2 GB limit to the pod and it still gets OOMKilled. I reduced the size of my collector by reducing the number of components, which let the instrumentation execute a few more statements, but it was still not able to load the probes: https://github.com/open-telemetry/opentelemetry-go-instrumentation/blob/9882b86f52d8daf168efee68ddc4442d2acd821f/internal/pkg/instrumentation/manager.go#L207-L214

After reading the comments, I think the issue you described here may be related.

Another place that might be relevant is the structfield package, which stores an offset mapping of the structs relevant to instrumentation.

I agree. But I have been printing the memory usage up to these lines: https://github.com/open-telemetry/opentelemetry-go-instrumentation/blob/9882b86f52d8daf168efee68ddc4442d2acd821f/internal/pkg/instrumentation/manager.go#L207-L214 and it is only around 25 MB at that point. After the load is done, the pod is killed by Kubernetes.
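For context, a sketch of the kind of ad-hoc memory logging described above, using runtime.MemStats (the stage names and call sites are just examples, not the actual debugging code):

```go
package main

import (
	"fmt"
	"runtime"
)

// logMem prints the current heap usage and the total memory obtained
// from the OS, which is closer to what the kernel OOM killer sees.
func logMem(stage string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("%s: heap=%d MiB, sys=%d MiB\n",
		stage, m.HeapAlloc>>20, m.Sys>>20)
}

func main() {
	logMem("before loading probes")
	// ... load probes ...
	logMem("after loading probes")
}
```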

RonFed commented 1 week ago

@iblancasa Are you setting the OTEL_GO_AUTO_SHOW_VERIFIER_LOG env var? I think this can cause large memory allocations as well.

iblancasa commented 1 week ago

I'm not setting that environment variable.

RonFed commented 1 week ago

I tried to reproduce this. Instrumenting the collector, the maximum memory allocated by the instrumentation is ~120 MB in my setup.

iblancasa commented 4 days ago

Oh. Maybe I'm doing something wrong. I'll try again. Thanks!

iblancasa commented 4 days ago

I just tried again, and I can reproduce the problem 100% of the time. I'm using a container image based on Fedora. The last log messages I see are these:

{"level":"info","ts":1719839515.3163679,"logger":"go.opentelemetry.io/auto","caller":"cli/main.go:117","msg":"starting instrumentation..."}
{"level":"info","ts":1719839515.3164241,"logger":"Instrumentation.Manager","caller":"instrumentation/manager.go:222","msg":"Mounting bpffs","allocations_details":{"StartAddr":140352138248192,"EndAddr":140352138772480,"NumCPU":16}}
{"level":"info","ts":1719839515.3165295,"logger":"Instrumentation.Manager","caller":"instrumentation/manager.go:208","msg":"loading probe","name":"google.golang.org/grpc/client"}

After that, it is OOMKilled. I'll create a separate issue.