Open RonFed opened 5 months ago
Hi @RonFed. I'm trying to instrument a big executable running in Kubernetes, and the opentelemetry-go-instrumentation container gets OOMKilled every time. I increased the resource limits a lot, but it still isn't enough. I think I'm facing this issue (I have been doing some debugging).
I would like to contribute, but I need some hints. Can you shed some light? Thanks!
Hey @iblancasa, thank you for your interest.
What is approximately the memory limit you saw exceeded?
This is an interesting topic, and I'd start by profiling memory usage (pprof) in a local setup to get the root cause.
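For example, pprof can be wired in with something like this (a minimal sketch; the port and setup are arbitrary):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Inspect heap usage while the process runs:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the real workload
}
```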
The TargetDetails struct looked like a good candidate for the problem, but I didn't get the chance to confirm that.
Another place that might be relevant is the structfield package, which stores an offset mapping of the structs relevant to instrumentation.
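Conceptually, that mapping is keyed by module/struct/field and version, roughly like this sketch (hypothetical shape and values, not the actual structfield API):

```go
package main

import "fmt"

// offsetKey is a hypothetical lookup key; the real structfield types differ.
type offsetKey struct {
	Module, Struct, Field string
}

func main() {
	// version -> byte offset of the field; values are made up for illustration.
	// The map grows with the number of struct fields and versions tracked.
	offsets := map[offsetKey]map[string]uint64{
		{Module: "google.golang.org/grpc", Struct: "ClientConn", Field: "target"}: {
			"v1.60.0": 48,
		},
	}
	fmt.Println(offsets)
}
```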
> What is approximately the memory limit you saw exceeded?
I was trying to do some experiments with the OpenTelemetry Operator to auto-instrument an OpenTelemetry Collector. So... I added a 2 GiB limit to the pod, and it still gets OOMKilled. I reduced the size of my collector by removing some components, which let the instrumentation execute a few more statements, but it still failed to load the probes: https://github.com/open-telemetry/opentelemetry-go-instrumentation/blob/9882b86f52d8daf168efee68ddc4442d2acd821f/internal/pkg/instrumentation/manager.go#L207-L214
After reading the comments, I think the issue you described here could be related.
> Another place that might be relevant is the structfield package which stores an offset mapping of relevant structs for instrumentation.
I agree. But I have been printing the memory usage up to the point where these lines run: https://github.com/open-telemetry/opentelemetry-go-instrumentation/blob/9882b86f52d8daf168efee68ddc4442d2acd821f/internal/pkg/instrumentation/manager.go#L207-L214 and it is around 25 MB there. After the load is done, the pod is killed by Kubernetes.
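The checkpoints were simple prints along these lines (a sketch based on runtime.ReadMemStats, not my exact statements):

```go
package main

import (
	"log"
	"runtime"
)

// logMemUsage prints Go heap statistics at a named checkpoint.
func logMemUsage(stage string) {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	log.Printf("%s: HeapAlloc=%d MiB Sys=%d MiB NumGC=%d",
		stage, m.HeapAlloc>>20, m.Sys>>20, m.NumGC)
}

func main() {
	logMemUsage("before loading probes")
	// ... load probes ...
	logMemUsage("after loading probes")
}
```

Note that these stats only cover the Go heap; on recent kernels, memory the kernel charges to the container for eBPF objects won't show up here but still counts toward the cgroup limit.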
@iblancasa Are you setting the OTEL_GO_AUTO_SHOW_VERIFIER_LOG env var? I think this can cause large memory allocations as well.
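For context, that env var asks the loader for verbose verifier output, which needs a large buffer for every program load. A sketch of the pattern with cilium/ebpf (how this repo wires it exactly is an assumption on my part):

```go
package main

import (
	"os"

	"github.com/cilium/ebpf"
)

// collectionOpts is a sketch, not the project's actual code.
func collectionOpts() ebpf.CollectionOptions {
	var opts ebpf.CollectionOptions
	if os.Getenv("OTEL_GO_AUTO_SHOW_VERIFIER_LOG") == "true" {
		// Instruction-level logs can make the kernel fill multi-megabyte
		// buffers for every program that gets loaded.
		opts.Programs.LogLevel = ebpf.LogLevelInstruction
	}
	return opts
}

func main() { _ = collectionOpts() }
```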
I'm not setting that environment variable.
I tried to reproduce this. Instrumenting the collector, the maximum memory allocated by the instrumentation was ~120 MB in my setup.
Oh. Maybe I'm doing something wrong. I'll try again. Thanks!
I just tried again, and I reproduce the problem 100% of the time. I'm using a container image based on Fedora. The last log messages I see are these:
{"level":"info","ts":1719839515.3163679,"logger":"go.opentelemetry.io/auto","caller":"cli/main.go:117","msg":"starting instrumentation..."}
{"level":"info","ts":1719839515.3164241,"logger":"Instrumentation.Manager","caller":"instrumentation/manager.go:222","msg":"Mounting bpffs","allocations_details":{"StartAddr":140352138248192,"EndAddr":140352138772480,"NumCPU":16}}
{"level":"info","ts":1719839515.3165295,"logger":"Instrumentation.Manager","caller":"instrumentation/manager.go:208","msg":"loading probe","name":"google.golang.org/grpc/client"}
After that, it is OOMKilled.
I'll create a separate issue.
TargetDetails seems to use pointers without a good reason, which might be related to the out-of-memory error seen in #619. When analyzing large binaries, the slices and maps it holds can get large, and removing the pointers might be an improvement.
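As a hypothetical illustration of the pattern (field names are made up, not the actual TargetDetails layout):

```go
package main

// Func stands in for a per-function record; the real types differ.
type Func struct {
	Name   string
	Offset uint64
}

// Pointer-heavy: every element is a separate heap allocation that the GC
// must track and scan, on top of the slice's backing array.
type targetDetailsPtrs struct {
	Functions []*Func
}

// Value-based: one contiguous backing array, fewer allocations, and better
// cache locality; for large binaries this can be a meaningful saving.
type targetDetailsValues struct {
	Functions []Func
}

func main() {}
```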