open-telemetry / opentelemetry-ebpf-profiler

The production-scale datacenter profiler (C/C++, Go, Rust, Python, Java, NodeJS, .NET, PHP, Ruby, Perl, ...)
Apache License 2.0

Native frames: file name and gnu build id missing after one hour of run time #244

Open · Gandem opened this issue 18 hours ago

Gandem commented 18 hours ago

What happened?

After the profiler has seen an executable for one hour, mappings for native frames no longer contain the file name field and gnu build id attribute.

Expected behavior: This might happen during startup before the ELF files are parsed/indexed for an executable, but should not happen after multiple minutes of run time.

Reproducing

On latest main (https://github.com/open-telemetry/opentelemetry-ebpf-profiler/commit/9c8ddcebb40ad8091b09ce5e056f14c2e15ac54e):

Root cause

In the OTLP reporter, executables have a one hour lifetime (since https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/215): https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/e40ccaeb3a632f1beb74bc7a7828fb2a63db0412/reporter/otlp_reporter.go#L158

As a consequence, in elastic/go-freelru, the expire field is set to an hour from the first time the executable is seen: https://github.com/elastic/go-freelru/blob/df0880f59e375f4a1a965eb98c1672e49fc22dce/lru.go#L55-L57

In the current implementation, this means that the cache entry will be cleaned up one hour after insertion, regardless of when it was last used or seen: the expire field is only set when the entry is added to the cache in https://github.com/elastic/go-freelru/blob/df0880f59e375f4a1a965eb98c1672e49fc22dce/lru.go#L338, and is never updated afterwards (unless the cache entry is explicitly updated).
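For illustration, a minimal sketch of this behavior (a sketch only, assuming go-freelru's New/SetLifetime/Add/Get signatures; the 3s lifetime stands in for the reporter's 1h):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"

	"github.com/elastic/go-freelru"
)

// hashString is a simple hash callback; go-freelru expects a func(K) uint32.
func hashString(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	cache, _ := freelru.New[string, string](16, hashString)
	cache.SetLifetime(3 * time.Second) // stand-in for the reporter's 1h executable lifetime

	cache.Add("fileID", "file name + GNU build ID")
	for t := 1; t <= 5; t++ {
		time.Sleep(time.Second)
		// Get returns the entry while it is alive, but it does not push back
		// the expire timestamp that was stamped when Add was called.
		_, ok := cache.Get("fileID")
		fmt.Printf("t=%ds cached=%v\n", t, ok)
	}
	// Prints cached=true for t=1..2 and cached=false from roughly t=3 on,
	// even though the entry was read every second.
}
```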

The cache is populated in: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/e40ccaeb3a632f1beb74bc7a7828fb2a63db0412/processmanager/processinfo.go#L315

This function is only called if the ELF file is not in the elfInfoCache: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/e40ccaeb3a632f1beb74bc7a7828fb2a63db0412/processmanager/processinfo.go#L249-L253

The TTL of that cache is 6 hours: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/e40ccaeb3a632f1beb74bc7a7828fb2a63db0412/processmanager/manager.go#L45

As a consequence, after 1h of continuous runtime for an executable, it is removed from the reporter's executables cache but not re-added until 5h later, leaving the file name and gnu build id fields missing in the meantime.
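To make the 1h/6h interaction concrete, here is a self-contained simulation with toy lifetimes (a sketch, not the profiler's code; it only assumes go-freelru's documented API):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"

	"github.com/elastic/go-freelru"
)

func hashString(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	// Toy stand-ins: "executables" expires after 2s (1h in reality),
	// "elfInfoCache" after 6s (6h in reality).
	executables, _ := freelru.New[string, string](16, hashString)
	executables.SetLifetime(2 * time.Second)
	elfInfoCache, _ := freelru.New[string, string](16, hashString)
	elfInfoCache.SetLifetime(6 * time.Second)

	// Heavily simplified version of the getElfInfo path: executable metadata
	// is only (re)reported on an elfInfoCache miss.
	lookup := func(fileID string) {
		if _, ok := elfInfoCache.Get(fileID); !ok {
			elfInfoCache.Add(fileID, "elf info")
			executables.Add(fileID, "file name + GNU build ID")
		}
	}

	for t := 0; t < 8; t++ {
		lookup("fileID")
		_, ok := executables.Get("fileID")
		fmt.Printf("t=%ds executables cache has metadata: %v\n", t, ok)
		time.Sleep(time.Second)
	}
	// Metadata is present for t=0..1, missing for t=2..5 (the "5h gap"),
	// and only reappears at t=6 once the elfInfoCache entry expires.
}
```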

Alternative reproduction

It's possible to reproduce this by lowering the lifetime of the executables cache to 1 min. Below is a helper gRPC Python server that showcases the issue:

Python gRPC server script to showcase the issue:

```py
from concurrent import futures
import logging
import datetime

import grpc

import opentelemetry.proto.collector.profiles.v1experimental.profiles_service_pb2 as otelprotodef
import opentelemetry.proto.collector.profiles.v1experimental.profiles_service_pb2_grpc as otelproto


class Exporter(otelproto.ProfilesService):
    hash_to_exec_info = {}

    def Export(self, request: otelprotodef.ExportProfilesServiceRequest, target):
        print(datetime.datetime.now(), "Received message!")
        for resource in request.resource_profiles:
            for scope in resource.scope_profiles:
                for container in scope.profiles:
                    profile = container.profile
                    string_table = profile.string_table
                    attribute_table = profile.attribute_table
                    mappings = profile.mapping
                    for mapping in mappings:
                        htl_hash, build_id = "", ""
                        filename = string_table[mapping.filename]
                        for attribute in mapping.attributes:
                            if attribute_table[attribute].key == "process.executable.build_id.htlhash":
                                htl_hash = attribute_table[attribute].value.string_value
                            if attribute_table[attribute].key == "process.executable.build_id.gnu":
                                build_id = attribute_table[attribute].value.string_value
                        if build_id != "" or (filename != "UNKNOWN" and filename != ""):
                            self.hash_to_exec_info[htl_hash] = (build_id, filename)
                        if self.hash_to_exec_info.get(htl_hash, None) is not None and build_id == "" and (filename == "" or filename == "UNKNOWN"):
                            print(datetime.datetime.now(), "Lost exec info data for executable", self.hash_to_exec_info[htl_hash], build_id, filename)
        return otelprotodef.ExportProfilesServiceResponse()


def serve():
    port = "8000"
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    otelproto.add_ProfilesServiceServicer_to_server(Exporter(), server)
    server.add_insecure_port("[::]:" + port)
    server.start()
    print("Server started, listening on " + port)
    server.wait_for_termination()


if __name__ == "__main__":
    logging.basicConfig()
    serve()
```
Sample output when running after 1 min:

```shell
2024-11-19 08:58:29.933295 Lost exec info data for executable ('ca3ef4796503671b33573ac8621ffd9c4c237ec7', 'process-agent') UNKNOWN
2024-11-19 08:58:29.933319 Lost exec info data for executable ('490fef8403240c91833978d494d39e537409b92e', 'libc.so.6') UNKNOWN
2024-11-19 08:58:29.933332 Lost exec info data for executable ('80541b54620ad5e5bb1011b3cf4573c3e4428c9c', 'libjvm.so') UNKNOWN
2024-11-19 08:58:29.933435 Lost exec info data for executable ('d1f40ab9db7631ede9d147f878b1c8e3f515c730', 'agent') UNKNOWN
2024-11-19 08:58:29.933448 Lost exec info data for executable ('912362c73f5bc409aad92bd06447d1642a3e7b0e', 'trace-agent') UNKNOWN
2024-11-19 08:58:29.933459 Lost exec info data for executable ('b1169cdfa04f518b407de97242d9e07568ff743b', 'ebpf-profiler') UNKNOWN
2024-11-19 08:58:29.933470 Lost exec info data for executable ('30840b79ac329ecbf1dec0bb60180eed256d319f', 'libz.so.1.2.11') UNKNOWN
2024-11-19 08:58:29.933482 Lost exec info data for executable ('a38e3be1b89000e883290cc72b54b5635fb97043', 'sshd') UNKNOWN
```

Potential fixes

A couple of potential options/questions come to mind:

Let me know whether there's any preference as to how we should fix this. Happy to open a PR in the relevant repository 🙇 !

Notes

It's unclear to me whether this is related to https://github.com/open-telemetry/opentelemetry-ebpf-profiler/issues/235. I suspect (though I would need to dig further to prove this) that #235 could be related to the expiry of active fields in the frames cache in the reporter.

rockdaboot commented 16 hours ago

> 235 could be related to the expiry of active fields

I don't think so. My tests only ran a few minutes, which is way before any lifetime kicks in.

rockdaboot commented 16 hours ago

The lifetimes of individual cache entries can be "refreshed" with AddWithLifetime.
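For illustration, a hypothetical refresh-on-use helper (a sketch, not the reporter's current code; it assumes go-freelru's SyncedLRU with Get/AddWithLifetime):

```go
package sketch

import (
	"time"

	"github.com/elastic/go-freelru"
)

// lookupAndRefresh is a hypothetical helper: on a cache hit, re-adding the
// entry via AddWithLifetime re-stamps its expire time, so entries that are
// still in use keep having their lifetime extended.
func lookupAndRefresh(cache *freelru.SyncedLRU[string, string], key string, lifetime time.Duration) (string, bool) {
	v, ok := cache.Get(key)
	if ok {
		cache.AddWithLifetime(key, v, lifetime)
	}
	return v, ok
}
```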

But before going that route, iirc the idea of having lifetimes is to allow wiping unused entries when the cache isn't full, so that the GC can do its work.

If regularly used entries are expired, the entries should be reinserted as soon as seen/required. Maybe there is a logic gap that needs to be closed here (need to take a deeper look into the code).

Gandem commented 15 hours ago

> If regularly used entries are expired, the entries should be reinserted as soon as seen/required. Maybe there is a logic gap that needs to be closed here (need to take a deeper look into the code).

Gotcha, thanks for the context 👍 If that's the intent, then the issue likely lies in when ExecutableMetadata is called.

In processinfo.go, it's called at most once every 6 hours, due to the elfInfoCache: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/e40ccaeb3a632f1beb74bc7a7828fb2a63db0412/processmanager/processinfo.go#L249-L253

In dotnet, it's called only when mappings are synchronized: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/9c8ddcebb40ad8091b09ce5e056f14c2e15ac54e/interpreter/dotnet/instance.go#L628

So right now, nothing guarantees that entries are being reinserted, AFAICT.

Design-wise though, I don't see how we could enforce that all callers of ExecutableMetadata frequently re-insert the data into the cache.

rockdaboot commented 15 hours ago

Sorry, now I get your point about the elfInfoCache and the executables cache being out of sync.

In processinfo.go, both caches serve different purposes, but need to be kept in sync. The simplest option might be to drop the reporter.ExecutableKnown() call and use the same lifetime for both caches. Or, instead of syncing the lifetimes, we drop the lifetime completely from the executables cache and live with a small amount of stale items.

In the dotnet code we have a similar pattern (globalPeCache with 6h TTL).

What other options do we have that won't add too much complexity?

rockdaboot commented 15 hours ago

Another slightly different variant to synchronize caches:

Gandem commented 15 hours ago

I took a deeper look as well, and it looks like getElfInfo is called by processNewExecMapping, which in turn is called by synchronizeMappings, but only for new mappings: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/6ad74a74c228af2e1ae0063739598b48a30eda9e/processmanager/processinfo.go#L473-L476

So even if we synchronized elfInfoCache with the executables cache, we'd still have the issue. Is my understanding correct?

In the short term, it sounds like the easiest approach would be to drop the lifetime on executables. What do you think?

In the long term though, I'm not sure we can enforce that all callers of ExecutableMetadata are frequently inserting the data again in the cache. One solution would be to have a cache structure that doesn't expire actively used data.

In the case of the executables cache, since it maps file ID -> (file name, build ID), and assuming that file IDs and build IDs are unique, we'd probably be okay with storing this data indefinitely in the cache (without needing to refresh it) as long as it's actively being used.
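For illustration, one possible shape of such a structure is a thin sliding-expiration wrapper around go-freelru, where every hit re-stamps the entry's lifetime so actively used entries never expire (hypothetical names; it assumes the Get/AddWithLifetime signatures):

```go
package sketch

import (
	"time"

	"github.com/elastic/go-freelru"
)

// slidingLRU is a hypothetical wrapper: an entry only expires after `lifetime`
// of inactivity, because every hit re-adds it and re-stamps its expiry.
type slidingLRU[K comparable, V any] struct {
	lru      *freelru.SyncedLRU[K, V]
	lifetime time.Duration
}

func (c *slidingLRU[K, V]) Add(key K, value V) {
	c.lru.AddWithLifetime(key, value, c.lifetime)
}

func (c *slidingLRU[K, V]) Get(key K) (V, bool) {
	v, ok := c.lru.Get(key)
	if ok {
		// Push the expire time out again on every hit.
		c.lru.AddWithLifetime(key, v, c.lifetime)
	}
	return v, ok
}
```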

rockdaboot commented 13 hours ago

> One solution would be to have a cache structure that doesn't expire actively used data.

Opened an issue for discussion: https://github.com/elastic/go-freelru/issues/63

Once we agree, an LRU API can be implemented relatively quickly. Should we go that route?