Gandem opened this issue 18 hours ago
> #235 could be related to the expiry of active fields
I don't think so. My tests only ran a few minutes, which is way before any lifetime kicks in.
The lifetimes of individual cache entries can be "refreshed" with `AddWithLifetime`.
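For illustration, a rough sketch of that kind of refresh with go-freelru (the key/value types and the hash helper here are made up for the example, not the profiler's real ones):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"

	"github.com/elastic/go-freelru"
)

// hashString adapts string keys to go-freelru's uint32 hash callback.
func hashString(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	// Hypothetical executables-style cache: file ID -> file name.
	cache, err := freelru.New[string, string](8192, hashString)
	if err != nil {
		panic(err)
	}

	// Initial insert: the entry expires one hour from now.
	cache.AddWithLifetime("fileID-1", "libc.so.6", time.Hour)

	// Re-adding the same key later "refreshes" the expiry: the entry now
	// lives for another hour from this point.
	cache.AddWithLifetime("fileID-1", "libc.so.6", time.Hour)

	if name, ok := cache.Get("fileID-1"); ok {
		fmt.Println("still cached:", name)
	}
}
```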
But before going that route, iirc the idea of having lifetimes is to allow wiping unused entries when the cache isn't full, so that the GC can do its work.
If regularly used entries are expired, the entries should be reinserted as soon as seen/required. Maybe there is a logic gap that needs to be closed here (need to take a deeper look into the code).
> If regularly used entries are expired, the entries should be reinserted as soon as seen/required. Maybe there is a logic gap that needs to be closed here (need to take a deeper look into the code).
Gotcha, thanks for the context 👍 If that's the intent, then the issue likely lies in when `ExecutableMetadata` is called.
In processinfo, it's called once every 6 hours, due to the elfInfoCache: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/e40ccaeb3a632f1beb74bc7a7828fb2a63db0412/processmanager/processinfo.go#L249-L253
In dotnet, it's called only when mappings are synchronized: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/9c8ddcebb40ad8091b09ce5e056f14c2e15ac54e/interpreter/dotnet/instance.go#L628
So right now, nothing guarantees that entries are being reinserted, AFAICT.
Design-wise though, I don't see how we could enforce that all callers of `ExecutableMetadata` reinsert the data into the cache frequently enough.
Sorry, now I get your point about the elfInfoCache and the executables cache being out of sync.
In processinfo.go, both caches serve different purposes, but they need to be kept in sync. The simplest options might be to drop the call to reporter.ExecutableKnown() and use the same lifetime for both caches. Or, instead of syncing the lifetimes, we drop the lifetime completely from the executables cache and live with a small amount of stale items.
In the dotnet code we have a similar pattern (globalPeCache with 6h TTL).
What other options do we have that won't add too much complexity?
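For reference, the "drop the lifetime" option above would roughly look like this with go-freelru (a minimal sketch; the key/value types and hash helper are placeholders, not the reporter's actual ones):

```go
package main

import (
	"hash/fnv"
	"time"

	"github.com/elastic/go-freelru"
)

// hashString adapts string keys to go-freelru's uint32 hash callback.
func hashString(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	// Stand-in for the reporter's executables cache (real key/value types differ).
	executables, _ := freelru.New[string, string](8192, hashString)

	// With a lifetime (current behavior): the entry expires one hour after
	// insertion, even if it is looked up constantly in between.
	executables.AddWithLifetime("fileID-1", "libc.so.6", time.Hour)

	// Without a lifetime ("drop the lifetime" option): the entry only leaves
	// the cache through LRU eviction once the cache is full, at the cost of
	// keeping a few stale items around.
	executables.Add("fileID-2", "libssl.so.3")
}
```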
Another slightly different variant to synchronize caches:
I took a deeper look as well, and it looks like `getElfInfo` is called by `processNewExecMapping`, which in turn is called by `synchronizeMappings`, but only for new mappings:
https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/6ad74a74c228af2e1ae0063739598b48a30eda9e/processmanager/processinfo.go#L473-L476
So even if we synchronized elfInfoCache with the executables cache, we'd still have the issue. Is my understanding correct?
In the short term, it sounds like the easiest approach would be to drop the lifetime on executables. What do you think?
In the long term though, I'm not sure we can enforce that all callers of ExecutableMetadata are frequently inserting the data again in the cache. One solution would be to have a cache structure that doesn't expire actively used data.
In the case of the executables cache, since it maps file ID -> (file name, build ID), and assuming that file IDs and build IDs are unique, we'd probably be okay with storing this data indefinitely in the cache (without needing to refresh it) as long as it's actively being used.
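To make that concrete, here is a rough sketch (type and helper names are invented for illustration) of a small wrapper whose Get re-adds the entry, so actively used data never expires while idle entries still age out:

```go
package main

import (
	"hash/fnv"
	"time"

	"github.com/elastic/go-freelru"
)

// hashString adapts string keys to go-freelru's uint32 hash callback.
func hashString(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// refreshingCache is a hypothetical wrapper around go-freelru that re-adds an
// entry on every successful lookup, so actively used entries never expire
// while idle ones still age out after `lifetime`.
type refreshingCache[V any] struct {
	lru      *freelru.LRU[string, V]
	lifetime time.Duration
}

func newRefreshingCache[V any](capacity uint32, lifetime time.Duration) (*refreshingCache[V], error) {
	l, err := freelru.New[string, V](capacity, hashString)
	if err != nil {
		return nil, err
	}
	return &refreshingCache[V]{lru: l, lifetime: lifetime}, nil
}

func (c *refreshingCache[V]) Add(key string, value V) {
	c.lru.AddWithLifetime(key, value, c.lifetime)
}

func (c *refreshingCache[V]) Get(key string) (V, bool) {
	v, ok := c.lru.Get(key)
	if ok {
		// Re-adding pushes the expiry another `lifetime` into the future.
		c.lru.AddWithLifetime(key, v, c.lifetime)
	}
	return v, ok
}

func main() {
	cache, _ := newRefreshingCache[string](8192, time.Hour)
	cache.Add("fileID-1", "libc.so.6")
	cache.Get("fileID-1") // keeps the entry alive for another hour
}
```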
> One solution would be to have a cache structure that doesn't expire actively used data.
Opened an issue for discussion: https://github.com/elastic/go-freelru/issues/63
Once we agree, an LRU API can be implemented relatively quickly. Should we go that route?
What happened?
After the profiler has seen an executable for one hour, mappings for native frames no longer contain the file name field and gnu build id attribute.
Expected behavior: This might happen during startup before the ELF files are parsed/indexed for an executable, but should not happen after multiple minutes of run time.
Reproducing
On latest main (https://github.com/open-telemetry/opentelemetry-ebpf-profiler/commit/9c8ddcebb40ad8091b09ce5e056f14c2e15ac54e):
- Run `sudo ./ebpf-profiler -collection-agent=127.0.0.1:11000 -disable-tls -reporter-interval=1m`
- After about one hour, check the mappings for native frames: the file name and GNU build ID are missing (this can be observed for the ebpf-profiler itself)

Root cause
In the OTLP reporter, executables have a one hour lifetime (since https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/215): https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/e40ccaeb3a632f1beb74bc7a7828fb2a63db0412/reporter/otlp_reporter.go#L158
As a consequence, in elastic/go-freelru, the `expire` field is set to an hour from the first time the executable is seen: https://github.com/elastic/go-freelru/blob/df0880f59e375f4a1a965eb98c1672e49fc22dce/lru.go#L55-L57

In the current implementation, this means that the cache entry will get cleaned up after one hour, regardless of when it was last used or seen, since this `expire` field is only set when the entry is added to the cache in https://github.com/elastic/go-freelru/blob/df0880f59e375f4a1a965eb98c1672e49fc22dce/lru.go#L338 and is never updated afterwards (unless the cache entry is explicitly updated).
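A scaled-down sketch of that behavior (assuming, as the linked code suggests, that lookups do not refresh the per-entry expiry; the cache types and keys here are placeholders):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"

	"github.com/elastic/go-freelru"
)

// hashString adapts string keys to go-freelru's uint32 hash callback.
func hashString(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func main() {
	cache, _ := freelru.New[string, string](8192, hashString)

	// Scaled-down stand-in for the reporter's one-hour lifetime.
	cache.AddWithLifetime("fileID-1", "libc.so.6", 100*time.Millisecond)

	// Read the entry repeatedly; lookups do not push the expiry back, so the
	// entry disappears ~100ms after insertion despite being actively used.
	for i := 1; i <= 5; i++ {
		time.Sleep(30 * time.Millisecond)
		_, ok := cache.Get("fileID-1")
		fmt.Printf("after %dms: cached=%v\n", i*30, ok)
	}
}
```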
The cache is populated in: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/e40ccaeb3a632f1beb74bc7a7828fb2a63db0412/processmanager/processinfo.go#L315

This function is only called if the ELF file is not in the elfInfoCache: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/e40ccaeb3a632f1beb74bc7a7828fb2a63db0412/processmanager/processinfo.go#L249-L253
The TTL of that cache is 6 hours: https://github.com/open-telemetry/opentelemetry-ebpf-profiler/blob/e40ccaeb3a632f1beb74bc7a7828fb2a63db0412/processmanager/manager.go#L45
As a consequence, after 1h of continuous runtime for an executable, it is removed from the reporter's executables cache, but not re-added until 5h later, leading to missing file name and GNU build ID fields.

Alternative reproduction
It's possible to reproduce this by lowering the lifetime of the executables cache to 1 min. You can find below a helper gRPC Python server that showcases the issue:
Python gRPC server script to showcase the issue
```py
from concurrent import futures
import logging
import datetime

import grpc

import opentelemetry.proto.collector.profiles.v1experimental.profiles_service_pb2 as otelprotodef
import opentelemetry.proto.collector.profiles.v1experimental.profiles_service_pb2_grpc as otelproto


class Exporter(otelproto.ProfilesService):
    hash_to_exec_info = {}

    def Export(self, request: otelprotodef.ExportProfilesServiceRequest, target):
        print(datetime.datetime.now(), "Received message!")
        for resource in request.resource_profiles:
            for scope in resource.scope_profiles:
                for container in scope.profiles:
                    profile = container.profile
                    string_table = profile.string_table
                    attribute_table = profile.attribute_table
                    mappings = profile.mapping
                    for mapping in mappings:
                        htl_hash, build_id = "", ""
                        filename = string_table[mapping.filename]
                        for attribute in mapping.attributes:
                            if attribute_table[attribute].key == "process.executable.build_id.htlhash":
                                htl_hash = attribute_table[attribute].value.string_value
                            if attribute_table[attribute].key == "process.executable.build_id.gnu":
                                build_id = attribute_table[attribute].value.string_value
                        if build_id != "" or (filename != "UNKNOWN" and filename != ""):
                            self.hash_to_exec_info[htl_hash] = (build_id, filename)
                        if self.hash_to_exec_info.get(htl_hash, None) is not None and build_id == "" and (filename == "" or filename == "UNKNOWN"):
                            print(datetime.datetime.now(), "Lost exec info data for executable", self.hash_to_exec_info[htl_hash], build_id, filename)
        return otelprotodef.ExportProfilesServiceResponse()


def serve():
    port = "8000"
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    otelproto.add_ProfilesServiceServicer_to_server(Exporter(), server)
    server.add_insecure_port("[::]:" + port)
    server.start()
    print("Server started, listening on " + port)
    server.wait_for_termination()


if __name__ == "__main__":
    logging.basicConfig()
    serve()
```

Sample output when running after 1min
```shell
2024-11-19 08:58:29.933295 Lost exec info data for executable ('ca3ef4796503671b33573ac8621ffd9c4c237ec7', 'process-agent') UNKNOWN
2024-11-19 08:58:29.933319 Lost exec info data for executable ('490fef8403240c91833978d494d39e537409b92e', 'libc.so.6') UNKNOWN
2024-11-19 08:58:29.933332 Lost exec info data for executable ('80541b54620ad5e5bb1011b3cf4573c3e4428c9c', 'libjvm.so') UNKNOWN
2024-11-19 08:58:29.933435 Lost exec info data for executable ('d1f40ab9db7631ede9d147f878b1c8e3f515c730', 'agent') UNKNOWN
2024-11-19 08:58:29.933448 Lost exec info data for executable ('912362c73f5bc409aad92bd06447d1642a3e7b0e', 'trace-agent') UNKNOWN
2024-11-19 08:58:29.933459 Lost exec info data for executable ('b1169cdfa04f518b407de97242d9e07568ff743b', 'ebpf-profiler') UNKNOWN
2024-11-19 08:58:29.933470 Lost exec info data for executable ('30840b79ac329ecbf1dec0bb60180eed256d319f', 'libz.so.1.2.11') UNKNOWN
2024-11-19 08:58:29.933482 Lost exec info data for executable ('a38e3be1b89000e883290cc72b54b5635fb97043', 'sshd') UNKNOWN
```

Potential fixes
A couple of potential options/questions come to mind:
- Is it intended that go-freelru doesn't take into account whether the cache entry is actively being used before expiring it? We could, for example, add an option in go-freelru to have the expiry lifetime updated when the cache entry is used, so that an actively used entry is not forcefully removed.
- Alternatively, we could refresh the executables cache's lifetime in processinfo (and elsewhere) to ensure that the cache entry is updated. This will have the side effect of extending its lifetime in the cache, and not removing it if it's actively used.

Let me know whether there's any preference as to how we should fix this. Happy to open a PR in the relevant repository 🙇!
Notes
It's unclear to me whether this is related to https://github.com/open-telemetry/opentelemetry-ebpf-profiler/issues/235. I suspect (though I would need to dig further to prove this) that https://github.com/open-telemetry/opentelemetry-ebpf-profiler/issues/235 could be related to the expiry of active fields in the `frames` cache in the reporter.