javierhonduco / lightswitch

CPU profiler for Linux written in Rust
MIT License

Improve unwind info persisting failure handling #23

Closed javierhonduco closed 7 months ago

javierhonduco commented 7 months ago

On some production machines, persisting the unwind info fails. We are currently investigating this and so far don't know what the culprit is.

On those hosts we get pretty much 100% unwind errors, which should not happen. This led me to notice that errors while persisting the unwind info aren't handled properly.

For example, once a shard is full, the current code ignores this, wipes the in-memory shard, and assigns a new BPF shard. This is not correct: the in-memory state should only be wiped once the unwind info has actually been persisted.
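To illustrate the intended behaviour, here is a minimal sketch, not the actual lightswitch code. `PersistError`, `UnwindInfoWriter`, `flush`, and the `persist` callback are hypothetical names invented for this example, and the shard is modelled as a plain `Vec<u64>`. The only point it makes is the ordering: the in-memory shard is cleared and a new shard index assigned only after the persist step returns `Ok`.

```rust
/// Hypothetical error returned when writing a shard to the BPF side fails.
#[derive(Debug, PartialEq)]
pub enum PersistError {
    ShardFull,
    MapUpdateFailed(i32),
}

/// Hypothetical holder of the in-memory unwind info for the current shard.
pub struct UnwindInfoWriter {
    pub in_memory_shard: Vec<u64>,
    pub current_shard_index: usize,
}

impl UnwindInfoWriter {
    /// Persist the current shard through `persist` (an assumed stand-in for
    /// the BPF map update). The in-memory state is only wiped, and the next
    /// shard only assigned, if persisting succeeded.
    pub fn flush<F>(&mut self, mut persist: F) -> Result<(), PersistError>
    where
        F: FnMut(usize, &[u64]) -> Result<(), PersistError>,
    {
        // On error, `?` returns early and leaves the state untouched.
        persist(self.current_shard_index, &self.in_memory_shard)?;

        self.in_memory_shard.clear();
        self.current_shard_index += 1;
        Ok(())
    }
}
```

What the caller does with the returned error (retry, evict, bump a metric) is left open here; the sketch only guarantees that a failed persist never discards the in-memory shard.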

Test Plan

Forced some errors in this logic and confirmed that the current in-memory state wasn't wiped. We need failure injection during testing to ensure all these cases are covered and don't regress.
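One possible shape for that failure injection, as a rough sketch only: a fake store that fails on chosen calls, so tests can exercise each error path deterministically. `FakeBpfStore`, `InjectedFault`, and `fail_on_call` are hypothetical names, not part of lightswitch or of any BPF library.

```rust
use std::collections::HashMap;

/// Hypothetical faults a test can ask the fake store to return.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InjectedFault {
    ShardFull,
    MapUpdateFailed,
}

/// Hypothetical in-process stand-in for the BPF maps, for tests only.
#[derive(Default)]
pub struct FakeBpfStore {
    /// Shards that were successfully "persisted", keyed by shard index.
    pub shards: HashMap<usize, Vec<u64>>,
    /// Faults to inject, keyed by zero-based call number.
    pub faults: HashMap<usize, InjectedFault>,
    /// Number of `persist` calls seen so far.
    pub calls: usize,
}

impl FakeBpfStore {
    pub fn new() -> Self {
        Self::default()
    }

    /// Make the `call`-th invocation of `persist` fail with `fault`.
    pub fn fail_on_call(mut self, call: usize, fault: InjectedFault) -> Self {
        self.faults.insert(call, fault);
        self
    }

    /// Store `rows` under `shard_index`, unless a fault is scheduled
    /// for this call, in which case nothing is stored.
    pub fn persist(&mut self, shard_index: usize, rows: &[u64]) -> Result<(), InjectedFault> {
        let call = self.calls;
        self.calls += 1;
        if let Some(fault) = self.faults.get(&call) {
            return Err(*fault);
        }
        self.shards.insert(shard_index, rows.to_vec());
        Ok(())
    }
}
```

A test would build the store with something like `FakeBpfStore::new().fail_on_call(0, InjectedFault::ShardFull)`, run the persisting logic against it, and assert that the in-memory state survives the failed call and is written on the retry.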

cc @gmarler