This rather large commit fundamentally modifies how unwind information
is stored and retrieved. Until now we had a BPF hashmap where we stored
a large C array in the values, and the keys were used to index said
values. We called each entry a "shard", and unwind information would be
written into it. Whenever we had unwind information to store, we would
append it to the current shard and, if needed, continue appending in a
new shard. We effectively treated this as a contiguous chunk of memory
that operated like a bump allocator.
As the unwind information we wanted to append might not fit in the
current shard, we would have to split, or chunk, it. To query it, we
would perform a linear search over an array of chunks.
The above is good for maximising space utilisation, but it has a number
of drawbacks, including:
- Once there is no storage space left, what to do is a difficult
  tradeoff to make. We were wiping the whole state after a few
  profiling rounds to ensure we would keep getting samples. This caused
  a large spike in CPU and memory usage, as we had to re-generate the
  unwind info (it could have been cached, but this is not implemented
  yet), split it, and update it, issuing BPF map updates that could be
  quite large.
- We would allocate all the memory up front, as the BPF map
  configuration is set in stone once the programs are loaded. We were
  using around 400 MB, which might be unsuitable for some environments.
  Dynamically probing the amount of free memory could be possible, but
  it's brittle.
- Due to the above, reliability can be affected. For example, memory
  allocations can fail in high memory pressure situations.
- Removing the unwind information for an object file is not possible.
  Supporting this would mean implementing a full-blown memory allocator
  on top of BPF maps, with all the issues this has, including having to
  deal with fragmentation, etc.
Lastly, we performed a fixed number of binary search iterations to find
the unwind row for a given PC. This limited the maximum chunk size,
which we could have dealt with by inserting extra chunks and
partitioning the unwind info space, which brings us to...
This implementation divides the unwind information into equal chunks,
or pages. The unwind info is split into 16-bit pages, which has a
couple of advantages:
- We can represent the BPF unwind row's PC with a 16-bit number, and
  then address each page with the higher bits of the program counter.
  This has several benefits: the representation is more compact,
  allowing for more rows in every cache line, and we don't have to
  perform unaligned reads, as the structure fits in 8 bytes. Finding
  the right page can be done in O(1) if we index the pages by
  (executable_id, high_page_bits). See the sketch after this list.
- Binary searching within each page is guaranteed to require no more
  than 16 steps.
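To make the layout concrete, here is a minimal sketch of what a
page-relative unwind row, a page key, and a verifier-friendly bounded
binary search could look like. Field names, the CFA/RBP encoding, and
the helper are illustrative assumptions, not the exact definitions in
this commit.

```c
#include <linux/types.h>

// Illustrative page-relative unwind row: the PC is stored relative to
// its 64 KiB page, so the row packs into 8 bytes with no unaligned
// reads.
typedef struct {
    __u16 pc_low;      // low 16 bits of the program counter
    __u8  cfa_type;    // how to compute the canonical frame address
    __u8  rbp_type;    // how to restore the frame pointer
    __s16 cfa_offset;
    __s16 rbp_offset;
} unwind_row_t;

// Pages are found in O(1) by keying on the executable and the high
// bits of the program counter.
typedef struct {
    __u64 executable_id;
    __u64 high_pc_bits; // pc >> 16
} page_key_t;

// Bounded binary search within a single page. Assuming fewer than 2^16
// rows per page, 16 iterations always suffice, which keeps the loop
// bound static for the BPF verifier. Returns the index of the last row
// whose pc_low is <= target, or -1 if there is none.
static int find_unwind_row(const unwind_row_t *rows, __u32 len, __u16 target) {
    __u32 left = 0, right = len;
    int found = -1;

    for (int i = 0; i < 16; i++) {
        if (left >= right)
            break;
        __u32 mid = (left + right) / 2;
        if (rows[mid].pc_low <= target) {
            found = (int)mid;
            left = mid + 1;
        } else {
            right = mid;
        }
    }
    return found;
}
```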
The unwind information is now stored in unwind info buckets. These are
BPF 'maps of maps': their keys are the executable_id and their values
are BPF arrays that store the unwind information rows.
We need different buckets because the verifier requires that, for each
'outer' map, the 'inner' map must be configured before loading the
program, so each bucket can store up to X elements per object.
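As an illustration, one such bucket could be declared with libbpf's
BTF-style map-in-map syntax roughly as below. The map names, sizes, and
the row struct are made up for this sketch and are not the exact
definitions in this commit.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_EXECUTABLES 2048   // illustrative
#define BUCKET_0_ROWS   65536  // illustrative: row capacity of this size class

struct unwind_row {
    __u16 pc_low;
    __u8  cfa_type;
    __u8  rbp_type;
    __s16 cfa_offset;
    __s16 rbp_offset;
};

// Inner map: a plain BPF array with the unwind rows of one executable.
struct inner_unwind_info {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, BUCKET_0_ROWS);
    __type(key, __u32);
    __type(value, struct unwind_row);
};

// Outer map (one "bucket"): executable_id -> inner array. The inner
// map's shape must be known at load time, which is why there is one
// bucket per supported inner-array size.
struct {
    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
    __uint(max_entries, MAX_EXECUTABLES);
    __type(key, __u64);   // executable_id
    __array(values, struct inner_unwind_info);
} unwind_info_bucket_0 SEC(".maps");
```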
This allows us to remove / evict unwind information in a more granular
way, and it doesn't require us to allocate a large amount of locked
memory up front, as we can allocate BPF arrays on the fly and store
them in the 'outer' maps, etc.
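On the userspace side, allocating an inner array on the fly and
publishing it in a bucket could look roughly like the sketch below. The
helper name and parameters are hypothetical, and the created array's
shape has to match the bucket's declared inner map.

```c
#include <bpf/bpf.h>
#include <stdint.h>
#include <unistd.h>

struct unwind_row {       // same illustrative 8-byte layout as above
    uint16_t pc_low;
    uint8_t  cfa_type;
    uint8_t  rbp_type;
    int16_t  cfa_offset;
    int16_t  rbp_offset;
};

// Hypothetical helper: create an inner BPF array sized like the
// bucket's inner map template, fill it with the rows for one
// executable, and store it in the outer "bucket" map under
// executable_id.
static int add_unwind_info(int bucket_fd, uint64_t executable_id,
                           const struct unwind_row *rows, uint32_t num_rows,
                           uint32_t bucket_rows) {
    int inner_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "unwind_rows",
                                  sizeof(uint32_t), sizeof(struct unwind_row),
                                  bucket_rows, NULL);
    if (inner_fd < 0)
        return inner_fd;

    for (uint32_t i = 0; i < num_rows; i++) {
        if (bpf_map_update_elem(inner_fd, &i, &rows[i], BPF_ANY) < 0)
            goto err;
    }

    // The outer map's value is the inner map's file descriptor; the
    // kernel keeps its own reference once the update succeeds.
    if (bpf_map_update_elem(bucket_fd, &executable_id, &inner_fd, BPF_ANY) < 0)
        goto err;

    close(inner_fd);
    return 0;

err:
    close(inner_fd);
    return -1;
}
```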
This commit also changes the following:
- Track when the reference count for object files drops to zero and
  evict them from their BPF maps (see the sketch after this list).
- Remove process data from BPF maps on program exit.
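A minimal sketch of what that cleanup could look like, assuming
userspace tracks per-object reference counts and holds file descriptors
for the relevant maps (function and map names are hypothetical):

```c
#include <bpf/bpf.h>
#include <stdint.h>

// Once the reference count of an object file drops to zero, its unwind
// info can be dropped from the bucket it lives in; deleting the outer
// entry releases the inner array.
static int evict_unwind_info(int bucket_fd, uint64_t executable_id) {
    return bpf_map_delete_elem(bucket_fd, &executable_id);
}

// Per-process state keyed by pid is removed when the process exits.
static int remove_process_info(int process_info_fd, int32_t pid) {
    return bpf_map_delete_elem(process_info_fd, &pid);
}
```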
Test Plan
Many manual runs and CI