This rather large commit fundamentally modifies how unwind information
is stored and retrieved. Until now we had a BPF hashmap where we stored
a large C array in the values, and the keys were used to index said
values. We called each entry a "shard", and unwind information would be
written into it. Whenever we had unwind information to store, we would
append it to the current shard and, if needed, continue appending in a
new shard. We effectively treated this as a contiguous chunk of memory
that operated like a bump allocator.
As the unwind information we wanted to append might not fit in the
current shard, we would have to split, or chunk, it. To query it, we
would perform a linear search over an array of chunks.
The above is good for maximising space utilisation, but it has a number
of drawbacks, including:
- Once there is no storage space left, what to do is a difficult
  tradeoff to make. We were wiping the whole state after a few
  profiling rounds to ensure we would keep getting samples. This caused
  a large spike in CPU and memory usage, as we had to re-generate the
  unwind info (it could have been cached, but this is not implemented
  yet), split it, and update it, issuing BPF map updates that could be
  quite large.
- We would allocate all the memory up front, as the BPF map
  configuration is set in stone once the programs are loaded. We were
  using around 400 MB, which might be unsuitable for some environments.
  Dynamically probing the amount of free memory could be possible, but
  it's brittle.
- Due to the above, reliability can be affected. For example, memory
  allocations can fail in high memory pressure situations.
- Removing the unwind information for an object file is not possible.
  Supporting this would mean implementing a full-blown memory allocator
  on top of BPF maps, with all the issues this has, including having to
  deal with fragmentation, etc.
Lastly, we performed a fixed number of binary search iterations to find
the unwind row for a given PC. This limited the maximum chunk size,
which we could have dealt with by inserting extra chunks and
partitioning the unwind info space, which brings us to...
This implementation divides the unwind information into equal chunks,
or pages. The unwind info is split into 16-bit pages, which has a
couple of advantages:
- We can represent the BPF unwind row's PC with a 16-bit number, and
  then address each page with the higher bits of the program counter.
  This has several benefits: the representation is more compact,
  allowing for more rows in every cache line, and we don't have to
  perform unaligned reads, as the structure fits in 8 bytes. Finding
  the right page can be done in O(1) if we index the pages by
  (executable_id, high_page_bits). See the sketch after this list.
- Binary searching within each page is guaranteed to require no more
  than 16 steps.
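To make the layout concrete, here is a minimal sketch of what a
page-relative unwind row, a page key, and a verifier-friendly bounded
binary search could look like. Field names, the CFA/RBP encoding, and
the helper are illustrative assumptions, not the exact definitions in
this commit.

```c
#include <linux/types.h>

// Illustrative page-relative unwind row: the PC is stored relative to
// its 64 KiB page, so the row packs into 8 bytes with no unaligned
// reads.
typedef struct {
    __u16 pc_low;      // low 16 bits of the program counter
    __u8  cfa_type;    // how to compute the canonical frame address
    __u8  rbp_type;    // how to restore the frame pointer
    __s16 cfa_offset;
    __s16 rbp_offset;
} unwind_row_t;

// Pages are found in O(1) by keying on the executable and the high
// bits of the program counter.
typedef struct {
    __u64 executable_id;
    __u64 high_pc_bits; // pc >> 16
} page_key_t;

// Bounded binary search within a single page. Assuming fewer than 2^16
// rows per page, 16 iterations always suffice, which keeps the loop
// bound static for the BPF verifier. Returns the index of the last row
// whose pc_low is <= target, or -1 if there is none.
static int find_unwind_row(const unwind_row_t *rows, __u32 len, __u16 target) {
    __u32 left = 0, right = len;
    int found = -1;

    for (int i = 0; i < 16; i++) {
        if (left >= right)
            break;
        __u32 mid = (left + right) / 2;
        if (rows[mid].pc_low <= target) {
            found = (int)mid;
            left = mid + 1;
        } else {
            right = mid;
        }
    }
    return found;
}
```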
The unwind information is now stored in unwind info buckets. These are
BPF 'maps of maps': their keys are the executable_id and their values
are BPF arrays that store the unwind information rows.
We need different buckets because the verifier requires that, for each
'outer' map, the 'inner' map must be configured before loading the
program, so each bucket can store up to X elements per object.
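As an illustration, one such bucket could be declared with libbpf's
BTF-style map-in-map syntax roughly as below. The map names, sizes, and
the row struct are made up for this sketch and are not the exact
definitions in this commit.

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

#define MAX_EXECUTABLES 2048   // illustrative
#define BUCKET_0_ROWS   65536  // illustrative: row capacity of this size class

struct unwind_row {
    __u16 pc_low;
    __u8  cfa_type;
    __u8  rbp_type;
    __s16 cfa_offset;
    __s16 rbp_offset;
};

// Inner map: a plain BPF array with the unwind rows of one executable.
struct inner_unwind_info {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, BUCKET_0_ROWS);
    __type(key, __u32);
    __type(value, struct unwind_row);
};

// Outer map (one "bucket"): executable_id -> inner array. The inner
// map's shape must be known at load time, which is why there is one
// bucket per supported inner-array size.
struct {
    __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
    __uint(max_entries, MAX_EXECUTABLES);
    __type(key, __u64);   // executable_id
    __array(values, struct inner_unwind_info);
} unwind_info_bucket_0 SEC(".maps");
```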
This allows us to remove / evict unwind information in a more granular
way, and it doesn't require us to allocate a large amount of locked
memory up front, as we can allocate BPF arrays on the fly and store
them in the 'outer' maps, etc.
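On the userspace side, allocating an inner array on the fly and
publishing it in a bucket could look roughly like the sketch below. The
helper name and parameters are hypothetical, and the created array's
shape has to match the bucket's declared inner map.

```c
#include <bpf/bpf.h>
#include <stdint.h>
#include <unistd.h>

struct unwind_row {       // same illustrative 8-byte layout as above
    uint16_t pc_low;
    uint8_t  cfa_type;
    uint8_t  rbp_type;
    int16_t  cfa_offset;
    int16_t  rbp_offset;
};

// Hypothetical helper: create an inner BPF array sized like the
// bucket's inner map template, fill it with the rows for one
// executable, and store it in the outer "bucket" map under
// executable_id.
static int add_unwind_info(int bucket_fd, uint64_t executable_id,
                           const struct unwind_row *rows, uint32_t num_rows,
                           uint32_t bucket_rows) {
    int inner_fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "unwind_rows",
                                  sizeof(uint32_t), sizeof(struct unwind_row),
                                  bucket_rows, NULL);
    if (inner_fd < 0)
        return inner_fd;

    for (uint32_t i = 0; i < num_rows; i++) {
        if (bpf_map_update_elem(inner_fd, &i, &rows[i], BPF_ANY) < 0)
            goto err;
    }

    // The outer map's value is the inner map's file descriptor; the
    // kernel keeps its own reference once the update succeeds.
    if (bpf_map_update_elem(bucket_fd, &executable_id, &inner_fd, BPF_ANY) < 0)
        goto err;

    close(inner_fd);
    return 0;

err:
    close(inner_fd);
    return -1;
}
```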
This commit also changes the following:
- Track when the reference count for object files drops to zero and
  evict them from their BPF maps (see the sketch after this list).
- Remove process data from BPF maps on program exit.
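A minimal sketch of what that cleanup could look like, assuming
userspace tracks per-object reference counts and holds file descriptors
for the relevant maps (function and map names are hypothetical):

```c
#include <bpf/bpf.h>
#include <stdint.h>

// Once the reference count of an object file drops to zero, its unwind
// info can be dropped from the bucket it lives in; deleting the outer
// entry releases the inner array.
static int evict_unwind_info(int bucket_fd, uint64_t executable_id) {
    return bpf_map_delete_elem(bucket_fd, &executable_id);
}

// Per-process state keyed by pid is removed when the process exits.
static int remove_process_info(int process_info_fd, int32_t pid) {
    return bpf_map_delete_elem(process_info_fd, &pid);
}
```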
Test Plan
Many manual runs and CI