google / tcmalloc


Per cpu cache of bigger sizes #233

Open thelex0 opened 4 months ago

thelex0 commented 4 months ago

Hi, is it possible to increase the per-CPU cache to values larger than 512 KB? Our application does a lot of 32- and 64-byte allocations, and the maximum tcmalloc cache depth for these size classes is 2048. It would be great if there were an option to enable a 4 MB block size to increase the per-CPU cache depth.
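For context, the only runtime knob we have found is the per-CPU byte limit via MallocExtension, roughly as in the sketch below; as far as I understand, it caps the total bytes cached per CPU rather than the per-size-class depth:

```cpp
#include "tcmalloc/malloc_extension.h"

int main() {
  // Caps the total bytes tcmalloc may cache per CPU; it does not lift the
  // per-size-class depth limit (2048 for the 32/64-byte classes) that this
  // issue is about.
  tcmalloc::MallocExtension::SetMaxPerCpuCacheSize(512 << 10);  // 512 KB
  // ... workload doing many 32/64-byte allocations ...
  return 0;
}
```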

dvyukov commented 4 months ago

Hi @thelex0,

Currently it's not possible. Historically it wasn't possible due to fast-path implications (the code packed four 2-byte offsets into an atomically updated 8-byte word). With my recent commit aa90692cc75ab154db7aec55a331619e87d7c426 it becomes theoretically possible: we can trade an increase of the header back to 8 bytes for effectively infinite slabs (up to 32 GB). This may also open some other interesting possibilities, since we would have so much space that we could move things around (using some scratch space) and, e.g., pop batches with the oldest objects (though I still have not figured out how to do that exactly). cc @v-gogte @ckennelly
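Roughly the tradeoff, sketched with made-up types (illustrative only, not the actual tcmalloc definitions, and my reading of the current layout may be off):

```cpp
#include <atomic>
#include <cstdint>

// Narrow per-size-class header: 16-bit offsets packed into a single word
// that the fast path can read and update atomically. The 16-bit offsets are
// what bound how large the slab (and hence each class's depth) can get.
struct NarrowHeader {
  std::atomic<uint32_t> packed;  // two 16-bit offsets
};

// Widened header: 32-bit offsets, 8 bytes per size class. The addressable
// slab becomes effectively unbounded (up to 32 GB), at the cost of ~4 extra
// bytes per size class, i.e. a few extra L1 cache lines across all classes.
struct WideHeader {
  std::atomic<uint64_t> packed;  // two 32-bit offsets
};
```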

thelex0 commented 4 months ago

If I understand correctly, the new commit makes it possible to use 4-byte offsets, which would increase the possible size of the per-CPU cache? This does not seem like a difficult feature; is there any way I can file a feature request? This would be great for applications with heap sizes of gigabytes and a lot of free memory.

dvyukov commented 4 months ago

If I understand correctly, the new commit makes it possible to use 4-byte offsets, which would increase the possible size of the per-CPU cache?

Correct. The problem is that 8-byte headers will consume about 6 additional cache lines in L1 (4 extra bytes × 85 size classes ≈ 340 bytes, i.e. 4*85/64 ≈ 5.3 64-byte lines, rounded up). So we need to consider it a bit more carefully. If there are more good uses for this, that would make the decision easier.

thelex0 commented 4 months ago

Can you clarify a little what good use cases would look like? As I understand it, extended slab sizes would be optimal for applications that:

1. Make many small allocations under kNumSmall: JSON parsers allocate while building the DOM, and when the data is mostly numbers and floats, performance would improve drastically.
2. Do batch processing or map-reduce/dataflow-style work: loading a large amount of data into memory allocates a lot of objects, and the processing that follows then deallocates them; this pattern will always overflow the per-CPU cache (roughly the shape of the sketch at the end of this comment).
3. In any case, a bigger cache reduces the number of calls to the middle-end, and assuming today's applications have heaps of a couple of gigabytes, it would not meaningfully increase the memory footprint.

If it ends up hurting performance, then maybe it could be a compile-time feature?
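For (2), a toy sketch of the pattern I mean (illustrative only, not a real benchmark):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

int main() {
  constexpr int kBatch = 1 << 20;  // ~1M small objects per load phase
  std::vector<std::unique_ptr<int64_t[]>> batch;
  batch.reserve(kBatch);
  for (int round = 0; round < 8; ++round) {
    // Load phase: many 32-byte allocations, far more than the per-CPU
    // cache depth for this size class.
    for (int i = 0; i < kBatch; ++i) {
      batch.push_back(std::make_unique<int64_t[]>(4));
    }
    // Processing finished: bulk deallocation, which overflows the per-CPU
    // cache again and spills objects back toward the middle-end.
    batch.clear();
  }
  return 0;
}
```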

ckennelly commented 4 months ago

Tuning options and compile-time flags have considerable support costs in the long run (https://abseil.io/fast/52), since they increase the matrix of configurations that needs to be tested and evaluated.

I would recommend first looking at how to reduce the number of allocations (for example: flat data structures [like absl::flat_hash_map], arenas, etc.). This both avoids roundtrips to new/delete and improves cache locality, which helps application performance.
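For illustration, a minimal arena-style sketch using std::pmr::monotonic_buffer_resource as a stand-in for the arena idea (not a tcmalloc API; a dedicated arena library or absl::flat_hash_map would be the more direct versions of the suggestion):

```cpp
#include <cstdint>
#include <memory_resource>
#include <vector>

struct Node { int64_t a, b, c, d; };  // 32 bytes, like the hot size class

int main() {
  // A single buffer backs all the small objects; push_back never allocates
  // after the reserve, and everything is released at once when the arena
  // goes out of scope.
  std::pmr::monotonic_buffer_resource arena(1 << 20);  // ~1 MB
  std::pmr::vector<Node> nodes(&arena);
  nodes.reserve(10'000);
  for (int64_t i = 0; i < 10'000; ++i) {
    nodes.push_back(Node{i, i, i, i});
  }
  return 0;
}
```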

thelex0 commented 4 months ago

Yes, I should consider using arenas, thanks. But from a theoretical point of view, aren't arenas the same thing, just with an explicit access interface? It would be great if you could somehow estimate, on Google's infrastructure, how many applications allocate memory in only a couple of size classes and thus utilize tcmalloc poorly; then I believe adding this feature would be justified.

thelex0 commented 4 months ago

Hello, is there a chance that this will be implemented in the near future, or is there no chance because it is considered useless?