Implement slab level cache for remote frees

mjp41 commented 10 months ago

@nwf-msr observed that we could improve the performance of remote deallocation if the producer does more work on building lists for each slab before returning to the original allocator. This could improve producer/consumer scenarios further.

### Tasks
- [ ] Implementing statistics to estimate performance wins.
- [ ] Implement version that doesn't work with snmalloc hardening.
- [ ] Implement version with hardening as well

mjp41 commented 10 months ago

@licenser @darach are you still using snmalloc. If so do you have any benchmarks that represent your workload? We have some ideas that would benefit your kind of workload for Tremor.

darach commented 10 months ago

Hi Matt,

We sure are. In tremor, and I have a project at work ( axiom now ) where I think it's a great fit.

We'll ask the community as well

Cheers,

Darach.

On Thu, Sep 14, 2023, 22:15 Matthew Parkinson @.***> wrote:

@Licenser https://github.com/Licenser @darach https://github.com/darach are you still using snmalloc. If so do you have any benchmarks that represent your workload? We have some ideas that would benefit your kind of workload for Tremor.

— Reply to this email directly, view it on GitHub https://github.com/microsoft/snmalloc/issues/634#issuecomment-1720082479, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABX5MW5MOJBESOQQAOE7FLX2NQVNANCNFSM6AAAAAA4YXRFRI . You are receiving this because you were mentioned.Message ID: @.***>

mjp41 commented 10 months ago

Awesome thanks. Any benchmarks we can run would really help us in justifying the engineering work.

darach commented 10 months ago

Our CI benchmarks are snmalloc based https://www.tremor.rs/benchmarks/ - relatively boring UI there. The benchmarks we ran a year or two ago for tremor ( and mimalloc before it, and jemalloc before that ) are here: https://github.com/tremor-rs/tremor-runtime/blob/main/bench/README.md

They have changed enough though in themselves and we rewrote the runtime and connectors so the benchmarking code works differently so YMMV

nwf-msr commented 10 months ago

I've spent a while playing with various cache strategies (though nothing very sophisticated, just sort of seeing what the low-hanging fruit was like). For the two workloads I've tried, for three interesting choices of hashes, I currently have these message counts:

Workload	No caching	Perfect assembly	4-way direct hash
`msgpass`	7,205,380	552,781; 35 rings	1,076,843
`xmalloc-test`	2,388,551	317,057; 5 rings	317,065

The hash here is inspired by https://github.com/skeeto/hash-prospector but is sort of "a third" of that:

  hash = slab;
  hash *= 0x7feb352d;
  return (hash >> 16) & 3;

I'm sure it's possible to do better, but that seems to work alright, though it's not sensitive to the upper bits of the slab! Of note, however, changing the shift to hash >> 30 or performing a prospector-esque xor-shift prior to multiplication performs significantly worse on msgpass (and a little worse on xmalloc-test). The full prospector hash also does a little worse than the numbers above and is, obviously, a fair bit more expensive.

microsoft / snmalloc

Implement slab level cache for remote frees #634