Open mjp41 opened 10 months ago
@licenser @darach are you still using snmalloc. If so do you have any benchmarks that represent your workload? We have some ideas that would benefit your kind of workload for Tremor.
Hi Matt,
We sure are. In tremor, and I have a project at work ( axiom now ) where I think it's a great fit.
We'll ask the community as well
Cheers,
Darach.
On Thu, Sep 14, 2023, 22:15 Matthew Parkinson @.***> wrote:
@Licenser https://github.com/Licenser @darach https://github.com/darach are you still using snmalloc. If so do you have any benchmarks that represent your workload? We have some ideas that would benefit your kind of workload for Tremor.
— Reply to this email directly, view it on GitHub https://github.com/microsoft/snmalloc/issues/634#issuecomment-1720082479, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABX5MW5MOJBESOQQAOE7FLX2NQVNANCNFSM6AAAAAA4YXRFRI . You are receiving this because you were mentioned.Message ID: @.***>
Awesome thanks. Any benchmarks we can run would really help us in justifying the engineering work.
Our CI benchmarks are snmalloc based https://www.tremor.rs/benchmarks/ - relatively boring UI there. The benchmarks we ran a year or two ago for tremor ( and mimalloc before it, and jemalloc before that ) are here: https://github.com/tremor-rs/tremor-runtime/blob/main/bench/README.md
They have changed enough though in themselves and we rewrote the runtime and connectors so the benchmarking code works differently so YMMV
I've spent a while playing with various cache strategies (though nothing very sophisticated, just sort of seeing what the low-hanging fruit was like). For the two workloads I've tried, for three interesting choices of hashes, I currently have these message counts:
Workload | No caching | Perfect assembly | 4-way direct hash |
---|---|---|---|
msgpass |
7,205,380 | 552,781; 35 rings | 1,076,843 |
xmalloc-test |
2,388,551 | 317,057; 5 rings | 317,065 |
The hash here is inspired by https://github.com/skeeto/hash-prospector but is sort of "a third" of that:
hash = slab;
hash *= 0x7feb352d;
return (hash >> 16) & 3;
I'm sure it's possible to do better, but that seems to work alright, though it's not sensitive to the upper bits of the slab! Of note, however, changing the shift to hash >> 30
or performing a prospector
-esque xor-shift prior to multiplication performs significantly worse on msgpass
(and a little worse on xmalloc-test
). The full prospector
hash also does a little worse than the numbers above and is, obviously, a fair bit more expensive.
@nwf-msr observed that we could improve the performance of remote deallocation if the producer does more work on building lists for each slab before returning to the original allocator. This could improve producer/consumer scenarios further.