vsbuffalo / granges

A Rust library and command line tool for working with genomic ranges and their data.
90 stars 5 forks source link

add mimalloc #9

Open rob-p opened 1 month ago

rob-p commented 1 month ago

I created this branch to play around with potential optimizations. However, right now this PR just contains one (very trivial) optimization. It replaces the global allocator with Mimalloc. This is a highly-optimized allocator that generally outperforms the system malloc. Given the propensity of Granges to perform many small allocations, such a change seems to help. I'd be very interested in your measurements!

jianshu93 commented 1 month ago

Hi Rob,

I saw 5% to 10% improvement at least using my random test dataset. I think it is worth it. Thanks for bring the mimalloc into attention. I have some problem with jemalloc but it seems mimalloc is perfectly compatible with system default malloc while being faster.

Thanks,

Jianshu

molpopgen commented 1 month ago

I would suggest that binaries including granges should define their global allocators in the file containing main. Unless things have changed, you can run into a mess when library crates define global allocators. For example, if a Vec<_> from another crate were passed to granges, they may not be compatible due to different global allocatorrs.

jianshu93 commented 1 month ago

Yes exactly. The best way is to add a feature like use_mimalloc. So that users can choose whehter they want to use it or not. I have tried in other crates with very frequent heap allocation. Let me know if this makes sense.

Thanks,

Jianshu

molpopgen commented 1 month ago

There is a matter of taste/style here. The other allocator may improve the performance of the current binary. If that appeared generally true (e..g, not just on macos or whatever), then maybe it should not be feature gated for the binary. The reason is that, imagine allocator A is best on some hardware, and B is best on another. Once you feature-gate in mimalloc ("A"), you cannot add a feature in for B, because you break the expectation that features are strictly additive.

rob-p commented 1 month ago

These are good points. I think the claim is that it makes sense to add the global allocator to the granges binary, and then suggest (e.g. in the README and documentation) that consumers of the library might consider adopting this allocator for best performance.

One thing that I wanted to mention to @jianshu93 is that, while mimalloc provides a nice bump in the current testing, the gains are often even larger in the multi-threaded context — it's improvement in allocation under thread contention is even greater.

@molpopgen: I don't think the improvements of mimalloc over the standard allocator are particularly hardware dependent. Their benchmarks (and my experience) suggest its generally faster at the same basic memory usage.

jianshu93 commented 1 month ago

Hello both,

I actually test it via several different machines in a parallel setting. Intel(R) Xeon(R) Gold 6226, AMD EPYC 7713, AMD EPYC 9534 . The performance bump is more pronounced on the most recent AMD EPYC 9534 for several crate, e.g, graph (skip list, well-known for current/parallel) as the underlying structure and also this crate. I never saw a 20% improvement or something, most around 8% to 15%. Should be interesting to test more different data structure because the mimalloc paper reported only for several give structures.

Thanks,

Jianshu