huggingface / candle

Minimalist ML framework for Rust

Transparent Huge Pages Support #2149

Open michaeleisel opened 1 month ago

michaeleisel commented 1 month ago

When working with CPU tensors on Linux, transparent huge pages (THP) can provide a big speedup; for example, I see a 15% speed increase in my code when I turn THP on. However, on many distributions such as Ubuntu, the default THP setting is "madvise", which means madvise must be called with the proper flag on each memory region we want THP for. NumPy enables THP via madvise for any array of 4 MB or larger when running on Linux 4.6+ (initial commit: https://github.com/numpy/numpy/commit/7180479b7ce3e3b6455da66d0679274671a46bdc). It would be great to have similar behavior in candle. I'm not exactly sure how this would be implemented, since CpuStorage takes a Vec whose pages may already have been faulted in, but calling madvise for all usages of CpuStorage in cpu_backend.rs would probably cover the broad strokes.
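
For concreteness, here is a minimal sketch of the NumPy-style approach. This is not existing candle API: `advise_huge_pages` is a hypothetical helper, it assumes the `libc` crate, and the 4 MB threshold simply mirrors NumPy's.

```rust
/// Hypothetical helper: advise the kernel to back a buffer with huge pages.
/// Mirrors NumPy's approach of only bothering for buffers >= 4 MB.
#[cfg(target_os = "linux")]
fn advise_huge_pages<T>(data: &[T]) {
    const HUGE_PAGE_MIN: usize = 4 * 1024 * 1024; // NumPy's threshold

    let len = std::mem::size_of_val(data);
    if len < HUGE_PAGE_MIN {
        return;
    }

    let page_size = unsafe { libc::sysconf(libc::_SC_PAGESIZE) } as usize;
    let start = data.as_ptr() as usize;
    // madvise requires a page-aligned address: round up to the next page
    // boundary and advise only the fully covered span.
    let aligned = (start + page_size - 1) & !(page_size - 1);
    let advise_len = len.saturating_sub(aligned - start);
    if advise_len == 0 {
        return;
    }

    unsafe {
        // Best effort: ignore the return value, as NumPy does. On kernels
        // or configurations without THP this is simply a no-op failure.
        libc::madvise(aligned as *mut libc::c_void, advise_len, libc::MADV_HUGEPAGE);
    }
}
```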

LaurentMazare commented 1 month ago

Interesting, could you maybe provide a way to replicate your 15% speedup? I'm pretty curious which parts actually get accelerated by THP: whether it's more the loading of the tensors or the ops themselves, and, if it's the ops, which ones benefit the most from it.

michaeleisel commented 1 month ago

Here are some operations and their speeds without THP (left) and with THP (right):

- `Tensor::ones((5000, 5000), ...)`: 22 vs. 63 iters/sec
- `a + a`, where `a` is a 5,000x5,000 tensor: 19 vs. 42 iters/sec
- `a.matmul(a)`, where `a` is a 5,000x5,000 tensor: 1.65 vs. 1.73 iters/sec
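
The exact benchmark harness wasn't shared; a minimal timing loop along these lines should reproduce this kind of measurement (the iteration count and dtype here are assumptions), run once with THP off and once with THP enabled:

```rust
use candle_core::{DType, Device, Tensor};
use std::time::Instant;

fn main() -> candle_core::Result<()> {
    let device = Device::Cpu;
    let a = Tensor::ones((5000, 5000), DType::F32, &device)?;

    let iters = 50;
    let start = Instant::now();
    for _ in 0..iters {
        // Swap in `a.matmul(&a)?` or `Tensor::ones(...)` to time the other ops.
        let _sum = (&a + &a)?;
    }
    println!("{:.2} iters/sec", iters as f64 / start.elapsed().as_secs_f64());
    Ok(())
}
```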