Closed — wesm closed this issue 2 months ago
This is quite similar to #17414.
Interesting to see that this is happening on AMD, however.
We looked into it further and have narrowed it down 100% to our use of `jemallocator`. The following toy example shows exactly the same slowdown we see when using Polars, and no slowdown when `jemallocator` is commented out. `time` indicates that when `jemallocator` is used, almost all of the time is spent not in userland but in the system, indicating either expensive kernel calls that jemallocator makes, or extra page faults that jemallocator causes.
```rust
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

fn sub_one(x: &[f64]) -> Vec<f64> {
    x.iter().map(|x| *x - 1.0).collect()
}

fn main() {
    let v = vec![0.0; 1_000_000];
    let start = std::time::Instant::now();
    for _ in 0..10_000 {
        std::hint::black_box(sub_one(std::hint::black_box(&v)));
    }
    dbg!(start.elapsed() / 10_000);
}
```
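For completeness: reproducing the toy example requires the `jemallocator` crate as a dependency. A minimal `Cargo.toml` fragment (the version number here is an assumption):

```toml
[dependencies]
jemallocator = "0.5"
```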
We will have to figure out if this is something we can resolve by configuring jemalloc, or whether we need to switch to a different allocator altogether.
Ok, we figured out that the slowdown is caused by page faults that the default jemalloc options cause. Setting the jemalloc option `muzzy_decay_ms` to `-1` to disable `MADV_DONTNEED` entirely solves the issue, but might cause overly large memory usage statistics to be reported on some machines (since only `MADV_FREE` will be emitted and not `MADV_DONTNEED`).

Strangely enough, setting `muzzy_decay_ms` to a finite value like `1000` (a 1-second delay) still causes the problem, even when there isn't a second of delay between accesses. I reported the issue to jemalloc here: https://github.com/jemalloc/jemalloc/issues/2688.
Can people try setting `_RJEM_MALLOC_CONF="background_thread:true,dirty_decay_ms:500,muzzy_decay_ms:-1"` before Polars is imported? This will resolve the issue. I am curious what memory behavior looks like for your use cases.
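For Python users, the variable has to be in the environment before the Polars native library loads, because jemalloc reads `_RJEM_MALLOC_CONF` once, at allocator initialization. A minimal sketch (the `import polars` line is left as a comment; the point is only that it must come after the variable is set):

```python
import os

# jemalloc reads _RJEM_MALLOC_CONF once, when the allocator initializes --
# i.e. when the Polars shared library is first loaded.
os.environ["_RJEM_MALLOC_CONF"] = (
    "background_thread:true,dirty_decay_ms:500,muzzy_decay_ms:-1"
)

# import polars as pl  # must come only after the variable is set
```

Setting the variable in the shell before launching Python works just as well and avoids any import-ordering concerns.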
Description
I was working on a histogram implementation for Polars in Positron and I stumbled on surprising performance differences for operations on float64 arrays:
I haven't looked too deeply, but I am on a AMD Ryzen 7 PRO 7840U which has avx2/avx512 extensions, but almost a 10x difference in performance surprised me. I will just convert polars arrays to NumPy arrays and do the work there, but I would be interested to diagnose the problem further to develop an intuition when I should write polars code vs. dropping down to NumPy when doing numerical operations