pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

polars performance with simple array arithmetic much slower than NumPy? #18088

Closed · wesm closed this issue 2 months ago

wesm commented 2 months ago

Description

I was working on a histogram implementation for polars in Positron and I stumbled on surprising performance differences for operations with float64 arrays:

In [8]: arr = np.random.randn(1000000)

In [9]: arrp = pl.Series(arr)

In [10]: timeit arr - 1
397 µs ± 67.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [11]: timeit arrp - 1
3.4 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I haven't looked too deeply, but I am on an AMD Ryzen 7 PRO 7840U, which has AVX2/AVX-512 extensions, so an almost 10x difference in performance surprised me. For now I will just convert polars arrays to NumPy arrays and do the work there, but I would be interested in diagnosing the problem further to develop an intuition for when I should write polars code versus dropping down to NumPy for numerical operations.
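For completeness, a minimal standalone sketch of the benchmark above (outside IPython), using the standard-library timeit module; the array size and iteration counts are chosen to roughly match the session shown:

import timeit

import numpy as np
import polars as pl

arr = np.random.randn(1_000_000)
arrp = pl.Series(arr)

# Time the same subtraction on a NumPy array and on a polars Series.
numpy_time = timeit.timeit(lambda: arr - 1, number=1_000)
polars_time = timeit.timeit(lambda: arrp - 1, number=100)

print(f"numpy:  {numpy_time / 1_000 * 1e6:.1f} µs per loop")
print(f"polars: {polars_time / 100 * 1e3:.2f} ms per loop")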

coastalwhite commented 2 months ago

This is quite similar to #17414.

Interesting to see that this is happening on AMD, however.

orlp commented 2 months ago

We looked into it further and have narrowed it down 100% to our use of jemallocator. The following toy example shows exactly the same slowdown we see when using Polars, and no slowdown when the jemallocator lines are commented out. time shows that when jemallocator is used, almost all of the time is spent not in userland but in the system, indicating either expensive kernel calls made by jemallocator or extra page faults caused by it.

// Use jemalloc as the global allocator, as Polars does.
#[global_allocator]
static ALLOC: jemallocator::Jemalloc = jemallocator::Jemalloc;

// Allocate a fresh output Vec on every call; this allocate-compute-free
// pattern is what triggers the slowdown.
fn sub_one(x: &[f64]) -> Vec<f64> {
    x.iter().map(|x| *x - 1.0).collect()
}

fn main() {
    let v = vec![0.0; 1_000_000];
    let start = std::time::Instant::now();
    for _ in 0..10_000 {
        std::hint::black_box(sub_one(std::hint::black_box(&v)));
    }
    dbg!(start.elapsed() / 10_000);
}

We will have to figure out whether this is something we can resolve by configuring jemalloc or whether we need to switch to a different allocator altogether.

orlp commented 2 months ago

Ok, we figured out that the slowdown is caused by page faults that the default jemalloc options cause.

Setting the jemalloc option muzzy_decay_ms to -1, which disables MADV_DONTNEED entirely, solves the issue, but it might cause overly large memory usage statistics to be reported on some machines (since only MADV_FREE will be emitted and never MADV_DONTNEED).

Strangely enough, setting muzzy_decay_ms to a finite value such as 1000 (a one-second delay) still causes the problem, even when accesses are far less than a second apart. I reported the issue to jemalloc here: https://github.com/jemalloc/jemalloc/issues/2688.

ritchie46 commented 2 months ago

Can people try setting _RJEM_MALLOC_CONF="background_thread:true,dirty_decay_ms:500,muzzy_decay_ms:-1" in the environment before Polars is imported? This should resolve the issue. I am curious how memory behavior looks for your use cases.
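A minimal sketch of how this might be applied from within a Python session, assuming the variable only needs to be present before the Polars native library (and its bundled jemalloc) is first loaded; the variable name and value are taken verbatim from the comment above:

import os

# jemalloc reads this configuration when the Polars native library is loaded,
# so it must be set before the first `import polars`.
os.environ["_RJEM_MALLOC_CONF"] = (
    "background_thread:true,dirty_decay_ms:500,muzzy_decay_ms:-1"
)

import numpy as np
import polars as pl

arrp = pl.Series(np.random.randn(1_000_000))
_ = arrp - 1  # arithmetic should now be much closer to NumPy speed

Exporting the variable in the shell before launching Python should have the same effect, since either way it ends up in the process environment that jemalloc reads at initialization.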