Slowdown of up to 2x compared to NumPy / Pandas in division operator

liufeimath commented 2 weeks ago

Checks

[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

I have the following test code with time measures:

import time
import polars as pl
import pandas as pd
import numpy as np

n = 10_000_000
np_arrs = {"x": np.random.rand(n), "y": np.random.rand(n)}
pl_df = pl.DataFrame(np_arrs)
pd_df = pd.DataFrame(np_arrs)

t0 = time.time()
z = np_arrs["x"] / np_arrs["y"]
t1 = time.time()
print(f"time cost: {round((t1 - t0) * 1000)} ms")  # 20 ms

t0 = time.time()
z = pl_df["x"] / pl_df["y"]
t1 = time.time()
print(f"time cost: {round((t1 - t0) * 1000)} ms")  # 42 ms

t0 = time.time()
z = pd_df["x"] / pd_df["y"]
t1 = time.time()
print(f"time cost: {round((t1 - t0) * 1000)} ms")  # 18 ms

It's bizarre to me that polars is about 2x slow. Does anyone know why?

My test environments:

computer: MBP 2020, 13inch, 2GHz 4-core i5, 16GB mem
python: 3.12.3
numpy: 1.26.4
pandas: 2.2.2
polars: 1.0.0

I have this question originally posted on stackoverflow. The user jqurious pointed out it could be faster by (and it did): z = pl_df.lazy().select(pl.col.x / pl.col.y).collect(streaming=True) without streaming=True, it'll fall back to the old 40ms~50ms time cost.

Hence the question: what should I do to "make it right" in such basic operations?

Log output

No response

Issue description

Polars is 2x slower than pandas in basic column-wise operations (two-column division in this specific test case).

Expected behavior

Polars should have comparable speed v.s. pandas and numpy in such simple operations.

Installed versions

``` --------Version info--------- Polars: 1.0.0 Index type: UInt32 Platform: macOS-13.6.7-x86_64-i386-64bit Python: 3.12.3 (v3.12.3:f6650f9ad7, Apr 9 2024, 08:18:48) [Clang 13.0.0 (clang-1300.0.29.30)] ----Optional dependencies---- adbc_driver_manager: cloudpickle: connectorx: deltalake: fastexcel: fsspec: 2024.3.1 gevent: great_tables: hvplot: matplotlib: nest_asyncio: numpy: 1.26.4 openpyxl: pandas: 2.2.2 pyarrow: 16.1.0 pydantic: 2.7.4 pyiceberg: sqlalchemy: torch: 2.2.2 xlsx2csv: xlsxwriter: ```

orlp commented 2 weeks ago

Could you print the output of the following? A single iteration is very noisy.

from timeit import timeit
import polars as pl
import pandas as pd
import numpy as np

n = 10_000_000
np_arrs = {"x": np.random.rand(n), "y": np.random.rand(n)}
pl_df = pl.DataFrame(np_arrs)
pd_df = pd.DataFrame(np_arrs)

print("numpy", timeit(lambda: np_arrs["x"] / np_arrs["y"], number=100))
print("polars", timeit(lambda: pl_df["x"] / pl_df["y"], number=100))
print("polars (lazy query)", timeit(lambda: pl_df.lazy().select(pl.col.x / pl.col.y).collect(streaming=True), number=100))
print("pandas", timeit(lambda: pd_df["x"] / pd_df["y"], number=100))

liufeimath commented 2 weeks ago

@orlp Sure. Here's what I see:

numpy 1.7664612449007109
polars 3.8596145160263404
polars (lazy query) 1.6810565140331164
pandas 1.7963620449882

orlp commented 2 weeks ago

@liufeimath We can't reproduce on Apple M2 but can on an Intel laptop, we'll take a look at it.

liufeimath commented 2 weeks ago

@orlp Thanks! Another info worth noting: I eyeballed the CPU occupancy in osx activity monitor (after changed number (of measures) to 1000 so that I have longer time for a stable eyeball results):

np costs 100% CPU consistently (1 core)
pl default costs 100% CPU consistently
pl lazy + stream costs 400% CPU consistently (so it uses all my cores)
pd costs 100% CPU consistently (which behaves like a simple wrapper over numpy in this plain test case)

I have another intel laptop (linux) showing similar results, i.e. single core pl costs about 2x time v.s. np. Can it be the boolean indicator array operation cost? But that should be simd bit operations and very fast right? Also it looks like the lazy+stream performance boost comes from multithreading.

Another related guess: my computer is so old that the cpu simd instruction sets are not-at-all up to date. So it has to use some naive bit operations for the additional boolean indicator array, which is as slow as f64 divisions (though that sounds really slow).

coastalwhite commented 2 weeks ago

We looked at this a bit today and I can reproduce this on my machine. We haven't found the exact reason yet. We'll do some more digging.

s-banach commented 2 weeks ago

For reference on my zen4 processor:

numpy 1.1182205999502912
polars 0.7442328999750316
polars (lazy query) 0.6784979000221938
pandas 0.7861381999682635

coastalwhite commented 2 weeks ago

It seems as if for some reason the Polars implementation is causing a whole load of page faults on Intel processors. I am not quite sure why. If we completely isolate the loops:

use polars::prelude::NamedFromOwned;
use polars::series::Series;
use rand::Rng;

const ARRAY_SIZE: usize = 10_000_000;
const NUM_LOOPS: usize = 500;

#[inline(never)]
fn run_loops(a: &Series, b: &Series) {
    let start_time = std::time::SystemTime::now();

    for _ in 0..NUM_LOOPS {
        use std::ops::Div;
        let result = a.div(b);
        let result = result.unwrap();
        _ = std::hint::black_box(result);
    }

    let duration = start_time.elapsed().unwrap();

    println!("time: {}s", duration.as_secs_f32());
}

fn main() {
    let mut a: Vec<f64> = Vec::with_capacity(ARRAY_SIZE);
    let mut b: Vec<f64> = Vec::with_capacity(ARRAY_SIZE);

    unsafe {
        a.set_len(ARRAY_SIZE);
        b.set_len(ARRAY_SIZE);
    }

    let mut rng = rand::thread_rng();

    rng.fill(&mut a[..]);
    rng.fill(&mut b[..]);

    let a = Series::from_vec("a", a);
    let b = Series::from_vec("b", b);

    run_loops(&a, &b);
}

sudo perf stat -e dTLB-load-misses,page-faults ./target/release-with-debug/polars-divloop
time: 21.96687s

 Performance counter stats for './target/release-with-debug/polars-divloop':

        22,569,652      cpu_atom/dTLB-load-misses/                                              (0.34%)
         1,505,594      cpu_core/dTLB-load-misses/                                              (99.66%)
         9,805,232      page-faults

      22.154435210 seconds time elapsed

       6.539028000 seconds user
      15.522541000 seconds sys

And as reference:

use std::mem::MaybeUninit;

const SIZE: usize = 10_000_000;
const REPEATS: usize = 500;

type T = f64;
type L = T;
type R = T;
type O = T;

fn op(l: T, r: T) -> T {
    l / r
}

#[inline(never)]
unsafe fn polars(left: *const L, right: *const R, out: *mut O, len: usize) {
    // This function unrolls manually a result of issue #17414. Here, OLP and GB found that the
    // loop was not properly unrolling on Intel AVX2. We tried several code unrolling techniques
    // and this variant generated good assembly.
    //
    // These blocksizes are based on AVX2 with 16 blocksize for f64, it seems to unroll the loop 4
    // times.
    let block_size = match std::mem::size_of::<O>() {
        0..=1 => 64,
        2 => 32,
        3..=4 => 16,
        5..=8 => 8,
        9..=16 => 4,
        _ => 1,
    };

    // We use a temporary stack buffer so the compiler can read block_size values in a row without
    // worrying our write-back to out overlaps with those reads.
    //
    // This buffer needs to have the size of the maximum block_size.
    let mut buf: [MaybeUninit<T>; 64] = [(); 64].map(|_| MaybeUninit::uninit());
    let mut offset = 0;
    while offset + block_size <= len {
        let buf_chunk = &mut buf[0..block_size];
        let left_chunk = core::slice::from_raw_parts(left.add(offset), block_size);
        let right_chunk = core::slice::from_raw_parts(right.add(offset), block_size);
        for i in 0..block_size {
            buf_chunk[i] = MaybeUninit::new(op(left_chunk[i], right_chunk[i]));
        }

        core::ptr::copy_nonoverlapping(buf_chunk.as_ptr().cast(), out.add(offset), block_size);
        offset += block_size;
    }

    while offset < len {
        let ret = op(left.add(offset).read(), right.add(offset).read());
        out.add(offset).write(ret);
        offset += 1;
    }
}

fn main() {
    let mut left: Vec<T> = Vec::with_capacity(SIZE);
    let mut right: Vec<T> = Vec::with_capacity(SIZE);
    let mut out: Vec<T> = Vec::with_capacity(SIZE);

    for i in 0..SIZE {
        left.push(((i * 23) % 101) as T);
        right.push(((i * 7) % 31) as T);
    }

    let start = std::time::SystemTime::now();

    for _ in 0..REPEATS {
        unsafe { polars(left.as_ptr(), right.as_ptr(), out.as_mut_ptr(), SIZE) };
        unsafe { out.set_len(SIZE) };
        std::hint::black_box(&out);
    }

    let elapsed = start.elapsed().unwrap();

    println!("time: {:.02}s", elapsed.as_secs_f32());
}

sudo perf stat -e dTLB-load-misses,page-faults ./target/release/divloop
time: 7.07s

 Performance counter stats for './target/release/divloop':

         8,666,881      cpu_atom/dTLB-load-misses/                                              (0.16%)
         5,056,767      cpu_core/dTLB-load-misses/                                              (99.84%)
            58,669      page-faults

       7.199732690 seconds time elapsed

       7.052191000 seconds user
       0.119500000 seconds sys

You can see the that whole difference is the time spend in sys so probably resolving the page faults. Very, very peculiar.

coastalwhite commented 2 weeks ago

Quite a rabbit-hole later and we are pretty confident in a theory as to why this is happening, but we don't really have a solution to the problem.

Since this is such a large array operation, the hardware prefetchers are very, very important to performing this fast. It seems, however, that Intel chips don't perform that well at prefetching for three simultaneous linear scans. ARM and AMD seem to be better in that respect. We also don't currently use huge pages, which exacerbates this problem and is probably why NumPy and Pandas faster here.

This seems to apply to most operators.

div fast: 7.501306s
div slow: 19.733433s
mod fast: 50.25861s
mod slow: 66.506935s
mul fast: 7.439692s
mul slow: 18.576355s
add fast: 7.396255s
add slow: 18.637745s

fast uses a third output buffer, which is cached. slow uses a third output buffer that is different each iteration.

Again, some of this is from experimentation and some is from speculation.

I am leaving this issue open as a reference, but I am not 100% sure we are every going to resolve this apart from possibly adding huge pages in the future.

pola-rs / polars