Open liufeimath opened 2 weeks ago
Could you print the output of the following? A single iteration is very noisy.
from timeit import timeit
import polars as pl
import pandas as pd
import numpy as np
n = 10_000_000
np_arrs = {"x": np.random.rand(n), "y": np.random.rand(n)}
pl_df = pl.DataFrame(np_arrs)
pd_df = pd.DataFrame(np_arrs)
print("numpy", timeit(lambda: np_arrs["x"] / np_arrs["y"], number=100))
print("polars", timeit(lambda: pl_df["x"] / pl_df["y"], number=100))
print("polars (lazy query)", timeit(lambda: pl_df.lazy().select(pl.col.x / pl.col.y).collect(streaming=True), number=100))
print("pandas", timeit(lambda: pd_df["x"] / pd_df["y"], number=100))
@orlp Sure. Here's what I see:
numpy 1.7664612449007109
polars 3.8596145160263404
polars (lazy query) 1.6810565140331164
pandas 1.7963620449882
@liufeimath We can't reproduce on Apple M2 but can on an Intel laptop, we'll take a look at it.
@orlp Thanks! Another info worth noting: I eyeballed the CPU occupancy in osx activity monitor (after changed number (of measures) to 1000 so that I have longer time for a stable eyeball results):
I have another intel laptop (linux) showing similar results, i.e. single core pl costs about 2x time v.s. np. Can it be the boolean indicator array operation cost? But that should be simd bit operations and very fast right? Also it looks like the lazy+stream performance boost comes from multithreading.
Another related guess: my computer is so old that the cpu simd instruction sets are not-at-all up to date. So it has to use some naive bit operations for the additional boolean indicator array, which is as slow as f64 divisions (though that sounds really slow).
We looked at this a bit today and I can reproduce this on my machine. We haven't found the exact reason yet. We'll do some more digging.
For reference on my zen4 processor:
numpy 1.1182205999502912
polars 0.7442328999750316
polars (lazy query) 0.6784979000221938
pandas 0.7861381999682635
It seems as if for some reason the Polars implementation is causing a whole load of page faults on Intel processors. I am not quite sure why. If we completely isolate the loops:
use polars::prelude::NamedFromOwned;
use polars::series::Series;
use rand::Rng;
const ARRAY_SIZE: usize = 10_000_000;
const NUM_LOOPS: usize = 500;
#[inline(never)]
fn run_loops(a: &Series, b: &Series) {
let start_time = std::time::SystemTime::now();
for _ in 0..NUM_LOOPS {
use std::ops::Div;
let result = a.div(b);
let result = result.unwrap();
_ = std::hint::black_box(result);
}
let duration = start_time.elapsed().unwrap();
println!("time: {}s", duration.as_secs_f32());
}
fn main() {
let mut a: Vec<f64> = Vec::with_capacity(ARRAY_SIZE);
let mut b: Vec<f64> = Vec::with_capacity(ARRAY_SIZE);
unsafe {
a.set_len(ARRAY_SIZE);
b.set_len(ARRAY_SIZE);
}
let mut rng = rand::thread_rng();
rng.fill(&mut a[..]);
rng.fill(&mut b[..]);
let a = Series::from_vec("a", a);
let b = Series::from_vec("b", b);
run_loops(&a, &b);
}
sudo perf stat -e dTLB-load-misses,page-faults ./target/release-with-debug/polars-divloop
time: 21.96687s
Performance counter stats for './target/release-with-debug/polars-divloop':
22,569,652 cpu_atom/dTLB-load-misses/ (0.34%)
1,505,594 cpu_core/dTLB-load-misses/ (99.66%)
9,805,232 page-faults
22.154435210 seconds time elapsed
6.539028000 seconds user
15.522541000 seconds sys
And as reference:
use std::mem::MaybeUninit;
const SIZE: usize = 10_000_000;
const REPEATS: usize = 500;
type T = f64;
type L = T;
type R = T;
type O = T;
fn op(l: T, r: T) -> T {
l / r
}
#[inline(never)]
unsafe fn polars(left: *const L, right: *const R, out: *mut O, len: usize) {
// This function unrolls manually a result of issue #17414. Here, OLP and GB found that the
// loop was not properly unrolling on Intel AVX2. We tried several code unrolling techniques
// and this variant generated good assembly.
//
// These blocksizes are based on AVX2 with 16 blocksize for f64, it seems to unroll the loop 4
// times.
let block_size = match std::mem::size_of::<O>() {
0..=1 => 64,
2 => 32,
3..=4 => 16,
5..=8 => 8,
9..=16 => 4,
_ => 1,
};
// We use a temporary stack buffer so the compiler can read block_size values in a row without
// worrying our write-back to out overlaps with those reads.
//
// This buffer needs to have the size of the maximum block_size.
let mut buf: [MaybeUninit<T>; 64] = [(); 64].map(|_| MaybeUninit::uninit());
let mut offset = 0;
while offset + block_size <= len {
let buf_chunk = &mut buf[0..block_size];
let left_chunk = core::slice::from_raw_parts(left.add(offset), block_size);
let right_chunk = core::slice::from_raw_parts(right.add(offset), block_size);
for i in 0..block_size {
buf_chunk[i] = MaybeUninit::new(op(left_chunk[i], right_chunk[i]));
}
core::ptr::copy_nonoverlapping(buf_chunk.as_ptr().cast(), out.add(offset), block_size);
offset += block_size;
}
while offset < len {
let ret = op(left.add(offset).read(), right.add(offset).read());
out.add(offset).write(ret);
offset += 1;
}
}
fn main() {
let mut left: Vec<T> = Vec::with_capacity(SIZE);
let mut right: Vec<T> = Vec::with_capacity(SIZE);
let mut out: Vec<T> = Vec::with_capacity(SIZE);
for i in 0..SIZE {
left.push(((i * 23) % 101) as T);
right.push(((i * 7) % 31) as T);
}
let start = std::time::SystemTime::now();
for _ in 0..REPEATS {
unsafe { polars(left.as_ptr(), right.as_ptr(), out.as_mut_ptr(), SIZE) };
unsafe { out.set_len(SIZE) };
std::hint::black_box(&out);
}
let elapsed = start.elapsed().unwrap();
println!("time: {:.02}s", elapsed.as_secs_f32());
}
sudo perf stat -e dTLB-load-misses,page-faults ./target/release/divloop
time: 7.07s
Performance counter stats for './target/release/divloop':
8,666,881 cpu_atom/dTLB-load-misses/ (0.16%)
5,056,767 cpu_core/dTLB-load-misses/ (99.84%)
58,669 page-faults
7.199732690 seconds time elapsed
7.052191000 seconds user
0.119500000 seconds sys
You can see the that whole difference is the time spend in sys
so probably resolving the page faults. Very, very peculiar.
Quite a rabbit-hole later and we are pretty confident in a theory as to why this is happening, but we don't really have a solution to the problem.
Since this is such a large array operation, the hardware prefetchers are very, very important to performing this fast. It seems, however, that Intel chips don't perform that well at prefetching for three simultaneous linear scans. ARM and AMD seem to be better in that respect. We also don't currently use huge pages, which exacerbates this problem and is probably why NumPy and Pandas faster here.
This seems to apply to most operators.
div fast: 7.501306s
div slow: 19.733433s
mod fast: 50.25861s
mod slow: 66.506935s
mul fast: 7.439692s
mul slow: 18.576355s
add fast: 7.396255s
add slow: 18.637745s
fast
uses a third output buffer, which is cached. slow
uses a third output buffer that is different each iteration.
Again, some of this is from experimentation and some is from speculation.
I am leaving this issue open as a reference, but I am not 100% sure we are every going to resolve this apart from possibly adding huge pages in the future.
Checks
Reproducible example
I have the following test code with time measures:
It's bizarre to me that polars is about 2x slow. Does anyone know why?
My test environments:
I have this question originally posted on stackoverflow. The user jqurious pointed out it could be faster by (and it did):
z = pl_df.lazy().select(pl.col.x / pl.col.y).collect(streaming=True)
withoutstreaming=True
, it'll fall back to the old 40ms~50ms time cost.Hence the question: what should I do to "make it right" in such basic operations?
Log output
No response
Issue description
Polars is 2x slower than pandas in basic column-wise operations (two-column division in this specific test case).
Expected behavior
Polars should have comparable speed v.s. pandas and numpy in such simple operations.
Installed versions