pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.29k stars 1.96k forks source link

Faster Sample Performance #9167

Open sslivkoff opened 1 year ago

sslivkoff commented 1 year ago

Problem description

I have a dataframe of shape = (57_887_821, 2). I want a sample of 10M rows. Here are the performances of 3 different techniques:

  1. df.sample()

    sample = df.sample(10_000_000, shuffle=False)
    # takes 10.7 seconds
  2. using df[indices]

    indices = np.random.choice(np.arange(len(df)), 10_000_000, replace=False)
    df[indices]
    # takes 4.9 seconds
  3. using df.filter()

    indices = np.random.choice(np.arange(max_size), 10_000_000, replace=False)
    filter = np.zeros(len(contract_df), dtype=bool)
    filter[indices] = True
    df.filter(filter)
    # takes 474 ms

It would be nice if df.sample() was faster because it's simpler than the other methods. In df.sample() I think it might be slowed down by unwanted shuffles. The output is shuffled even when shuffle=False

magarick commented 1 year ago

I can't replicate your result. The polars sample is always faster than the two numpy ones. What versions are you using?

import polars as pl
import numpy as np
from codetiming import Timer
dsz = int(6e7)
ssz = int(1e7)

dat = pl.DataFrame(dict(x = np.random.standard_normal(dsz), y = range(dsz)))

with Timer(text = "Polars sample: {:0.4f} seconds"):
    dat2 = dat.sample(ssz, shuffle = False)
# Polars sample: 0.7771 seconds

with Timer(text = "Numpy Sample: {:0.4f} seconds"):
    sample = np.random.choice(np.arange(len(dat)), ssz, replace=False)
    dat2 = dat[sample]
#Numpy Sample: 4.2366 seconds

with Timer(text = "Numpy Filter: {:0.4f} seconds"):
    inds = np.random.choice(np.arange(len(dat)), ssz, replace=False)
    filt = np.zeros(len(dat), dtype = bool)    
    filt[inds] = True
    dat2 = dat.filter(filt)
#Numpy Filter: 6.1699 seconds

As for the sampling, it isn't doing any shuffling. But it is calling choose_multiple_fill from the rand package. This looks like the "Algorithm R" Reservoir Sampling algorithm (see: https://en.wikipedia.org/wiki/Reservoir_sampling). The comments to that function (in the external package source) state

    /// Although the elements are selected randomly, the order of elements in
    /// the buffer is neither stable nor fully random. If random ordering is
    /// desired, shuffle the result.

And if you look at the algorithm you'll see that it starts out with n items from the "front" of the sequence, then randomly replaces them (or not) with later elements without regard to order. So if i < j < n and both i and j are selected, then they'll stay in order but otherwise there are no guarantees.

So, if you want to maintain the order when shuffle=False it's probably fast enough and easiest to sort the sampled index. I think there are some other algorithms that explicitly can maintain order, as well as faster reservoir sampling algorithms but neither seems necessary here.

JSteilberg commented 2 months ago

Rechunking the dataframe may solve this problem.