Open sslivkoff opened 1 year ago
I can't replicate your result. The polars sample is always faster than the two numpy ones. What versions are you using?
import polars as pl
import numpy as np
from codetiming import Timer
dsz = int(6e7)
ssz = int(1e7)
dat = pl.DataFrame(dict(x = np.random.standard_normal(dsz), y = range(dsz)))
with Timer(text = "Polars sample: {:0.4f} seconds"):
dat2 = dat.sample(ssz, shuffle = False)
# Polars sample: 0.7771 seconds
with Timer(text = "Numpy Sample: {:0.4f} seconds"):
sample = np.random.choice(np.arange(len(dat)), ssz, replace=False)
dat2 = dat[sample]
#Numpy Sample: 4.2366 seconds
with Timer(text = "Numpy Filter: {:0.4f} seconds"):
inds = np.random.choice(np.arange(len(dat)), ssz, replace=False)
filt = np.zeros(len(dat), dtype = bool)
filt[inds] = True
dat2 = dat.filter(filt)
#Numpy Filter: 6.1699 seconds
As for the sampling, it isn't doing any shuffling. But it is calling choose_multiple_fill
from the rand
package. This looks like the "Algorithm R" Reservoir Sampling algorithm (see: https://en.wikipedia.org/wiki/Reservoir_sampling). The comments to that function (in the external package source) state
/// Although the elements are selected randomly, the order of elements in
/// the buffer is neither stable nor fully random. If random ordering is
/// desired, shuffle the result.
And if you look at the algorithm you'll see that it starts out with n
items from the "front" of the sequence, then randomly replaces them (or not) with later elements without regard to order. So if i < j < n
and both i
and j
are selected, then they'll stay in order but otherwise there are no guarantees.
So, if you want to maintain the order when shuffle=False
it's probably fast enough and easiest to sort the sampled index. I think there are some other algorithms that explicitly can maintain order, as well as faster reservoir sampling algorithms but neither seems necessary here.
Rechunking the dataframe may solve this problem.
Problem description
I have a dataframe of shape =
(57_887_821, 2)
. I want a sample of 10M rows. Here are the performances of 3 different techniques:df.sample()
using
df[indices]
using
df.filter()
It would be nice if
df.sample()
was faster because it's simpler than the other methods. Indf.sample()
I think it might be slowed down by unwanted shuffles. The output is shuffled even whenshuffle=False