Open akarpf opened 1 year ago
It turns out I completely misunderstood what apply was doing. I thought that apply with vectorize=True operates on the whole column at once. But, following the vaex logic, it applies the function chunk-wise. Since my whole dataset has more than 25 million rows, each chunk is a large enough sample to give results that are close to those for the whole column, but of course not exactly the same.
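To illustrate the effect described above, here is a minimal NumPy-only sketch. It does not call vaex; it only mimics chunk-wise evaluation by splitting an array with np.array_split, and the chunk count, the random data, and the helper name ecdf_values are assumptions made for illustration.

```python
import numpy as np

def ecdf_values(a):
    """Return the ECDF evaluated at every element of the 1-D array a."""
    v, w = np.unique(a, return_counts=True)
    y = np.cumsum(w) / a.size
    return y[np.searchsorted(v, a, side="right") - 1]

rng = np.random.default_rng(0)
data = rng.normal(size=100_000)

# ECDF computed over the whole column at once.
whole = ecdf_values(data)

# ECDF computed independently on each chunk, mimicking a chunk-wise apply.
# The chunk count of 4 is an arbitrary choice for illustration.
chunked = np.concatenate([ecdf_values(c) for c in np.array_split(data, 4)])

# The gap is a sampling effect: each chunk only sees its own rows.
max_diff = np.max(np.abs(whole - chunked))
print(f"max abs difference: {max_diff:.2e}")
```

The difference comes from each chunk ranking its values against its own rows only, so it is far larger than anything float64 rounding would produce.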
Therefore this issue could be closed. But before that: can somebody suggest a workaround for my problem?
I figure that, inside the function I want to apply, I could convert the Expression into an array, compute everything at once in memory, and then assign the data back to the DataFrame. But I guess this would defeat the purpose, since the column would then be in memory as well, no?
I thought about something like this (based on different methods to compute the ECDF):
import numpy as np

def ecdf(x):
    # x is a vaex Expression; x.values materializes the whole column in memory
    v, w = np.unique(x.values, return_counts=True)
    y = np.cumsum(w) / np.sum(w)
    tind = np.searchsorted(v, x.values, 'right') - 1
    return y[tind]
or, using .value_counts()
def ecdf_v(x):
    vc = x.value_counts()
    v = vc.index.values
    w = vc.values
    sw = np.sum(w)
    ax = v.argsort()
    v = v[ax]
    y = np.cumsum(w[ax])
    y = y / sw
    tind = np.searchsorted(v, x.values, 'right') - 1
    return y[tind]
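One possible way around the chunking problem, sketched here in plain NumPy under stated assumptions, is to split the work into two passes: first build a global lookup table of sorted unique values and cumulative frequencies, then evaluate each chunk against that shared table. The helper names global_ecdf_table and ecdf_chunk are hypothetical, and the chunks are simulated with np.array_split. In vaex the table could presumably come from value_counts() as above, with the second function being the one passed to apply, but that wiring is an assumption and is not tested here.

```python
import numpy as np

def global_ecdf_table(chunks):
    """Pass 1: merge per-chunk value counts into one sorted ECDF lookup table."""
    values, counts = [], []
    for c in chunks:
        v, w = np.unique(c, return_counts=True)
        values.append(v)
        counts.append(w)
    all_v = np.concatenate(values)
    all_w = np.concatenate(counts)
    # Merge values that occur in several chunks by summing their counts.
    v, inverse = np.unique(all_v, return_inverse=True)
    w = np.bincount(inverse, weights=all_w)
    return v, np.cumsum(w) / w.sum()

def ecdf_chunk(chunk, v, y):
    """Pass 2: evaluate the global ECDF on a single chunk."""
    return y[np.searchsorted(v, chunk, side="right") - 1]

rng = np.random.default_rng(0)
# Illustrative data with few distinct values, so the table stays small.
data = rng.integers(0, 1000, size=50_000).astype(float)
chunks = np.array_split(data, 5)

v, y = global_ecdf_table(chunks)
result = np.concatenate([ecdf_chunk(c, v, y) for c in chunks])

# Reference: the whole-column computation from ecdf() above, on a plain array.
rv, rw = np.unique(data, return_counts=True)
ry = np.cumsum(rw) / rw.sum()
reference = ry[np.searchsorted(rv, data, side="right") - 1]
print(np.allclose(result, reference))
```

The table needs one entry per distinct value, so this only saves memory when the column has far fewer distinct values than rows. For continuous data where nearly every value is unique, the table is about as large as the column itself.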
Description
Hi,
Thank you for this great software! It's a real game changer for my work!
My question/issue: I observe quantitative differences when I compare the results (of the same computation) of vaex with those of numpy. Those differences are relatively small, but I'd like to understand where they stem from or if I am doing something wrong.
I want to compute the ECDF of a vector. Since the computation is not compatible with vaex expressions, I resort to the apply method. (Ideas on how to vectorize the ECDF computation to make it compatible with vaex would of course be warmly welcome!)
The computation with vaex is lightning fast (almost a bit fishy, considering how long numpy takes), but unfortunately the results differ from those with numpy. With my real-world data the differences are more pronounced (~1.0e-03).
Am I doing something wrong? Is vaex using different floating points than numpy? Is it a bug?
Your comments/your help would be very much appreciated!
Software information