vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Strange behavior - vaex vs. numpy - floating point error? #2347

Open akarpf opened 1 year ago

akarpf commented 1 year ago

Description

Hi,

Thank you for this great software! It's a real game changer for my work!

My question/issue: I observe quantitative differences when I compare the results (of the same computation) of vaex with those of numpy. Those differences are relatively small, but I'd like to understand where they stem from or if I am doing something wrong.

I want to compute the ECDF (empirical cumulative distribution function) of a vector. Since the computation is not compatible with vaex expressions, I resort to the apply method. (Ideas on how to vectorize the ECDF computation to make it compatible with vaex would of course be warmly welcome!)

The computation with vaex is lightning fast (almost a bit fishy, considering how long numpy takes), but unfortunately the results differ from those of numpy. With my real-world data the differences are more pronounced (~1.0e-03).

Am I doing something wrong? Is vaex using a different floating-point precision than numpy? Is it a bug?

Your comments/your help would be very much appreciated!

import numpy as np
import vaex

# function to compute the ecdf
def ecdf(x, side='right'):

    v = x.copy()
    v.sort()
    nobs = len(v)
    y = np.linspace(1./nobs, 1, nobs)

    sx = np.hstack((np.array([-np.inf]), v))
    sy = np.hstack((np.array([0]), y))

    tind = np.searchsorted(sx, x, side) - 1
    return sy[tind]

x = np.random.rand(10_000_000,)

df = vaex.from_arrays(x=x)

df['res'] = df.x.apply(ecdf, vectorize=True)
res_vaex = df['res'].evaluate()
res_np = ecdf(x)

np.all(res_vaex == res_np)
# False

np.mean(res_vaex - res_np)
# 3.949999999965477e-06


akarpf commented 1 year ago

It turns out I completely misunderstood what apply was doing. I thought that apply with vectorize=True operates on the whole column at once. But, following the vaex logic, it applies the function chunk-wise, so each call computes the ECDF of one chunk only. Since my whole dataset has more than 25 million rows, each chunk is a big enough sample to deliver results that are close to the results for the whole column, but of course not exactly the same.
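The chunk-wise effect described above can be reproduced without vaex. The following is a minimal NumPy-only sketch: `ecdf_chunked` and the chunk size of 10,000 are hypothetical stand-ins for illustration, not what vaex does internally, and the actual chunk boundaries vaex uses may differ.

```python
import numpy as np

def ecdf(x, side="right"):
    # Same ECDF as in the original report: rank of each value within x.
    v = np.sort(x)
    nobs = len(v)
    y = np.linspace(1.0 / nobs, 1, nobs)
    sx = np.hstack(([-np.inf], v))
    sy = np.hstack(([0.0], y))
    return sy[np.searchsorted(sx, x, side) - 1]

def ecdf_chunked(x, chunk_size):
    # Hypothetical helper: apply ecdf to each chunk independently,
    # mimicking a function that only ever sees one chunk at a time.
    parts = [ecdf(x[i:i + chunk_size]) for i in range(0, len(x), chunk_size)]
    return np.concatenate(parts)

rng = np.random.default_rng(0)
x = rng.random(100_000)

full = ecdf(x)                                 # ECDF over the whole column
chunked = ecdf_chunked(x, chunk_size=10_000)   # ECDF per chunk

# Each chunk is a sample of the same distribution, so the values are
# close to the global ECDF but not identical.
max_abs_diff = np.max(np.abs(full - chunked))
print(np.array_equal(full, chunked), max_abs_diff)
```

With a single chunk covering the whole array, the two results coincide, which is consistent with the discrepancy coming from chunking rather than from floating-point precision.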

Therefore this issue could be closed. But before that: can somebody suggest a workaround for my problem?

I figure that, inside the function I want to apply, I could convert the Expression into an array, compute everything at once in memory, and then assign the data back to the DataFrame. But I guess this would defeat the purpose, since the whole column would then be in memory as well, no?

I thought about something like this (based on different methods to compute the ECDF):

def ecdf(x):
    v, w = np.unique(x.values, return_counts=True)
    y = np.cumsum(w) / np.sum(w)
    tind = np.searchsorted(v, x.values, 'right') - 1 
    return y[tind]

or, using .value_counts()

def ecdf_v(x):
    vc = x.value_counts()
    v = vc.index.values
    w = vc.values
    sw = np.sum(w)
    ax = v.argsort()

    v = v[ax]
    y = np.cumsum(w[ax])
    y = y / sw

    tind = np.searchsorted(v, x.values, 'right') - 1 
    return y[tind]
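One possible direction, sketched here with NumPy only: if the sorted values of the whole column are computed once, the ECDF of any single value is just its rank in that sorted array divided by the total count. That lookup is element-wise, so it gives the same answer whether it is called on the full array or chunk by chunk. The names `make_ecdf` and `lookup` are hypothetical, and the sketch assumes the sorted copy of the column fits in memory; whether a closure like this can be passed to vaex's apply has not been verified here.

```python
import numpy as np

def make_ecdf(reference):
    # Sort the full column once; this is the only step that needs
    # to see all the data at the same time.
    v = np.sort(reference)
    n = len(v)

    def lookup(chunk):
        # Fraction of reference values <= each element of chunk.
        # Depends only on the global sorted array, not on the chunk.
        return np.searchsorted(v, chunk, side="right") / n

    return lookup

rng = np.random.default_rng(1)
x = rng.random(50_000)

lookup = make_ecdf(x)
whole = lookup(x)

# Evaluate in chunks of 7,000 rows and stitch the pieces together.
pieces = [lookup(x[i:i + 7_000]) for i in range(0, len(x), 7_000)]
chunked = np.concatenate(pieces)

print(np.array_equal(whole, chunked))
```

The values may differ from the np.linspace-based version in the last floating-point digit, since the rank is divided by n directly instead of being read from a precomputed grid.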