[BUG-REPORT] Strange behavior - vaex vs. numpy - floating point error?

vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀

MIT License


Description

Hi,

Thank you for this great software! It's a real game changer for my work!

My question/issue: I observe quantitative differences when I compare the results (of the same computation) of vaex with those of numpy. Those differences are relatively small, but I'd like to understand where they stem from or if I am doing something wrong.

I want to compute the ECDF (empirical cumulative distribution function) of a vector. Since the computation is not compatible with vaex expressions, I resort to the apply method. (Ideas on how to vectorize the ECDF computation to make it compatible with vaex would of course be warmly welcome!)

The computation with vaex is lightning fast (almost a bit fishy, considering how long numpy takes), but unfortunately the results differ from those with numpy. With my real-world data the differences are more pronounced (~1.0e-03).

Am I doing something wrong? Is vaex using different floating points than numpy? Is it a bug?

Your comments/your help would be very much appreciated!

import numpy as np
import vaex

# function to compute the ecdf
def ecdf(x, side='right'):

    v = x.copy()
    v.sort()
    nobs = len(v)
    y = np.linspace(1./nobs, 1, nobs)

    sx = np.hstack((np.array([-np.inf]), v))
    sy = np.hstack((np.array([0]), y))

    tind = np.searchsorted(sx, x, side) - 1
    return sy[tind]

x = np.random.rand(10_000_000,)

df = vaex.from_arrays(x=x)

df['res'] = df.x.apply(ecdf, vectorize=True)
res_vaex = df['res'].evaluate()
res_np = ecdf(x)

np.all(res_vaex == res_np)
# False

np.mean(res_vaex - res_np)
# 3.949999999965477e-06
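
One possible explanation worth ruling out, offered as an assumption rather than a confirmed diagnosis: if `apply` with `vectorize=True` hands the function one chunk of the column at a time instead of the full array, then `ecdf` ranks every value against its own chunk only. That would account for both the speed and the small deviations, without any difference in floating-point types. The sketch below simulates this with plain NumPy; the simplified `ecdf` and the `chunk_size` value are illustrative and are not taken from vaex.

```python
import numpy as np

def ecdf(x):
    """ECDF evaluated at each element of x, relative to x itself."""
    v = np.sort(x)
    # Number of sorted values <= each element, divided by the sample size.
    return np.searchsorted(v, x, side="right") / len(v)

rng = np.random.default_rng(0)
x = rng.random(100_000)

# Reference: ECDF computed over the whole array at once.
full = ecdf(x)

# Hypothetical chunked evaluation: the function only ever sees one slice,
# so each value is ranked against its own chunk, not the full sample.
chunk_size = 10_000  # illustrative value, not vaex's actual chunk size
chunked = np.concatenate(
    [ecdf(x[i:i + chunk_size]) for i in range(0, len(x), chunk_size)]
)

print(np.array_equal(full, chunked))   # are the two results identical?
print(np.abs(full - chunked).max())    # largest per-element deviation
```

If the real data shows the same pattern (differences that shrink when the function is given larger slices), chunked evaluation is the likely cause; if not, this assumption can be discarded.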

Software information

Vaex version: 'vaex': '4.16.0', 'vaex-core': '4.16.1', 'vaex-viz': '0.5.4', 'vaex-hdf5': '0.14.1', 'vaex-server': '0.8.1', 'vaex-astro': '0.9.3', 'vaex-jupyter': '0.8.1', 'vaex-ml': '0.18.1'
Vaex was installed via: pip
OS: Ubuntu 18.04

def ecdf_v(x):
    vc = x.value_counts()
    v = vc.index.values
    w = vc.values
    sw = np.sum(w)
    ax = v.argsort()
    v = v[ax]
    y = np.cumsum(w[ax])
    y = y / sw
    tind = np.searchsorted(v, x.values, 'right') - 1
    return y[tind]
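
The `ecdf_v` function above appears to expect a pandas Series (it relies on `value_counts`, `.index` and `.values`). As a rough NumPy-only analogue of the same idea, the sketch below uses `np.unique` to get the distinct values, their counts, and the position of each input element among them. The names `ecdf_unique` and `ecdf_sorted` are made up for this illustration.

```python
import numpy as np

def ecdf_unique(x):
    """ECDF via distinct values and their counts (NumPy analogue of ecdf_v)."""
    x = np.ravel(np.asarray(x))
    # v: sorted distinct values, inv: index of each x in v, w: multiplicities.
    v, inv, w = np.unique(x, return_inverse=True, return_counts=True)
    # Cumulative share of observations <= each distinct value.
    y = np.cumsum(w) / w.sum()
    return y[inv]

def ecdf_sorted(x):
    """Reference ECDF: count of values <= x, divided by the sample size."""
    x = np.ravel(np.asarray(x))
    return np.searchsorted(np.sort(x), x, side="right") / x.size

x = np.array([0.3, 0.1, 0.3, 0.9, 0.5])
print(ecdf_unique(x))
print(np.allclose(ecdf_unique(x), ecdf_sorted(x)))
```

Both versions still need the complete sample to produce correct ranks, so they only agree with a whole-array result when they are called on the whole array.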


[BUG-REPORT] Strange behavior - vaex vs. numpy - floating point error? #2347