vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

Error running percentile_approx #385

Closed mpinkerton-oasis closed 3 years ago

mpinkerton-oasis commented 5 years ago

I am getting an error every time I run the percentile_approx function. Here is an example using the example data set, but I have seen the same error in all of my own tests:

In [7]: import vaex as vx
In [8]: ds = vx.example()
In [9]: ds.x
Out[9]: 
Expression = x
Length: 330,000 dtype: float64 (column)
---------------------------------------
 0  -0.777471
 1    3.77427
 2    1.37576
 3   -7.06738
 4   0.243441
   ...       
329995    3.76884
329996    9.17409
329997   -1.14041
329998   -14.2986
329999    10.5451

In [10]: ds.percentile_approx("x", 90)
/home/pinkerton/.virtualenvs/vaex_test/lib/python3.6/site-packages/vaex/dataframe.py:1279:  RuntimeWarning: divide by zero encountered in double_scalars
u = np.array((values - left_value) / (right_value - left_value))
/home/pinkerton/.virtualenvs/vaex_test/lib/python3.6/site-packages/vaex/dataframe.py:1283: RuntimeWarning: invalid value encountered in double_scalars
x = xleft + (xright - xleft) * u  # /2
Out[10]: nan

Here is my version information:

Python 3.6.8

vaex==2.0.2
vaex-arrow==0.3.5
vaex-astro==0.5.0
vaex-core==0.9.2
vaex-hdf5==0.5.4
vaex-server==0.2.1
vaex-viz==0.3.7
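
The two warnings in the traceback above point at a linear interpolation step. The following is a minimal sketch, not vaex's actual code, of one way a formula of that shape can produce both warnings and a nan result: the two bracketing values coincide, so the division yields an infinity, which is then multiplied by a zero-width interval. The function name `interpolate` and the sample numbers are assumptions made for illustration only.

```python
import numpy as np


def interpolate(value, left_value, right_value, xleft, xright):
    """Linear interpolation with the same shape as the two lines quoted in the warnings.

    Hypothetical helper for illustration; it is not part of vaex.
    """
    # Use NumPy scalars so that division by zero yields inf/nan
    # instead of raising ZeroDivisionError as plain Python floats would.
    value, left_value, right_value = map(np.float64, (value, left_value, right_value))
    xleft, xright = np.float64(xleft), np.float64(xright)
    # Silence the RuntimeWarnings that the original session printed.
    with np.errstate(divide="ignore", invalid="ignore"):
        u = (value - left_value) / (right_value - left_value)
        return xleft + (xright - xleft) * u


# Well-behaved case: distinct bracketing values give an ordinary result.
regular = interpolate(0.5, 0.0, 1.0, 2.0, 4.0)

# Degenerate case: left_value == right_value makes u infinite, and
# multiplying it by a zero-width interval (xright - xleft == 0) gives nan.
degenerate = interpolate(0.9, 1.0, 1.0, 2.0, 2.0)

print(regular, degenerate)
```

Whether this degenerate case is what actually happens inside percentile_approx is not established in this thread; the sketch only shows that the warning pair and the nan are consistent with it.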

JovanVeljanoski commented 5 years ago

Hi,

Am I correct in assuming that you are working on some kind of Linux distribution?

Thank you, Jovan.

mpinkerton-oasis commented 5 years ago

Yes, Ubuntu 18.04

maartenbreddels commented 5 years ago

Thanks for the report, we've been able to reproduce it. I hope we can fix it soon.

maartenbreddels commented 5 years ago

It seems the issue is Python 3.6 with NumPy 1.17; you could downgrade to 1.16. I don't know exactly what the issue is, but it does seem like it's not vaex's fault.

mpinkerton-oasis commented 5 years ago

Hi @maartenbreddels, thanks for looking into this. No problem for us to run with NumPy 1.16 for now.

metalicjames commented 3 years ago

This bug is still present for Python 3.9 on Linux with the latest Vaex (4.1.0) and NumPy (1.20.1) from pip. Downgrading to NumPy 1.16 is no longer a solution, as Astropy requires NumPy>=1.17. As a workaround, passing a list as percentage and putting the value you want in the second or a later position seems to calculate correctly.

For example, this results in nan:

p = df.percentile_approx(df.vals, percentage=0.001)

However, this results in [nan, 0.1637147]:

p = df.percentile_approx(df.vals, percentage=[0.001, 0.001])
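
The workaround described above could be wrapped in a small helper. This is only a sketch: the name `percentile_approx_workaround` is hypothetical, and it assumes the behaviour reported in this comment, namely that the first entry of the returned list may be nan while the second is computed correctly.

```python
def percentile_approx_workaround(df, expression, percentage):
    """Request the same percentage twice and return the second result.

    Hypothetical helper, not part of vaex. It relies on the behaviour
    reported in this thread: with a list of percentages, the entry at
    index 1 came back correct while the entry at index 0 was nan.
    """
    result = df.percentile_approx(expression, percentage=[percentage, percentage])
    return result[1]


# Usage, assuming df is a vaex DataFrame with a column named vals:
# p = percentile_approx_workaround(df, df.vals, 0.001)
```

The helper only reshuffles the call from the comment; it does not address the underlying cause.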

JovanVeljanoski commented 3 years ago

My magic NumPy version is 1.19.5. But this is a NumPy issue, not a vaex or Python issue. You can verify this by trying the np.percentile function.
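
A quick way to run that check might look like the sketch below. It uses synthetic normally distributed data rather than the vaex example dataset, and the array size and seed are arbitrary assumptions. If np.percentile returned nan here, that would suggest the problem lies in the installed NumPy build rather than in vaex.

```python
import numpy as np

# Synthetic stand-in for a float64 column; not the vaex example data.
values = np.random.RandomState(0).normal(size=330000)

# The call suggested in the comment above.
p90 = np.percentile(values, 90)

# Sanity check: about 90% of the values should lie at or below p90.
fraction_below = np.mean(values <= p90)

print("numpy", np.__version__)
print("90th percentile:", p90)
print("fraction at or below:", fraction_below)
```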