vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] median_approx and percentile_approx return nan instead of actual value #2230

Open abianco88 opened 1 year ago

abianco88 commented 1 year ago

Description

When using median_approx or percentile_approx to calculate the median or a percentile of a Vaex data frame or one of its columns, the result is nan instead of the actual value of the statistic. I'm not sure if this might be related to #1304. See, for instance, the behavior of the median approximation function illustrated below with the Vaex example dataset.

import vaex

vdf = vaex.example()

v_med_x = vdf.median_approx('x')
v_med_xy = vdf.median_approx(['x', 'y'])
v_med_df = vdf.median_approx(vdf.get_column_names())

pdf = vdf.to_pandas_df()

p_med_x = pdf['x'].median()
p_med_xy = pdf[['x', 'y']].median()
p_med_df = pdf.median()

print(f"Vaex results:\n"
      f"\tMedian of 'x':\t\t{v_med_x}\n"
      f"\tMedian of 'x' and 'y':\t{v_med_xy}\n"
      f"\tMedian of all df:\t{v_med_df}\n\n"
      f"Pandas results:\n"
      f"\tMedian of 'x':\t\t{p_med_x}\n"
      f"\tMedian of 'x' and 'y':\t{list(p_med_xy)}\n"
      f"\tMedian of all df:\t{list(p_med_df)}")

This little program produces the output below, where one can see that the Vaex medians are all nan and differ from those reported by pandas (which exist and are finite). Could there be an issue with the interpolation algorithm used by Vaex to estimate these statistics? Is there a way to get the median/percentile with Vaex at the moment?

Vaex results:
        Median of 'x':          nan
        Median of 'x' and 'y':  [nan nan]
        Median of all df:       [nan nan nan nan nan nan nan nan nan nan nan]

Pandas results:
        Median of 'x':          -0.05056469142436981
        Median of 'x' and 'y':  [-0.05056469142436981, -0.032549407333135605]
        Median of all df:       [16.0, -0.05056469142436981, -0.032549407333135605, -0.007299995049834251, 0.3338821530342102, 2.5303449630737305, 0.5708029270172119, -105428.9765625, 752.062744140625, -164.71417236328125, -1.6478899717330933]

Software information

JovanVeljanoski commented 1 year ago

Please see https://github.com/vaexio/vaex/issues/1318

abianco88 commented 1 year ago

@JovanVeljanoski I did read that issue before posting my bug report, but since it was opened in April 2021 and has had no comments since March 2022, I assumed it no longer applied. That said, I don't see what the source of the problem is. In issue #1318 there is even a comment stating that the NumPy percentile function works just fine on a Mac machine with NumPy v1.22, yet the Vaex function median_approx still returns nan. Is there a solution to this problem, or at least a reliable workaround?

JovanVeljanoski commented 1 year ago

As per the thread I posted earlier, and various other discussions on the boards: this is a numpy issue and not related to vaex. In your OP you mention a different OS and a different numpy version compared to your second message. Maybe you have mixed something up there.

The workaround is to find a numpy version that works (this has also been discussed various times on the boards). You can verify this by checking whether numpy.percentile works.

I just tried this on a macOS machine btw:

[screenshot: the numpy.percentile / percentile_approx check returning finite values on macOS]
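In plain code, the check amounts to something like this (a quick sketch against the example dataset; exact numbers will differ per machine):

import vaex
import numpy as np

df = vaex.example()

# if numpy itself is fine, this prints a finite number
print(np.percentile(df.x.values, 50))

# ... and then the vaex side should give a finite number too
print(df.median_approx('x'))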

Check that you are not mixing environments, etc.

(Bonus: no need to install vaex-arrow anymore; it has been deprecated for a long while now and its contents have been absorbed into vaex-core.)

abianco88 commented 1 year ago

In my post, I referred to the software information of the machine where I experienced the issue myself. Since you referred me to an old issue report (and immediately closed my post), I brought up the comment by heyuqi1970 in that same bug report you linked, which mentions that on a Mac system where NumPy worked as expected, Vaex was still returning nan. Hence, I am not "mixing something up there" or "mixing environments". I would appreciate not being dismissed immediately.

I would also like to know what exactly you mean by a "NumPy version that works […] by checking whether numpy.percentile works". As it happens, the NumPy version I'm running in the environment on a Win10 box works perfectly by that metric (see the screenshot below, where I repeated the exact sequence of commands you showed in yours and added a couple of lines to illustrate that numpy.percentile behaves as expected on an arbitrary array). Are you implying that there is an issue with how numpy.percentile behaves on a Vaex DataFrame/column? It seems to me that numpy.percentile works just fine otherwise. Is the root cause of the error known?

[screenshot: the same sequence of commands on the Win10 environment, with numpy.percentile returning finite values]

Ben-Epstein commented 1 year ago

@JovanVeljanoski I am seeing the same thing as @abianco88 and have never been able to get around it. I'm not certain it's only a numpy issue.

import vaex
import numpy as np

print("vaex", vaex.__version__)
print("numpy", np.__version__)

df = vaex.example()

print(df.percentile_approx("x", 0.5))
print(np.percentile(df.x.values, 0.5))
[screenshot: percentile_approx returning nan while np.percentile returns a finite value]

JovanVeljanoski commented 1 year ago

Hey @Ben-Epstein

Yeah, we've looked at this in some detail at various points in the past, and it was always due to numpy. Perhaps @maartenbreddels can explain it better than I can.

It might be more obvious when you build vaex from source (i.e. reinstall/recompile) under different versions of numpy. I assume you install it via either pip or conda-forge, which (I believe?) expects the numpy version it was built with. This is generally not so strict, but for this one method (i.e. the percentile stuff) it really matters. The workaround / solution is to use an appropriate version of numpy.

In fact, I see the requirements specify numpy<1.21.

If you still think I am wrong on this: PRs are always welcome :)

Ben-Epstein commented 1 year ago

@JovanVeljanoski so I did some playing around, and I have a scenario that may change your mind about this :)

Also @abianco88 see if this works for you!

So, as shown above, I always get a nan when calling percentile_approx (I've tried with ~5 different numpy versions).

[screenshot: percentile_approx("x", 0.5) returning nan]

BUT, I get the correct value when I do this

import vaex
import numpy as np

df2 = vaex.example()

xmin, xmax = float(df2.x.min()), float(df2.x.max())
print(df2.percentile_approx("x", 75, binby="x", shape=1, limits=[xmin, xmax])[0])
print(np.percentile(df2.x.to_numpy(), 75))
[screenshot: the workaround call and np.percentile returning matching finite values]

So this makes me think that there is something going on with vaex, seeing as these extra parameters directly fix the problem.

In my case, this workaround is great because I always know the max and min of my particular column
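If it helps, the whole thing can be wrapped in a small helper (percentile_with_binby is just a name I made up for this sketch; it assumes the workaround keeps behaving as shown above):

import vaex

def percentile_with_binby(df, column, percentage):
    # compute the column limits first, then apply the binby/shape/limits workaround
    vmin, vmax = float(df[column].min()), float(df[column].max())
    result = df.percentile_approx(column, percentage, binby=column, shape=1, limits=[vmin, vmax])
    return float(result[0])

df = vaex.example()
# should print a finite value close to np.percentile(df.x.to_numpy(), 75)
print(percentile_with_binby(df, "x", 75))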

abianco88 commented 1 year ago

This is not about being right or wrong. I'm just a humble user of this promising product, who's hoping it'll get better and maybe, soon enough, become a serious contender, if not a drop-in replacement, for other production-ready tools. Hence, no need to take it personally.

On a Windows 10 machine (the same one referenced in my OP), I created a brand-new virtual environment with the numpy<1.21 requirement and tried the same commands as in @Ben-Epstein's post. It still returns nan.

conda create -c conda-forge -n old_numpy_for_vaex
conda activate old_numpy_for_vaex
conda install -c conda-forge "numpy<1.21" pandas vaex-core vaex-viz vaex-ml vaex-hdf5
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:50:36) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import vaex
>>> import numpy as np
>>>
>>> print("vaex", vaex.__version__)
vaex {'vaex-core': '4.14.0', 'vaex-viz': '0.5.4', 'vaex-hdf5': '0.13.0', 'vaex-ml': '0.18.0'}
>>> print("numpy", np.__version__)
numpy 1.20.3
>>>
>>> df = vaex.example()
>>> print(df.percentile_approx("x", 0.5))
nan
>>> print(np.percentile(df.x.values, 0.5))
-23.44295965194702
>>>
abianco88 commented 1 year ago

@Ben-Epstein your workaround above does work in the environment of my original post. Thank you!

I can't say for sure, but I suspect this workaround adds a computational burden to the operation: it requires first computing the minimum and maximum of a data frame column (which, I assume, "wastes" time, especially for large datasets) and then passing more arguments to the percentile_approx call (which may add overhead to the execution of the function). Hence, while this trick makes the function usable, I would claim it is sub-optimal. Do you guys agree?
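On the overhead point: if I'm reading the vaex API correctly, a minmax aggregation can get both extremes in a single pass over the data, which would at least halve that extra cost (untested sketch, assuming df.minmax works as documented):

import vaex

df = vaex.example()

# one pass for both limits instead of separate min() and max() calls
xmin, xmax = df.minmax("x")
print(df.percentile_approx("x", 75, binby="x", shape=1, limits=[float(xmin), float(xmax)])[0])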

This brings me back to @Ben-Epstein's point:

this makes me think that there is something going on with vaex, seeing as these extra parameters directly fix the problem.

I totally agree with him. The fact that adding extra (redundant) arguments to the function call magically makes it work seems to indicate that the root cause of the unexpected behavior is in Vaex rather than NumPy. The "new" issue now is: how can we fix it?

Ben-Epstein commented 1 year ago

@abianco88 certainly, calculating min and max adds overhead. For my case, it's not an issue (it's always 0 and 1)... but I can't speak for yours.

I'm sure there's a lot of complexity under the hood here that I don't understand. One thing (if this is helpful to the Vaex team) is that numpy has a nanpercentile which ignores nan values. Maybe using that as the underlying function for vaex might help? https://numpy.org/doc/stable/reference/generated/numpy.nanpercentile.html#numpy.nanpercentile
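For reference, this is the kind of behaviour I mean (plain numpy, just to show that the nan gets skipped):

import numpy as np

values = np.array([1.0, np.nan, 3.0])
print(np.percentile(values, 50))     # nan: a single nan propagates
print(np.nanpercentile(values, 50))  # 2.0: the nan is ignored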

abianco88 commented 1 year ago

@Ben-Epstein I had thought about NumPy's nanpercentile function too; however, I was also considering that Vaex's example data frame doesn't have any nan's and the issue is still present. I counted the nan's in the example columns with the code below and it seems there are none.

import vaex
vdf = vaex.example()
count_na = []
for col in vdf.column_names:
    count_na.append(vdf[col].isna().sum().item())
count_na
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Sure enough, if I run the following loop, I get the usual nan answers.

for col in vdf.column_names:
    print(vdf.median_approx(col))
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
nan
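For what it's worth, applying @Ben-Epstein's workaround column by column does seem to give finite numbers for the whole frame (a sketch; I'm assuming every column in the example set is numeric):

for col in vdf.column_names:
    cmin, cmax = float(vdf[col].min()), float(vdf[col].max())
    print(col, vdf.percentile_approx(col, 50, binby=col, shape=1, limits=[cmin, cmax])[0])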

I don't know how the percentile_approx function works (I can't even locate it in the code base... my bad!), but given your workaround, I'd bet it has to do with the way the interpolation works at the boundaries of the set. Perhaps something happens there that screws up the result. Another possibility might be the binby approach (which still makes me think of the way the set partitioning is done).
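If the boundary/interpolation theory is right, one cheap probe (assuming I'm reading the signature correctly and percentile_approx really does take a percentile_shape parameter for its internal cumulative histogram) would be to vary that resolution and see whether the result changes:

import vaex

vdf = vaex.example()

# untested: if the boundary/interpolation theory holds, changing the histogram
# resolution might change whether a finite value comes back
for res in (256, 1024, 8192):
    print(res, vdf.percentile_approx('x', 50, percentile_shape=res))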

Ben-Epstein commented 1 year ago

@abianco88 it seems like the limits param doesn't change anything in terms of the results or performance (I've been testing it).

So feel free to drop that param, should speed things up (no min/max calculation)!

abianco88 commented 1 year ago

@Ben-Epstein are you doing the following?

import vaex
import numpy as np

df2 = vaex.example()

print(df2.percentile_approx("x", 75, binby="x", shape=1))
print(np.percentile(df2.x.to_numpy(), 75))

This returns: [3.44576169] and 3.4649062752723694, so I would say it works.
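Out of curiosity, the median from my original post can be checked the same way (a sketch; with percentage=50 I'd expect something in the ballpark of the pandas value of about -0.05):

import vaex

df2 = vaex.example()

# median of 'x' via the simplified workaround; pandas reported about -0.0506 for this column above
print(df2.percentile_approx("x", 50, binby="x", shape=1)[0])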

Now I wonder: does the incorrect percentile_approx behavior have to do with the binby argument? If so, what is causing the issue? I wish @maartenbreddels would chime in and help us sort this out.

Ben-Epstein commented 1 year ago

@abianco88 yes, that's what I'm doing, and it seems to work!

BTW, from my testing, numpy is faster when you have < 1e6 samples; beyond that, vaex starts to get much faster. Just worth keeping in mind if you are using this in a server of some sort.

I'm sure Maarten will come along eventually :) they are probably quite busy working on loads of cool features

abianco88 commented 1 year ago

BTW, from my testing, numpy is faster when you have < 1e6 samples; beyond that, vaex starts to get much faster. Just worth keeping in mind if you are using this in a server of some sort.

I'm trying to execute this on data frames ranging between 5 and 25 million rows, so I'm hoping to see a performance improvement from Vaex! ;-)

JovanVeljanoski commented 1 year ago

Let's keep this open until we figure out the cause.

Edit: I don't have access to a Windows machine, but if memory serves, numpy version 1.19.5 should work?

abianco88 commented 1 year ago

I wish I could test that on my Windows machine. I've been trying to install vaex-core in a clean conda environment with only NumPy installed. Whether I pass "numpy<=1.19.5" or "numpy<1.19.5" to the conda install -c conda-forge command that prepares the environment (installing NumPy first), when I then execute conda install -c conda-forge vaex-core I get the following output, with the "solving environment" part going on and on for hours.

Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: -

I'm not sure what is going on, but conda seems to have a hard time figuring out the dependencies.

This said, I don't know where this notion of "it's NumPy's fault" is coming from. I believe @Ben-Epstein's posts showed that the percentile_approx function produces a reasonable (?) output when extra arguments are passed, which in turn seems to take the blame away from NumPy. If I am not misunderstanding the docs, we shouldn't need to pass the binby and shape arguments when calling percentile_approx on a vaex.DataFrame column to calculate a "standard" percentile.