vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License
[BUG-REPORT] Sometimes `std` returns `nan` when it shouldn't #2291

Open Ben-Epstein opened 1 year ago

Ben-Epstein commented 1 year ago

Description

import vaex

# 100 identical values, so the true standard deviation is exactly 0
df = vaex.from_arrays(value=[0.0179862099620915] * 100)

df.value.std()                 # nan (expected 0)
df.to_pandas_df().value.std()  # ~0

JovanVeljanoski commented 1 year ago

Hi Ben!

Thanks for this. I just pushed a quick fix. Hopefully @maartenbreddels manages to have a look at it soon.

Note: even with the fix, this case will still give a slightly different result from pandas/numpy, because vaex uses a numerically unstable algorithm for the standard deviation. But I think that matters mainly for edge cases like this one, where the variance is extremely small (close to 0).
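
For anyone wondering how an unstable variance formula can end up as `nan`, here is a minimal pure-Python sketch. It is not taken from the vaex source: that the computation resembles the textbook one-pass formula E[x^2] - E[x]^2 is an assumption, and the helper names `naive_variance`, `welford_variance` and `safe_std` are hypothetical. It compares that formula with Welford's algorithm, which is a well-known stable alternative, and shows one possible guard (clamping at zero) against rounding noise.

```python
import math


def naive_variance(values):
    """One-pass textbook formula: E[x^2] - E[x]^2 (population variance)."""
    n = len(values)
    mean = sum(values) / n
    mean_of_squares = sum(v * v for v in values) / n
    # Subtracting two nearly equal numbers: rounding error can leave a
    # tiny positive, zero, or tiny *negative* result.
    return mean_of_squares - mean * mean


def welford_variance(values):
    """Welford's online algorithm (population variance, ddof=0)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return m2 / count


def safe_std(variance):
    """Clamp rounding noise below zero before taking the square root."""
    return math.sqrt(max(variance, 0.0))


values = [0.0179862099620915] * 100

print(naive_variance(values))    # rounding noise around 0, sign not guaranteed
print(welford_variance(values))  # 0.0 for constant input
print(safe_std(naive_variance(values)))
```

If the naive result comes out slightly negative, `numpy.sqrt` returns `nan` for it (and `math.sqrt` raises `ValueError`), which would match the symptom reported above. Clamping hides the `nan` but leaves the rounding noise in place, whereas Welford's update avoids the cancellation altogether.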