vaexio / vaex

Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python, ML, visualization and exploration of big tabular data at a billion rows per second 🚀
https://vaex.io
MIT License

[BUG-REPORT] Interchange `Column.null_count` does not count floating-point NaNs correctly #2120

Closed honno closed 2 years ago

honno commented 2 years ago

When an interchange protocol column holds NaNs, null_count still returns 0.

>>> import vaex
>>> df = vaex.from_dict({"foo": [float("nan")]})
>>> interchange_df = df.__dataframe__()
>>> interchange_col = interchange_df.get_column_by_name("foo")
>>> interchange_col.null_count
0  # should be 1

null_count here is using Expression.countmissing() internally

https://github.com/vaexio/vaex/blob/35c250d585f889272b8ef1096de6fa5462816f52/packages/vaex-core/vaex/dataframe_protocol.py#L495-L500

but I assume it should be using countna() instead.

>>> df["foo"].countmissing()
0
>>> df["foo"].countna()
1

JovanVeljanoski commented 2 years ago

Yeah, so a NULL is not the same as nan.

nan (not a number) is a float, but NULL is a missing value. NA (not available) is both of those.

So nan is not really missing (or NULL): the value is there, but it is not a number (the result of 0/0, for example). NULL means the value itself is missing (a measurement that was never made, for example).
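
To make the distinction concrete, here is a minimal sketch in plain Python. It is not vaex's implementation: it assumes NULL is modelled as None and nan as float("nan"), and the helper names count_nan, count_missing and count_na are hypothetical, chosen only to mirror the meaning of the vaex methods mentioned in this thread.

```python
import math


def is_nan(value):
    """True only for a float NaN; None (NULL) is not NaN."""
    return isinstance(value, float) and math.isnan(value)


def count_nan(values):
    """Values that are present but are not a number."""
    return sum(1 for v in values if is_nan(v))


def count_missing(values):
    """Values that are absent altogether (NULL, modelled here as None)."""
    return sum(1 for v in values if v is None)


def count_na(values):
    """NA covers both cases: NaN or NULL."""
    return sum(1 for v in values if v is None or is_nan(v))


# One ordinary value, one NaN, one NULL, one ordinary value.
column = [1.0, float("nan"), None, 4.0]
print(count_nan(column), count_missing(column), count_na(column))
```

With this column the NaN count and the NULL count are each 1, while the NA count is 2, which is the gap between countmissing() and countna() reported above.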

We should make a guide on these differences between special or missing values.

P.S.: I know very little about the interchange protocol, but on the vaex side, this is how we think about these special values.

honno commented 2 years ago

Ah, hadn't realised there was a semantic model of NA encompassing NaN and NULL—makes a lot of sense!

However, I think that for the purposes of the interchange protocol, "null" is meant to include NaN values (see the describe_null docstring). For example, pandas currently counts NaN in null_count:

>>> import pandas as pd
>>> df = pd.DataFrame({"foo": [float("nan")]})
>>> interchange_df = df.__dataframe__()
>>> interchange_col = interchange_df.get_column_by_name("foo")
>>> interchange_col.null_count
1

JovanVeljanoski commented 2 years ago

@maartenbreddels maybe you can shed some more light here..

maartenbreddels commented 2 years ago

We had a lot of discussions about this, but I never managed to convincingly convey the importance of the difference between the two. I think we just have to accept that, using the interchange protocol, one cannot tell the two apart.
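
A rough sketch of what that means for a consumer, assuming a hypothetical SketchColumn class that stands in for an interchange column (it is not the vaex code, and the real fix may look different): once null_count counts NULL and NaN together, the two kinds of value can no longer be told apart from that number alone.

```python
import math


class SketchColumn:
    """Hypothetical stand-in for an interchange column (not vaex code)."""

    def __init__(self, values):
        # NULL is modelled as None, NaN as float("nan") (an assumption).
        self._values = list(values)

    @property
    def null_count(self):
        # Count NULL and NaN together, in the spirit of countna().
        return sum(
            1
            for v in self._values
            if v is None or (isinstance(v, float) and math.isnan(v))
        )


only_nan = SketchColumn([float("nan")])
only_null = SketchColumn([None])

# Both columns report the same count, so the difference is not visible here.
print(only_nan.null_count, only_null.null_count)
```

The first column corresponds to the case in the original report, where the expected count is 1.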

@honno do you feel like opening a PR to fix this?