posit-dev / positron

Positron, a next-generation data science IDE
https://positron.posit.co

Consider Polars behavior for columns with only np.nan values #4352

Open petetronic opened 3 months ago

petetronic commented 3 months ago

Following on from #4307 and the associated fix #4329, we should review how a Polars series containing only np.nan values is summarized. Our treatment of Polars differs from our treatment of Pandas:

Polars with np.nan series

import polars as pl
import numpy as np
pl_nan = pl.DataFrame({"missing": pl.Series([np.nan] * 5, dtype=pl.Float64)})
[Screenshot 2024-08-14 at 10 13 33 AM]

Pandas with np.nan series

import pandas as pd
import numpy as np

pd_nan = pd.DataFrame({"missing": pd.Series([np.nan] * 5, dtype="float64")})
[Screenshot 2024-08-14 at 10 14 52 AM]

dfalbel commented 3 months ago

There's some reasoning for this behavior in the Polars docs:

https://docs.pola.rs/user-guide/expressions/missing-data/#notanumber-or-nan-values

Basically, unlike pandas, Polars doesn't treat NaNs as missing data, so NaN seems to be the expected result for the summary stats here. Note also that %missing is 0. If we change the summary stats, we would want to change the behavior of %missing here too.