hgrecco / pint-pandas

Pandas support for pint

`float("nan")` not always converted to `pd.NA` inside series with pint dtype #238

Open scanzy opened 2 weeks ago

scanzy commented 2 weeks ago

Hello, I am facing this issue while building a pd.Series with pint dtype.

  1. When float("nan") is alone, it remains float("nan").
  2. When float("nan") is with other values, it is converted into pd.NA.

This is not evident when printing the series (the formatting always shows nan), but values or tolist() reveal the difference.

import pint as pt
import pandas as pd
import pint_pandas

# case 1: float nan alone
print(pd.Series([float("nan")], dtype="pint[MW]").tolist())
# gives: [<Quantity(nan, 'megawatt')>]

# case 2: float nan with other values
print(pd.Series([float("nan"), 0.0], dtype="pint[MW]").tolist())
# gives: [<Quantity(<NA>, 'megawatt')>, <Quantity(0.0, 'megawatt')>]

I assumed that float("nan") was the default value meaning "magnitude not set". The fact that nan is converted to pd.NA depending on the other values in the series looks a bit tricky to me: is it intended?

I am looking for a way to keep not-set values consistent (either all float("nan"), or all pd.NA), but:

  1. Trying to convert pd.NA to float("nan") has no effect.
  2. If I try to convert float("nan") to pd.NA, I get a ValueError.

# test 1: trying to convert pd.NA to nan
s = pd.Series([float("nan"), 0.0], dtype="pint[MW]")
print(s.tolist())
# gives: [<Quantity(<NA>, 'megawatt')>, <Quantity(0, 'megawatt')>]

print(s.fillna(float("nan")).tolist())
# gives the same: [<Quantity(<NA>, 'megawatt')>, <Quantity(0, 'megawatt')>]

# test 2: trying to convert nan to pd.NA
s = pd.Series([float("nan")], dtype="pint[MW]")
print(s.tolist())
# gives: [<Quantity(nan, 'megawatt')>]

s.fillna(pd.NA)
# gives: ValueError: float() argument must be a string or a real number, not 'NAType'
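While the two cases behave differently, one way to check for not-set magnitudes without caring which marker is stored is `pd.isna`, which returns True for both `float("nan")` and `pd.NA`. The following is a minimal sketch using plain pandas only. `is_unset` is a hypothetical helper name, and calling it on `q.magnitude` for each quantity returned by `tolist()` is an assumption about usage, not something pint-pandas provides.

```python
import pandas as pd


def is_unset(magnitude) -> bool:
    """Return True for either missing-value marker (float nan or pd.NA)."""
    # Equality cannot be used here: nan != nan, and pd.NA == pd.NA is itself NA.
    return bool(pd.isna(magnitude))


print(is_unset(float("nan")))
print(is_unset(pd.NA))
print(is_unset(0.0))
```

This only makes the check consistent. It does not change what is stored in the series.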

versions:
- Python 3.11.2
- pandas 2.2.2
- Pint 0.24.1
- Pint-Pandas 0.6

andrewgsavage commented 1 week ago

The difference is due to the underlying data type:

s = pd.Series([float("nan"), 0.0], dtype="pint[MW]")
s.values.data
<FloatingArray>
[<NA>, 0.0]
Length: 2, dtype: Float64

s = pd.Series([float("nan")], dtype="pint[MW]")
s.values.data
<NumpyExtensionArray>
[nan]
Length: 1, dtype: float64
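The same split can be reproduced with pandas alone, without pint-pandas. This is a minimal sketch assuming pandas 2.x behaviour, with illustrative variable names: the nullable `Float64` dtype stores `float("nan")` as `pd.NA` on construction, while the NumPy-backed `float64` dtype keeps it as an ordinary `nan`.

```python
import math

import pandas as pd

# Nullable masked array: nan is treated as missing and stored as pd.NA.
masked = pd.array([float("nan"), 0.0], dtype="Float64")

# NumPy-backed array: nan stays an ordinary float value.
plain = pd.array([float("nan")], dtype="float64")

print(type(masked).__name__, masked[0] is pd.NA)
print(type(plain).__name__, math.isnan(plain[0]))
```

So which marker ends up inside the quantity depends on which of these two array types holds the magnitudes.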

andrewgsavage commented 1 week ago

I think pint-pandas should:

  1. By default, convert data to a FloatingArray
  2. Have an option to change the conversion to some other dtype
  3. Have an option to prevent conversion, allowing any dtype as the underlying data dtype. In this case, specify the underlying dtype in the pint dtype, e.g. 'pint[MW][Float64]'
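The three options above could look roughly like the sketch below, written against plain pandas. `convert_magnitudes` and its `target` parameter are hypothetical names used for illustration only. They are not part of pint-pandas, and the sketch does not cover how a 'pint[MW][Float64]' dtype string would be parsed.

```python
import pandas as pd


def convert_magnitudes(values, target="Float64"):
    """Hypothetical helper illustrating the proposal; not a pint-pandas API.

    target="Float64"  -> default: nullable FloatingArray (nan becomes pd.NA)
    target=<dtype>    -> convert to some other dtype instead
    target=None       -> no conversion, keep the data exactly as given
    """
    if target is None:
        # Option 3: the caller is responsible for the underlying dtype.
        return values
    # Options 1 and 2: let pandas build an array of the requested dtype.
    return pd.array(values, dtype=target)


default = convert_magnitudes([float("nan")])                   # option 1
numpy_backed = convert_magnitudes([float("nan")], "float64")   # option 2
existing = pd.array([1.0], dtype="float64")
untouched = convert_magnitudes(existing, target=None)          # option 3
print(default[0] is pd.NA, untouched is existing)
```

With a default like this, a series holding a lone float("nan") and a series holding nan alongside other values would use the same missing-value marker.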