hgrecco / pint-pandas

Pandas support for pint

Constructor path unexpectedly changes outcome #205

Open rwijtvliet opened 10 months ago

rwijtvliet commented 10 months ago

Example

import pandas as pd
import pint
import pint_pandas
import numpy as np
units1 = pd.Series([3.0, np.nan]).astype('pint[MWh]')
units2 = pd.Series([3.0, np.nan], dtype='pint[MWh]')
units3 = pd.Series([3, np.nan], dtype='pint[MWh]')

Inconsistency 1: representation of NaN.

The value of the NaN element at index position 1 changes slightly between 1 and 2/3:

>>> units1[1]
<Quantity(nan, 'megawatt_hour')>
>>> units2[1]
<Quantity(<NA>, 'megawatt_hour')>
>>> units3[1]
<Quantity(<NA>, 'megawatt_hour')>

>>> type(units1[1].m)
<class 'numpy.float64'>
>>> type(units2[1].m)
<class 'pandas._libs.missing.NAType'>
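For context, a minimal sketch of why the two missing-value markers are not interchangeable. It uses only numpy and pandas (no pint), and the variable names are purely illustrative:

```python
import numpy as np
import pandas as pd

# np.nan is an ordinary float; pd.NA is a dedicated singleton.
nan_is_float = isinstance(np.nan, float)
na_is_float = isinstance(pd.NA, float)

# Comparisons behave differently: NaN compares unequal to everything
# (itself included), while comparing with pd.NA propagates pd.NA.
nan_equals_itself = np.nan == np.nan
na_comparison = pd.NA == 1

# pd.isna recognises both as missing.
both_missing = pd.isna(np.nan) and pd.isna(pd.NA)

print(nan_is_float, na_is_float, nan_equals_itself, na_comparison, both_missing)
```

So code that checks magnitudes with float-specific tools (for example `math.isnan`) may behave differently depending on which constructor path produced the series.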

Inconsistency 2: impact of int

Also, getting the magnitude delivers inconsistent results. Surprisingly, the difference here is between 1/2 and 3:

>>> units1.pint.m
0    3.0
1    NaN
dtype: float64
>>> units2.pint.m
0    3.0
1    NaN
dtype: float64
>>> units3.pint.m
0       3
1    <NA>
dtype: object

Notice how the latter is a series of objects.

Versions

Tested with the following versions of (pandas, pint, pint-pandas):

All give the same result

andrewgsavage commented 10 months ago

You're seeing the effects of the underlying data being stored in different ways. You can view this with units3.values.data. This happens because the data is passed through pd.array in __init__, which ensures the data array can store a form of NaN.
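A rough sketch of the dtype inference that pd.array performs on its own, outside pint-pandas. This only illustrates the pandas behaviour being referred to; how PintArray calls pd.array internally is not shown here, and the variable names are made up for the example:

```python
import numpy as np
import pandas as pd

# Float input: pandas infers its nullable floating extension dtype.
float_input = pd.array([3.0, np.nan])

# Integer input with a missing value: pandas infers its nullable
# integer extension dtype rather than falling back to float64.
int_input = pd.array([3, np.nan])

print(float_input.dtype)
print(int_input.dtype)

# The missing element of the integer array is still reported as missing.
print(pd.isna(int_input[1]))
```

The two inputs end up in different backing arrays, which would be consistent with units2 and units3 behaving differently when the magnitude is extracted.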

If I'm understanding correctly, you're expecting all these to behave the same. Previously the data was converted to float, which prevented these issues but meant ints and other types could not be stored. However, people wanted to store other dtypes.

Including the data array dtype in the repr, e.g. 'dtype: pint[m][int]', would help with these issues (or at least make it clearer why it's behaving oddly), but it hasn't been necessary so far.

MichaelTiemannOSC commented 10 months ago

In the Pandas world there's a long-running thread about resolving NA vs. np.nan for null values. I did quite a lot of work for a long time on a branch where I used NA as the na_value instead of np.nan, and it works great (and I believe it also works for both Float64 and Int64). To make this work, Pint simply needs to cast int64 arrays to Int64 and float64 to Float64. I reversed that when I started working on adding uncertainties to Pint and PintPandas, because Uncertainties is very hardwired to using np.nan as its na_value, and I found it easier to align all numeric types to np.nan (notwithstanding the problem you show). I would not be surprised if it were actually easy to add a behavioral flag (NA_VALUE) to set the value Pint and Pint Pandas use and have it just work. But I don't have time to develop/test that. (Still waiting for my uncertainties changes to make it through.)
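For readers unfamiliar with the casts mentioned above, here is a minimal pandas-only sketch of going from the NumPy dtypes (int64, float64) to the nullable extension dtypes (Int64, Float64). It is not the code from that branch, and it does not involve the NA_VALUE flag, which is only a proposal in this thread:

```python
import numpy as np
import pandas as pd

# NumPy-backed series: int64 cannot hold a missing value at all,
# and float64 represents missing values as np.nan.
ints = pd.Series(np.array([3, 4], dtype="int64"))
floats = pd.Series(np.array([3.0, np.nan], dtype="float64"))

# Cast to the nullable extension dtypes, which use pd.NA for missing values.
nullable_ints = ints.astype("Int64")
nullable_floats = floats.astype("Float64")

print(nullable_ints.dtype, nullable_floats.dtype)
print(nullable_floats.isna().tolist())
```

After the cast, both series share one missing-value convention, which is the consistency the NA-based approach is aiming for.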