hgrecco / pint-pandas

Pandas support for pint

Can pint-pandas better manage UnitStrippedWarning? And display() more nicely? #125

Closed MichaelTiemannOSC closed 1 year ago

MichaelTiemannOSC commented 2 years ago

I have created a sample notebook that demonstrates the creation of a dataframe with both quantified and non-quantified columns. Among the quantified columns, some are homogeneous in their units and others are heterogeneous. I want to write these dataframes to a Trino database and then read them back in, and I now have functions to do all that. What I don't have is a good understanding of whether or how to tame the warning messages that say:

/opt/app-root/lib64/python3.8/site-packages/pint_pandas/pint_array.py:648: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  return np.array(qtys, dtype="object", copy=copy)

Here is the notebook in question: https://github.com/os-climate/data-platform-demo/blob/master/notebooks/pint-demo.ipynb

Here's an annotated explanation of one of my frustrations:

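import numpy as np
import pandas as pd
from pint_pandas import PintArray

# ureg and Q_ are defined earlier in the linked notebook; its registry has to define USD, CO2, and Fe_ton
sample_df = pd.DataFrame({'company_name': ['PG&E Corp.', 'PNM Resources, Inc.', 'POSCO', 'PPL Corp.'],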
                          'company_lei': ['8YQ2GSDWYZXO2EDN3511', '5493003JOBJGLZSDDQ28', '988400E5HRVX81AYLM04', '9N3UAJSNOUXFKQLF3V18'],
                          'comapny_isin': ['US69331C1080','US69349H1077','KR7005490008','US69351T1060'],
                          '2019_revenue': PintArray([17129000000.0,1457603000.0,55955872344.0,7769000000.0],'USD'),
                          '2016_ghg_s1': PintArray([2.216543993,6.337250786,81.309800,30.08848723],'Mt CO2'),
                          '2017_ghg_s1': PintArray([2.251191566,6.488768702,75.633360,30.24837146],'Mt CO2'),
                          '2018_ghg_s1': PintArray([2.451149772,5.217895758,77.391479,31.61146904],'Mt CO2'),
                          '2019_ghg_s1': PintArray([2.451149772,np.nan,77.391479,np.nan],'Mt CO2')
                          # As of 20220430, the following lines create the dataframe correctly, but throw UnitStrippedWarning
                          # '2016_production': [Q_(32.993292,'TWh'),Q_(10.2316757,'TWh'),Q_(42199000.0,'Fe_ton'),Q_(34.61322117,'TWh')],
                          # '2017_production': [Q_(34.490224,'TWh'),Q_(10.1709745,'TWh'),Q_(37207000.0,'Fe_ton'),Q_(33.53286848,'TWh')],
                          # '2018_production': [Q_(32.28122,'TWh'),Q_(9.307788099,'TWh'),Q_(37735000.0,'Fe_ton'),Q_(35.57197004,'TWh')],
                          })
# We can construct an equivalent DataFrame by separating magnitudes and units, and then combining via multiplication
s_2016 = pd.Series(data=[32.993292, 10.2316757, 42199000.0, 34.61322117], name='2016_production') * pd.Series(data=[ureg(x).u for x in ['TWh','TWh','Fe_ton','TWh']], name='2016_production')
s_2017 = pd.Series(data=[34.490224, 10.1709745, 37207000.0, 33.53286848], name='2017_production') * pd.Series(data=[ureg(x).u for x in ['TWh','TWh','Fe_ton','TWh']], name='2017_production')
s_2018 = pd.Series(data=[32.28122, 9.307788099, 37735000.0, 35.57197004], name='2018_production') * pd.Series(data=[ureg(x).u for x in ['TWh','TWh','Fe_ton','TWh']], name='2018_production')
sample_df = pd.concat([sample_df, s_2016, s_2017, s_2018], axis=1).convert_dtypes()

And when I try to execute sample_df.sort_values(by='company_name') I get this:

/opt/app-root/lib64/python3.8/site-packages/pint_pandas/pint_array.py:648: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  return np.array(qtys, dtype="object", copy=copy)
/opt/app-root/lib64/python3.8/site-packages/pint_pandas/pint_array.py:648: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  return np.array(qtys, dtype="object", copy=copy)

before I get the rendered:

| company_name | company_lei | comapny_isin | 2019_revenue | 2016_ghg_s1 | 2017_ghg_s1 | 2018_ghg_s1 | 2019_ghg_s1 | 2016_production | 2017_production | 2018_production |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PG&E Corp. | 8YQ2GSDWYZXO2EDN3511 | US69331C1080 | 17129000000.0 | 2.216543993 | 2.251191566 | 2.451149772 | 2.451149772 | 32.993292 terawatt_hour | 34.490224 terawatt_hour | 32.28122 terawatt_hour |
| PNM Resources, Inc. | 5493003JOBJGLZSDDQ28 | US69349H1077 | 1457603000.0 | 6.337250786 | 6.488768702 | 5.217895758 | nan | 10.2316757 terawatt_hour | 10.1709745 terawatt_hour | 9.307788099 terawatt_hour |
| POSCO | 988400E5HRVX81AYLM04 | KR7005490008 | 55955872344.0 | 81.3098 | 75.63336 | 77.391479 | 77.391479 | 42199000.0 Fe_ton | 37207000.0 Fe_ton | 37735000.0 Fe_ton |
| PPL Corp. | 9N3UAJSNOUXFKQLF3V18 | US69351T1060 | 7769000000.0 | 30.08848723 | 30.24837146 | 31.61146904 | nan | 34.61322117 terawatt_hour | 33.53286848 terawatt_hour | 35.57197004 terawatt_hour |

And yes it would be nice if the above properly showed the units stashed in the homogeneous columns.

andrewgsavage commented 2 years ago

I have created a sample notebook that demonstrates the creation of a dataframe with both quantified and non-quantified columns. In the Quantified cases, some columns are homogeneous in their units, others are heterogeneous.

Heterogeneous units with different dimensions (which can't be converted to a single unit for the whole column) are not supported.


                          # As of 20220430, the following lines create the dataframe correctly, but throw UnitStrippedWarning
                          # '2016_production': [Q_(32.993292,'TWh'),Q_(10.2316757,'TWh'),Q_(42199000.0,'Fe_ton'),Q_(34.61322117,'TWh')],

This line does not create a PintArray, and neither do the lines after it. Use df.dtypes to confirm this. You can also tell from the rendered output, which shows the units in the cells.

If you can convert from Fe_ton to TWh, then try making a dataframe with company and units as the columns and yearly production as the rows. Convert the dataframe to TWh, then transpose it and append it to the rest of the data.

andrewgsavage commented 2 years ago

And yes it would be nice if the above properly showed the units stashed in the homogeneous columns

That's a pandas issue. df.pint.dequantify() is a workaround.

MichaelTiemannOSC commented 2 years ago

I definitely understand that PintArrays need to be homogeneous in their datatypes, which "units of production" are not (because some units are TWh and some are Fe_ton, and yet others are yet other things not contained in this example). I see now that pint.dequantify() can be useful for the purposes of printing a dataframe with unit information per cell. When my Jupyter notebook server comes back from the dead, I'll give that a try.

andrewgsavage commented 1 year ago

I think UnitStrippedWarning should be gone now - can you confirm if this is still an issue?