Can pint-pandas better manage UnitStrippedWarning? And display() more nicely?

MichaelTiemannOSC commented 2 years ago

I have created a sample notebook that demonstrates the creation of a dataframe with both quantified and non-quantified columns. In the quantified cases, some columns are homogeneous in their units, others are heterogeneous. I want to write these dataframes to a Trino database and then read them back in, and I now have functions to do all that. What I don't have is a good understanding of whether or how to tame the warning messages that say:

/opt/app-root/lib64/python3.8/site-packages/pint_pandas/pint_array.py:648: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  return np.array(qtys, dtype="object", copy=copy)

Here is the notebook in question: https://github.com/os-climate/data-platform-demo/blob/master/notebooks/pint-demo.ipynb

Here's an annotated explanation of one of my frustrations:

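# Setup assumed by this snippet (not shown in the post); the custom units USD, CO2 and Fe_ton are presumably defined in the linked notebook
import numpy as np
import pandas as pd
import pint_pandas
from pint_pandas import PintArray
ureg = pint_pandas.PintType.ureg
Q_ = ureg.Quantity
sample_df = pd.DataFrame({'company_name': ['PG&E Corp.', 'PNM Resources, Inc.', 'POSCO', 'PPL Corp.'],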
                          'company_lei': ['8YQ2GSDWYZXO2EDN3511', '5493003JOBJGLZSDDQ28', '988400E5HRVX81AYLM04', '9N3UAJSNOUXFKQLF3V18'],
                          'comapny_isin': ['US69331C1080','US69349H1077','KR7005490008','US69351T1060'],
                          '2019_revenue': PintArray([17129000000.0,1457603000.0,55955872344.0,7769000000.0],'USD'),
                          '2016_ghg_s1': PintArray([2.216543993,6.337250786,81.309800,30.08848723],'Mt CO2'),
                          '2017_ghg_s1': PintArray([2.251191566,6.488768702,75.633360,30.24837146],'Mt CO2'),
                          '2018_ghg_s1': PintArray([2.451149772,5.217895758,77.391479,31.61146904],'Mt CO2'),
                          '2019_ghg_s1': PintArray([2.451149772,np.nan,77.391479,np.nan],'Mt CO2')
                          # As of 20220430, the following creates the dataframe correctly, but throws UnitStrippedWarning
                          # '2016_production': [Q_(32.993292,'TWh'),Q_(10.2316757,'TWh'),Q_(42199000.0,'Fe_ton'),Q_(34.61322117,'TWh')],
                          # '2017_production': [Q_(34.490224,'TWh'),Q_(10.1709745,'TWh'),Q_(37207000.0,'Fe_ton'),Q_(33.53286848,'TWh')],
                          # '2018_production': [Q_(32.28122,'TWh'),Q_(9.307788099,'TWh'),Q_(37735000.0,'Fe_ton'),Q_(35.57197004,'TWh')],
                          })
# We can construct an equivalent DataFrame by separating magnitudes and units, and then combining via multiplication
s_2016 = pd.Series(data=[32.993292, 10.2316757, 42199000.0, 34.61322117], name='2016_production') * pd.Series(data=[ureg(x).u for x in ['TWh','TWh','Fe_ton','TWh']], name='2016_production')
s_2017 = pd.Series(data=[34.490224, 10.1709745, 37207000.0, 33.53286848], name='2017_production') * pd.Series(data=[ureg(x).u for x in ['TWh','TWh','Fe_ton','TWh']], name='2017_production')
s_2018 = pd.Series(data=[32.28122, 9.307788099, 37735000.0, 35.57197004], name='2018_production') * pd.Series(data=[ureg(x).u for x in ['TWh','TWh','Fe_ton','TWh']], name='2018_production')
sample_df = pd.concat([sample_df, s_2016, s_2017, s_2018], axis=1).convert_dtypes()

And when I try to execute sample_df.sort_values(by='company_name') I get this:

/opt/app-root/lib64/python3.8/site-packages/pint_pandas/pint_array.py:648: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  return np.array(qtys, dtype="object", copy=copy)
/opt/app-root/lib64/python3.8/site-packages/pint_pandas/pint_array.py:648: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  return np.array(qtys, dtype="object", copy=copy)

before I get the rendered output:

company_name	company_lei	comapny_isin	2019_revenue	2016_ghg_s1	2017_ghg_s1	2018_ghg_s1	2019_ghg_s1	2016_production	2017_production	2018_production
PG&E Corp.	8YQ2GSDWYZXO2EDN3511	US69331C1080	17129000000.0	2.216543993	2.251191566	2.451149772	2.451149772	32.993292 terawatt_hour	34.490224 terawatt_hour	32.28122 terawatt_hour
PNM Resources, Inc.	5493003JOBJGLZSDDQ28	US69349H1077	1457603000.0	6.337250786	6.488768702	5.217895758	nan	10.2316757 terawatt_hour	10.1709745 terawatt_hour	9.307788099 terawatt_hour
POSCO	988400E5HRVX81AYLM04	KR7005490008	55955872344.0	81.3098	75.63336	77.391479	77.391479	42199000.0 Fe_ton	37207000.0 Fe_ton	37735000.0 Fe_ton
PPL Corp.	9N3UAJSNOUXFKQLF3V18	US69351T1060	7769000000.0	30.08848723	30.24837146	31.61146904	nan	34.61322117 terawatt_hour	33.53286848 terawatt_hour	35.57197004 terawatt_hour

And yes it would be nice if the above properly showed the units stashed in the homogeneous columns.

andrewgsavage commented 2 years ago

> I have created a sample notebook that demonstrates the creation of a dataframe with both quantified and non-quantified columns. In the quantified cases, some columns are homogeneous in their units, others are heterogeneous.

Columns holding heterogeneous units of different dimensions (which therefore can't be converted to a single unit for the whole column) are not supported.


> # As of 20220430, the following creates the dataframe correctly, but throws UnitStrippedWarning
> # '2016_production': [Q_(32.993292,'TWh'),Q_(10.2316757,'TWh'),Q_(42199000.0,'Fe_ton'),Q_(34.61322117,'TWh')],

This line does not create a PintArray, and neither do the lines after it. Use df.dtypes to confirm this. You can also tell because the rendered output shows the units in the cells.

If you can convert from Fe_ton to TWh, you could try making a dataframe with company and units as the columns, and yearly production as the rows. Convert the dataframe to TWh, then transpose and append it to the rest of the data.

andrewgsavage commented 2 years ago

> And yes it would be nice if the above properly showed the units stashed in the homogeneous columns.

That's a pandas issue. df.pint.dequantify() is a workaround.

MichaelTiemannOSC commented 2 years ago

I definitely understand that PintArrays need to be homogeneous in their units, which "units of production" are not (because some are TWh, some are Fe_ton, and others are units not included in this example). I see now that df.pint.dequantify() can be useful for printing a dataframe with its unit information. When my Jupyter notebook server comes back from the dead, I'll give that a try.

andrewgsavage commented 1 year ago

I think UnitStrippedWarning should be gone now - can you confirm if this is still an issue?

hgrecco / pint-pandas

Can pint-pandas better manage UnitStrippedWarning? And display() more nicely? #125