hgrecco / pint-pandas

Pandas support for pint

Internals Question: problem creating a value PintType #138

Closed MichaelTiemannOSC closed 1 year ago

MichaelTiemannOSC commented 1 year ago

I've begun doing exploratory surgery on Pint-Pandas to see how ducky things are with uncertainties. I know that the "right" way to do this is to create some new extension types, but I'm exploring some quick-and-dirty things and have already found a few cases where More Work Will Be Needed (such as pd.DataFrame.rolling, which doesn't support ExtensionArrays at all).

I'm looking to add two PintArrays together, and down in arithmetic_op, as expected, should_extension_dispatch of my two operands is true, and there's a call:

        # Timedelta/Timestamp and other custom scalars are included in the check
        # because numexpr will fail on it, see GH#31457
        res_values = op(left, right)

left and right have the expected dtypes:

left =
<PintArray>
[nan, nan, 0.636+/-0.026, 0.635+/-0.033, 0.5741215117186019, 0.5717541911066631, 0.5701038996304524, 0.5684583714983479, 0.5668175929615984, 0.5651815503111365, 0.5635502298774642, 0.5619236180305387, 0.5603017011796584, 0.5586844657733495, 0.5570718982992526, 0.5554639852840103, 0.5538607132931535, 0.5522620689309905, 0.5506680388404943, 0.5490786097031911, 0.547493768239049, 0.5459135012063674, 0.5443377954016657, 0.5427666376595738, 0.5412000148527213, 0.5396379138916285, 0.5380803217245964, 0.5365272253375981, 0.5349786117541702, 0.5334344680353037, 0.5318947812793364, 0.5303595386218453, 0.5288287272355384, 0.5273023343301484, 0.525780347152325]
Length: 35, dtype: pint[CO2e * kt / gigawatt_hour]

right =
<PintArray>
[nan, nan, 0.06+/-0.07, 0.07+/-0.08, 0.05+/-0.07, 0.07+/-0.06, 0.07+/-0.06, 0.07+/-0.06, 0.07+/-0.06, 0.08+/-0.07, 0.08+/-0.07, 0.08+/-0.07, 0.08+/-0.07, 0.09+/-0.07, 0.09+/-0.08, 0.09+/-0.08, 0.09+/-0.08, 0.10+/-0.08, 0.10+/-0.09, 0.10+/-0.09, 0.11+/-0.09, 0.11+/-0.09, 0.11+/-0.10, 0.12+/-0.10, 0.12+/-0.10, 0.12+/-0.11, 0.13+/-0.11, 0.13+/-0.11, 0.13+/-0.12, 0.14+/-0.12, 0.14+/-0.12, 0.15+/-0.13, 0.15+/-0.13, 0.15+/-0.13, 0.16+/-0.14]
Length: 35, dtype: pint[CO2e * kt / gigawatt_hour]

So far so good. Down in _binop (within the _create_method function) we see that lvalues and rvalues have slightly different dtypes (they are not wrapped in pint[ ]):

lvalues =
<Quantity([nan nan 0.6361690482648334+/-0.02592885775735402
 0.6351290527622078+/-0.03311366698282926 0.5741215117186019
 0.5717541911066631 0.5701038996304524 0.5684583714983479
 0.5668175929615984 0.5651815503111365 0.5635502298774642
 0.5619236180305387 0.5603017011796584 0.5586844657733495
 0.5570718982992526 0.5554639852840103 0.5538607132931535
 0.5522620689309905 0.5506680388404943 0.5490786097031911
 0.547493768239049 0.5459135012063674 0.5443377954016657
 0.5427666376595738 0.5412000148527213 0.5396379138916285
 0.5380803217245964 0.5365272253375981 0.5349786117541702
 0.5334344680353037 0.5318947812793364 0.5303595386218453
 0.5288287272355384 0.5273023343301484 0.525780347152325], 'CO2e * kt / gigawatt_hour')>

rvalues =
<Quantity([nan nan 0.064401822057497+/-0.07063402694774396
 0.06695935231991397+/-0.0773048396844469
 0.048286521959792446+/-0.06828369660293009
 0.06759002009043427+/-0.05852837997842698
 0.0696177206931473+/-0.060284231377779794
 0.07170625231394172+/-0.062092758319113185
 0.07385743988335998+/-0.06395554106868657
 0.07607316307986078+/-0.06587420730074718
 0.0783553579722566+/-0.0678504335197696
 0.0807060187114243+/-0.06988594652536269
 0.08312719927276703+/-0.07198252492112357
 0.08562101525095003+/-0.07414200066875729
 0.08818964570847854+/-0.07636626068882
 0.0908353350797329+/-0.0786572485094846
 0.0935603951321249+/-0.08101696596476916
 0.09636720698608865+/-0.08344747494371224
 0.0992582231956713+/-0.0859508991920236
 0.10223596989154145+/-0.08852942616778431
 0.10530304898828768+/-0.09118530895281785
 0.10846214045793631+/-0.0939208682214024
 0.11171600467167442+/-0.09673849426804447
 0.11506748481182465+/-0.09964064909608582
 0.11851950935617939+/-0.10262986856896837
 0.12207509463686476+/-0.10570876462603744
 0.1257373474759707+/-0.10888002756481856
 0.12950946790024984+/-0.11214642839176311
 0.13339475193725733+/-0.11551082124351601
 0.13739659449537506+/-0.1189761458808215
 0.14151849233023628+/-0.12254543025724615
 0.1457640471001434+/-0.12622179316496354
 0.15013696851314767+/-0.13000844695991245
 0.15464107756854212+/-0.13390870036870983
 0.15928030989559838+/-0.13792596137977112], 'CO2e * kt / gigawatt_hour')> 

I remember reading that PintArrays are special things that can deal with units not wrapped in pint[ ]. However, when we go to initialize the PintArray, dtype comes in as <Unit('CO2e * kt / gigawatt_hour')>. This is not seen as a PintType, so an attempt is made to construct one (PintType(dtype)). The __new__ allocator of course does not find dtype (which is called units in this scope) to be an instance of PintType (that's what we are constructing), and it doesn't find it to be an instance of _Unit either. The cls._parse_dtype_strict(units) call then fails because units is not a str; it's a <class 'pint.util.Unit'>. It would happily construct a PintType for me from the name of units, with or without the pint[ ] decoration, just not from the object itself.
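
To make the failure concrete, here is a minimal sketch of the dispatch as I understand it. It uses stand-in classes only: Registry, _Unit as bound here, and make_dtype are hypothetical names for illustration, not the real pint or pint-pandas objects, and the sketch assumes each registry builds its own Unit class.

```python
# Stand-ins only: none of these are the real pint or pint-pandas classes.
class Registry:
    """Hypothetical registry that builds its own Unit class per instance."""

    def __init__(self):
        class Unit:
            def __init__(self, name):
                self.name = name

            def __str__(self):
                return self.name

        self.Unit = Unit


default_registry = Registry()
_Unit = default_registry.Unit  # the class the dtype constructor checks against


def make_dtype(units):
    """Hypothetical mirror of the constructor's dispatch on `units`."""
    if isinstance(units, _Unit):
        return f"pint[{units}]"
    if isinstance(units, str):
        # Accept the name with or without the pint[ ] decoration.
        if units.startswith("pint[") and units.endswith("]"):
            units = units[len("pint["):-1]
        return f"pint[{units}]"
    raise TypeError(f"cannot build a dtype from {type(units).__name__}")


# A unit created by a *different* registry is neither a str nor a _Unit.
other_registry = Registry()
foreign = other_registry.Unit("CO2e * kt / gigawatt_hour")

print(make_dtype(str(foreign)))  # the name works
try:
    make_dtype(foreign)          # the object itself does not
except TypeError as err:
    print("failed:", err)
```

Under these assumptions, the object is rejected only because its class was built by a registry other than the one the constructor knows about.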

I'd love to find a way to fix this gently. Thoughts?

MichaelTiemannOSC commented 1 year ago

Answering my own question: the gentle solution is to know that when switching registries, it is not enough to merely re-initialize Quantity; the other registry-bound classes must be rebound as well. This code from pint/pint/__init__.py shows which:

# Default Quantity, Unit and Measurement are the ones
# build in the default registry.
Quantity = UnitRegistry.Quantity
Unit = UnitRegistry.Unit
Measurement = UnitRegistry.Measurement
Context = UnitRegistry.Context
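
A rough sketch of why rebinding only Quantity leaves stale classes behind. Everything here is a stand-in: build_registry, bind_aliases, and the module namespace are hypothetical names, and only the four alias names come from the snippet above.

```python
from types import SimpleNamespace

# The four module-level aliases named in the snippet above.
ALIASES = ("Quantity", "Unit", "Measurement", "Context")


def build_registry(label):
    """Hypothetical registry: one fresh class per alias, tagged with its owner."""
    classes = {name: type(name, (), {"registry": label}) for name in ALIASES}
    return SimpleNamespace(**classes)


def bind_aliases(namespace, registry):
    """Point every alias in `namespace` at the classes of `registry`."""
    for name in ALIASES:
        setattr(namespace, name, getattr(registry, name))


# Stand-in for the package namespace, initially bound to the default registry.
module = SimpleNamespace()
bind_aliases(module, build_registry("default"))

# Switching registries but rebinding Quantity alone...
custom = build_registry("custom")
module.Quantity = custom.Quantity

# ...leaves the other aliases pointing at the old registry's classes.
stale = [name for name in ALIASES if getattr(module, name).registry != "custom"]
print("stale after partial switch:", stale)

# Rebinding all four together removes the mismatch.
bind_aliases(module, custom)
print("stale after full switch:",
      [name for name in ALIASES if getattr(module, name).registry != "custom"])
```

In this toy model any isinstance check against a stale alias would reject objects from the new registry, which matches the symptom described in the first comment.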