hgrecco / pint

Operate and manipulate physical quantities in Python
http://pint.readthedocs.org/

Pint+Pandas+Uncertainties...how would you ... #1605

Open MichaelTiemannOSC opened 2 years ago

MichaelTiemannOSC commented 2 years ago

Discussed in https://github.com/hgrecco/pint/discussions/1604

Originally posted by **MichaelTiemannOSC** October 10, 2022

I have spent the past year working on a tool that integrates production and emissions data from corporates to produce a temperature score based on an Implied Temperature Rise model originating from work by the Bank of England. Pint has been a star of this project. Recently I added uncertainties as an enhancement. I started down the track of using Measurements instead of Quantity, but a combination of factors made me think the better solution was simply to create Quantity objects with ufloat values as magnitudes. It seems to work nicely and simply.

But now I'm stuck: I rely on renormalizing columns with .astype('pint[some esg unit]'), and that has now stopped working, with this error message:

*** TypeError: can't convert an affine function () to float; use x.nominal_value

If I build everything around Measurements, then the nominal values and the error terms stay very much in their own lanes, and I might be able to get away with the kinds of renormalization the code depends on. But Measurements have other problems (np.isnan and unp.isnan are cool with Quantities, but not with Measurements). High-level thoughts on how to start knocking this problem down?

Here's some sample code I'm playing with. I've commented out lines that throw errors.

import math
import numpy as np
import pandas as pd
import uncertainties
from uncertainties import unumpy as unp
from uncertainties import ufloat
from uncertainties import umath

from pint import set_application_registry, get_application_registry
from openscm_units import unit_registry
ureg = unit_registry
set_application_registry(ureg)

import pint

pint.Quantity = Q_ = ureg.Quantity
pint.Measurement = M_ = ureg.Measurement

from pint_pandas import PintArray, PintType
PintType.ureg = get_application_registry()
PA_ = PintArray

ureg.define("CO2e = CO2 = CO2eq = CO2_eq")

ff = pd.DataFrame({'company_id':['C1', 'C1', 'C1', 'C2', 'C2', 'C3'],
                   'metric': ['s1', 's1', 's1', 's1', 's1', 's1'],
                   2016: [1.0, 2.0, 3.0, 10.0, 10.0, 5.0]})

print(f"First example: dtype=float;\nff = {ff}")
print(f"ff.groupby(mean) = {ff.groupby(by=['company_id', 'metric']).mean()}")
print(f"ff.groupby(std) = {ff.groupby(by=['company_id', 'metric']).agg(lambda x: x.std())}")

gg = pd.DataFrame({'company_id':['C1', 'C1', 'C1', 'C2', 'C2', 'C3'],
                   'metric': ['s1', 's1', 's1', 's1', 's1', 's1'],
                   2016: [Q_(1.0, 't CO2'),
                          Q_(2.0, 't CO2'),
                          Q_(3.0, 't CO2'),
                          Q_(10.0, 't CO2'),
                          Q_(10.0, 't CO2'),
                          Q_(5.0, 't CO2')]})

print(f"Second exmaple: Quantity, dtype='pint[t CO2]';\nxx = {gg}")
# gg[2016] = gg[2016].astype('pint[t CO2]')                                                                                                                                                                                                                                            
print(f"gg[2016] = {gg[2016]}")
gg_g = gg.groupby(by=['company_id', 'metric'])[2016]
# print(gg_g.agg(lambda x: x.astype('pint[t CO2]').mean()))                                                                                                                                                                                                                            
# print(gg_g.agg(lambda x: x.astype('pint[t CO2]').std()))                                                                                                                                                                                                                             

qq = pd.DataFrame({'company_id':['C1', 'C1', 'C1', 'C2', 'C2', 'C3'],
                   'metric': ['s1', 's1', 's1', 's1', 's1', 's1'],
                   2016: [Q_(ufloat(1.0, 0.1), 't CO2'),
                          Q_(ufloat(2.0, 0.1), 't CO2'),
                          Q_(ufloat(3.0, 0.1), 't CO2'),
                          Q_(ufloat(10.0, 0.1), 't CO2'),
                          Q_(ufloat(10.0, 0.1), 't CO2'),
                          Q_(ufloat(5.0, 0.1), 't CO2')]})

print(f"Third exmaple: Quantity+/-, dtype='pint[t CO2]';\nxx = {qq}")
# qq[2016] = qq[2016].astype('pint[t CO2]')                                                                                                                                                                                                                                            
print(f"qq[2016] = {qq[2016]}")
qq_g = qq.groupby(by=['company_id', 'metric'])[2016]
# print(qq_g.agg(lambda x: x.astype('pint[t CO2]').mean()))                                                                                                                                                                                                                            
# print(qq_g.agg(lambda x: x.astype('pint[t CO2]').std()))                                                                                                                                                                                                                             

mm = pd.DataFrame({'company_id':['C1', 'C1', 'C1', 'C2', 'C2', 'C3'],
                   'metric': ['s1', 's1', 's1', 's1', 's1', 's1'],
                   2016: [M_(1.0, 0.1, 't CO2'),
                          M_(2.0, 0.1, 't CO2'),
                          M_(3.0, 0.1, 't CO2'),
                          M_(10.0, 0.1, 't CO2'),
                          M_(10.0, 0.1, 't CO2'),
                          M_(5.0, 0.1, 't CO2')]})

print(f"Fourth exmaple: Measurements, dtype='pint[t CO2]';\nxx = {mm}")
# mm[2016] = mm[2016].astype('pint[t CO2]')                                                                                                                                                                                                                                            
print(f"mm[2016] = {mm[2016]}")
mm_g = mm.groupby(by=['company_id', 'metric'])[2016]
# print(mm_g.agg(lambda x: x.astype('pint[t CO2]').mean()))                                                                                                                                                                                                                            
# print(mm_g.agg(lambda x: x.astype('pint[t CO2]').std()))      
andrewgsavage commented 2 years ago

> High-level thoughts on how to start knocking this problem down?

Measurements aren't supported by pint-pandas. You'd be looking at writing a new module, with two arrays (one for the values, one for the uncertainties) backing a MeasurementArray, instead of the single array backing a PintArray.

I think uncertainties could be a pandas extension type of their own, without the units. I've not seen anyone pursue this. That would probably be the first step.

Using two columns, possibly with a MultiIndex, is the best approach at the moment (see the sketch below).
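
A minimal sketch of that interim approach (the column layout and names here are just one possible convention, not an established pint-pandas API):

import pandas as pd

# Nominal values and uncertainties kept as separate columns under a MultiIndex;
# each component column can then be (re)normalized independently.
df = pd.DataFrame({
    ("s1", "nominal"): [1.0, 2.0, 3.0],
    ("s1", "std_dev"): [0.1, 0.1, 0.1],
})
df.columns = pd.MultiIndex.from_tuples(df.columns, names=["metric", "component"])

# e.g. df[("s1", "nominal")] = df[("s1", "nominal")].astype("pint[t CO2]")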

MichaelTiemannOSC commented 2 years ago

With this NEP as background, let's look at the lines of code in pint_array.py (init of PintArray) causing grief:

        if not np.issubdtype(values.dtype, np.floating):
            warnings.warn(
                f"pint-pandas does not support magnitudes of {values.dtype}. Converting magnitudes to float.",
                category=RuntimeWarning,
            )
            values = values.astype(float)

Just for grins I visited this code when I had some values that were a mix of pure NaN Quantities and some Quantities with uncertainties (<class 'uncertainties.core.Variable'> and <class 'uncertainties.core.AffineScalarFunc'>). To my surprise, the system was happy to cast this using

from uncertainties.core import AffineScalarFunc
values.astype(AffineScalarFunc)

from uncertainties.core import Variable
values.astype(Variable)

You know this code far better than I do, but I don't see why we can't just let the promotion rules do their job, with places like this asking for the type to promote to (per the NEP) and then doing that cast, rather than casting to float. Wouldn't it be great if it were that easy ;-)
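
For example, instead of hard-coding float, something along these lines (a sketch only, assuming NumPy's promotion rules give object dtype for object arrays):

import numpy as np

def cast_to_common_dtype(values, target=np.float64):
    # Ask NumPy for the promoted dtype instead of forcing float;
    # an object array of AffineScalarFunc promotes to object and
    # passes through astype unchanged.
    common = np.result_type(values.dtype, target)
    return values.astype(common)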

MichaelTiemannOSC commented 2 years ago

These two bits from the NumPy documentation offer encouragement to the idea that PintArrays need not be bound quite so tightly to floats:

  1. Complex numbers are a thing
  2. ndarrays are made for subclassing

I hold out hope that we will find meaningful quantities of things represented properly by complex (or beyond) numbers. I hope we can use PintArrays to put those into DataFrames! This will make ufloat-based Quantities seem trivial in comparison.

MichaelTiemannOSC commented 2 years ago

Looking at Quantity (which is quite ducky!), it appears we could create a uQuantity that inherits from Quantity and adds the methods for uncertainties, thus ducking the issue of modifying base Pint functionality. People who need them could access nominal_value and std_dev from uQuantity objects. And maybe also create a uPintType to handle NaNs with unp.isnan (instead of np.isnan). Right track or wrong?
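
Roughly like this (a sketch only; the uQuantity name and details are hypothetical, and real subclassing of Quantity needs care with the registry):

from uncertainties import unumpy as unp
from pint import get_application_registry

ureg = get_application_registry()

class uQuantity(ureg.Quantity):
    # Hypothetical subclass exposing the uncertainties API on the quantity.
    @property
    def nominal_value(self):
        return self.__class__(unp.nominal_values(self.magnitude), self.units)

    @property
    def std_dev(self):
        return self.__class__(unp.std_devs(self.magnitude), self.units)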

andrewgsavage commented 2 years ago

Floats are used because they can store NaN, while int and some other dtypes can't. That was causing issues where the magnitude would get cast to another dtype, so it was easier to convert everything to float.

MichaelTiemannOSC commented 2 years ago

Good to know the reason! What immediately comes to mind is that once upon a time NaNs were a handy stand-in for NULL in the numeric world, and there was no such thing in the integer world. But since pandas 1.0.0 there have been nullable integers (https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html). So perhaps there's an opportunity to make nullability part of the duck-check, and relax the reliance on floats. Having taken a quick look at the usual suspects, things are much more duck-like than float-like throughout (and Monty Python fans will note that ducks do float).
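
For instance:

import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")  # nullable integer dtype
print(s)         # the missing slot shows as <NA>; no cast to float needed
print(s.isna())  # False, False, True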

So the big question becomes whether it's better to add uncertainties via subclassing (and whether the derived class can be fairly trivial, just supporting nominal_value, std_dev, and unp.isnan).

MichaelTiemannOSC commented 2 years ago

Alas, https://github.com/pandas-dev/pandas/pull/48971 shows that we are not yet dealing with settled law. If ever there were a place to deal properly with pd.NA in Pint, it should be Pint-Pandas, no? And if so, I should change all my code to work with PintType, not Quantity (the latter of which will still do the heavy lifting for me under the hood).

hgrecco commented 2 years ago

It is worth noting that Measurements predate a lot of the newest Pint features, including the pandas integration, and very much need an overhaul that takes those features into account. I would be very happy if we can push this forward.

MichaelTiemannOSC commented 2 years ago

Based on my research, Measurements could/should be obsoleted by fully supporting a duck-friendly concept of uncertainties. I'll break a few eggs when I get the chance.

hgrecco commented 2 years ago

True, and that would be great. Still, parsing and formatting will need to be supported by pint.

MichaelTiemannOSC commented 2 years ago

Thus far I'm extremely happy with the results I get from this:

diff --git a/pint_pandas/pint_array.py b/pint_pandas/pint_array.py
index e5e0e9b..6f73fe2 100644
--- a/pint_pandas/pint_array.py
+++ b/pint_pandas/pint_array.py
@@ -216,13 +216,14 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
         if not isinstance(values, np.ndarray):
             values = np.array(values, copy=copy)
             copy = False
-        if not np.issubdtype(values.dtype, np.floating):
-            warnings.warn(
-                f"pint-pandas does not support magnitudes of {values.dtype}. Converting magnitudes to float.",
-                category=RuntimeWarning,
-            )
-            values = values.astype(float)
-            copy = False
+        # Remove this ancient constraint...Ducks Unlimited!!
+        # if not np.issubdtype(values.dtype, np.floating):
+        #     warnings.warn(
+        #         f"pint-pandas does not support magnitudes of {values.dtype}. Converting magnitudes to float.",
+        #         category=RuntimeWarning,
+        #     )
+        #     values = values.astype(float)
+        #     copy = False
         if copy:
             values = values.copy()
         self._data = values
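
With that guard removed, an object-dtype array of ufloats can back a PintArray directly. For example (behaviour of the patched fork, and assuming the registry setup from the first post so that 't CO2' resolves):

import numpy as np
from uncertainties import ufloat
from pint_pandas import PintArray

values = np.array([ufloat(1.0, 0.1), ufloat(2.0, 0.1)], dtype=object)
pa = PintArray(values, dtype="pint[t CO2]")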

MichaelTiemannOSC commented 2 years ago

Here's my currently ugly code for handling the addition of uncertainties to the Pint parser (please ignore the calls to breakpoint() I've inadvertently left in...):

diff --git a/pint/pint_eval.py b/pint/pint_eval.py
index 2054260..c48f4f6 100644
--- a/pint/pint_eval.py
+++ b/pint/pint_eval.py
@@ -12,11 +12,13 @@ from __future__ import annotations
 import operator
 import token as tokenlib
 import tokenize
+from uncertainties import ufloat

 from .errors import DefinitionSyntaxError

 # For controlling order of operations
 _OP_PRIORITY = {
+    "+/-": 4,
     "**": 3,
     "^": 3,
     "unary": 2,
@@ -30,6 +32,10 @@ _OP_PRIORITY = {
 }

+def _ufloat(left, right):
+    return ufloat(left, right)
+
+
 def _power(left, right):
     from . import Quantity
     from .compat import is_duck_array
@@ -46,6 +52,7 @@ def _power(left, right):

 _BINARY_OPERATOR_MAP = {
+    "+/-": _ufloat,
     "**": _power,
     "*": operator.mul,
     "": operator.mul,  # operator for implicit ops
@@ -117,6 +124,7 @@ class EvalTreeNode:
             # unary operator
             op_text = self.operator[1]
             if op_text not in un_op:
+                breakpoint()
                 raise DefinitionSyntaxError('missing unary operator "%s"' % op_text)
             return un_op[op_text](self.left.evaluate(define_op, bin_op, un_op))
         else:
@@ -163,6 +171,12 @@ def build_eval_tree(
         tokens = list(tokens)

     result = None
+    
+    def _number_or_nan(token):
+        if (token.type==tokenlib.NUMBER
+            or (token.type==tokenlib.NAME and token.string=='nan')):
+            return True
+        return False

     while True:
         current_token = tokens[index]
@@ -182,18 +196,72 @@ def build_eval_tree(
                     # parenthetical group ending, but we need to close sub-operations within group
                     return result, index - 1
             elif token_text == "(":
-                # gather parenthetical group
-                right, index = build_eval_tree(
-                    tokens, op_priority, index + 1, 0, token_text
-                )
-                if not tokens[index][1] == ")":
-                    raise DefinitionSyntaxError("weird exit from parentheses")
-                if result:
-                    # implicit op with a parenthetical group, i.e. "3 (kg ** 2)"
-                    result = EvalTreeNode(left=result, right=right)
+                # a ufloat is of the form `( nominal_value + / - std ) possible_e_notation` and parses as a NUMBER
+                # alas, we cannot simply consume the nominal_value and then see the +/- operator, because naive
+                # parsing on the nominal_value thinks it needs to eval the + as part of the nominal_value.
+                if (index+6 < len(tokens)
+                    and _number_or_nan(tokens[index+1])
+                    and tokens[index+2].string=='+'
+                    and tokens[index+3].string=='/'
+                    and tokens[index+4].string=='-'
+                    and _number_or_nan(tokens[index+5])):
+                    # breakpoint()
+                    # get nominal_value
+                    left, _ = build_eval_tree(
+                        # This should feed the parser only a single token--the number representing the nominal_value
+                        [tokens[index+1], tokens[-1]], op_priority, 0, 0, tokens[index+1].string
+                    )
+                    plus_minus_line = tokens[index].line[tokens[index].start[1]:tokens[index+6].end[1]]
+                    plus_minus_start = tokens[index+2].start
+                    plus_minus_end = tokens[index+4].end
+                    plus_minus_operator = tokenize.TokenInfo(type=tokenlib.OP, string='+/-', start=plus_minus_start, end=plus_minus_end, line=plus_minus_line)
+                    remaining_line = tokens[index].line[tokens[index+6].end[1]:]
+
+                    right, _ = build_eval_tree(
+                        [tokens[index+5], tokens[-1]], op_priority, 0, 0, tokens[index+5].string
+                    )
+                    if tokens[index+6].string==')':
+                        # consume the uncertainty number seen thus far
+                        index += 6
+                    else:
+                        raise DefinitionSyntaxError("weird exit from ufloat construction")
+                    # now look for possible scientific e-notation
+                    if (index+4 < len(tokens)
+                        and tokens[index+1].string=='e'
+                        and tokens[index+2].string in ['+', '-']
+                        and tokens[index+3].type==tokenlib.NUMBER):
+                        # There may be more NUMBERS that follow because the tokenizer is lost.
+                        # So pick them all up
+                        for exp_number_end in range(index+4, len(tokens)):
+                            if tokens[exp_number_end].type != tokenlib.NUMBER:
+                                break
+                        e_notation_line = remaining_line[:tokens[exp_number_end].start[1]-tokens[index+1].start[1]]
+                        exp_number = '1.0e' + ''.join([digit.string for digit in tokens[index+3:exp_number_end]])
+                        exp_number_token = tokenize.TokenInfo(type=tokenlib.NUMBER, string=exp_number, start=(1, 0), end=(1, len(exp_number)), line=exp_number)
+                        e_notation_operator = tokenize.TokenInfo(type=tokenlib.OP, string='*', start=(1, 0), end=(1, 1), line='*')
+                        e_notation_scale, _ = build_eval_tree([exp_number_token, tokens[-1]], op_priority, 0, 0, tokens[exp_number_end].string)
+                        scaled_left = EvalTreeNode(left, e_notation_operator, e_notation_scale)
+                        scaled_right = EvalTreeNode(right, e_notation_operator, e_notation_scale)
+                        result = EvalTreeNode(scaled_left, plus_minus_operator, scaled_right)
+                        index = exp_number_end
+                        # We know we are not at an ENDMARKER here
+                        continue
+                    else:
+                        result = EvalTreeNode(left, plus_minus_operator, right)
+                        # We can fall through...index+=1 operation will consume ')'
                 else:
-                    # get first token
-                    result = right
+                    # gather parenthetical group
+                    right, index = build_eval_tree(
+                        tokens, op_priority, index + 1, 0, token_text
+                    )
+                    if not tokens[index][1] == ")":
+                        raise DefinitionSyntaxError("weird exit from parentheses")
+                    if result:
+                        # implicit op with a parenthetical group, i.e. "3 (kg ** 2)"
+                        result = EvalTreeNode(left=result, right=right)
+                    else:
+                        # get first token
+                        result = right
             elif token_text in op_priority:
                 if result:
                     # equal-priority operators are grouped in a left-to-right order,
@@ -221,7 +289,33 @@ def build_eval_tree(
                     )
                     result = EvalTreeNode(left=right, operator=current_token)
         elif token_type == tokenlib.NUMBER or token_type == tokenlib.NAME:
-            if result:
+            # a ufloat could be naked, meaning  `nominal_value + / - std` and parses as a NUMBER
+            # alas, we cannot simply consume the nominal_value and then see the +/- operator, because naive
+            # parsing on the nominal_value thinks it needs to eval the + as part of the nominal_value.
+            if (index+4 < len(tokens)
+                and _number_or_nan(tokens[index])
+                and tokens[index+1].string=='+'
+                and tokens[index+2].string=='/'
+                and tokens[index+3].string=='-'
+                and _number_or_nan(tokens[index+4])):
+                # The +/- operator binds tightest, so we don't need to end a previous binop
+                if tokens[index+5].type==tokenlib.NUMBER:
+                    breakpoint()
+                # get nominal_value
+                left = EvalTreeNode(left=current_token)
+                plus_minus_line = tokens[index].line[tokens[index].start[1]:tokens[index+4].end[1]]
+                plus_minus_start = tokens[index+1].start
+                plus_minus_end = tokens[index+3].end
+                plus_minus_operator = tokenize.TokenInfo(type=tokenlib.OP, string='+/-', start=plus_minus_start, end=plus_minus_end, line=plus_minus_line)
+                remaining_line = tokens[index].line[tokens[index+4].end[1]:]
+
+                right, _ = build_eval_tree(
+                    [tokens[index+4], tokens[-1]], op_priority, 0, 0, tokens[index+4].string
+                )
+                result = EvalTreeNode(left, plus_minus_operator, right)
+                index += 4
+                continue
+            elif result:
                 # tokens with an implicit operation i.e. "1 kg"
                 if op_priority[""] <= op_priority.get(prev_op, -1):
                     # previous operator is higher priority than implicit, so end

I wonder whether a better approach would be to try to get a PEP moving that accepted the +/- formatting of uncertainties so that they could be parsed more naturally. Or should I customize at the tokenize library level?
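
For reference, here is why +/- needs this treatment at all: the stdlib tokenizer splits it into three separate OP tokens, so the parser sees +, /, - rather than one operator:

import io
import tokenize

src = "(1.0 +/- 0.1) t CO2"
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# ... OP '(' NUMBER '1.0' OP '+' OP '/' OP '-' NUMBER '0.1' OP ')' ...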

MichaelTiemannOSC commented 2 years ago

Here's my pint fork with branch for parsing uncertainties: https://github.com/MichaelTiemannOSC/pint/tree/parse-uncertainties

And my pint-pandas fork that makes things more ducky: https://github.com/MichaelTiemannOSC/pint-pandas/tree/ducks-unlimited

hgrecco commented 2 years ago

I commented on parsing uncertainties. I would leave the pint-pandas comments to @andrewgsavage.

varchasgopalaswamy commented 1 year ago

So I tried something similar. I wanted to use auto-differentiation so you could propagate uncertainties through arbitrary functions, a little like how pint propagates units through most (all?) numpy operations. I was only partially successful, but I think the basic idea should work, especially if you are better at Python than I am! My attempt is here: https://github.com/varchasgopalaswamy/AutoUncertainties
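
The core idea in miniature (a sketch with a central difference standing in for true auto-differentiation):

import math

def propagate(f, x, sigma, eps=1e-8):
    # First-order (linearized) propagation: sigma_f ~= |f'(x)| * sigma_x.
    dfdx = (f(x + eps) - f(x - eps)) / (2 * eps)
    return f(x), abs(dfdx) * sigma

nominal, std = propagate(math.sin, 1.0, 0.1)  # ~0.841 +/- ~0.054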