MichaelTiemannOSC opened 2 years ago
High-level thoughts on how to start knocking this problem down?

Measurements aren't supported by pint-pandas. You'd be looking at writing a new module, with two arrays (one for the values and one for the uncertainties) backing a MeasurementArray, instead of the single array backing a PintArray.
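For concreteness, here is a bare-bones sketch of that "two backing arrays" idea. This is not an actual pint-pandas API; the class name and attributes are illustrative, and a real pandas ExtensionArray would need dtype machinery and many more methods:

```python
import numpy as np

class MeasurementArray:
    """Toy container pairing nominal values with their uncertainties."""

    def __init__(self, values, uncertainties):
        self._value = np.asarray(values, dtype=float)
        self._uncertainty = np.asarray(uncertainties, dtype=float)
        assert self._value.shape == self._uncertainty.shape

    def __len__(self):
        return len(self._value)

    def __getitem__(self, i):
        # return (nominal, std) pairs, mirroring uncertainties' Variable
        return self._value[i], self._uncertainty[i]
```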
I think uncertainties could be a potential pandas extension type on their own, without the units. I've not seen anyone pursue this. That would probably be the first step.
Using two columns, possibly using a MultiIndex, is the best approach at the moment.
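A minimal sketch of that two-column workaround (the column names "distance", "value", and "uncertainty" are illustrative):

```python
import numpy as np
import pandas as pd

# Nominal values and uncertainties side by side under one MultiIndex.
df = pd.DataFrame(
    {
        ("distance", "value"): [1.0, 2.0, np.nan],
        ("distance", "uncertainty"): [0.1, 0.2, np.nan],
    }
)
df.columns.names = ["quantity", "component"]
print(df["distance"])             # both components as a sub-frame
print(df[("distance", "value")])  # just the nominal values
```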
With this NEP as background, let's look at the lines of code in pint_array.py (the init of PintArray) that are causing grief:
if not np.issubdtype(values.dtype, np.floating):
    warnings.warn(
        f"pint-pandas does not support magnitudes of {values.dtype}. Converting magnitudes to float.",
        category=RuntimeWarning,
    )
    values = values.astype(float)
Just for grins, I visited this code when I had values that were a mix of pure-NaN Quantities and Quantities with uncertainties (<class 'uncertainties.core.Variable'> and <class 'uncertainties.core.AffineScalarFunc'>). To my surprise, the system was happy to cast these using
from uncertainties.core import AffineScalarFunc
values.astype(AffineScalarFunc)
from uncertainties.core import Variable
values.astype(Variable)
You know this code far better than I do, but I don't see why we cannot just let the promotion rules do their job, with places like this asking for the type to promote to (according to the NEP) and then doing that cast (rather than always casting to float). Wouldn't it be great if it were that easy ;-)
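As an illustration of "ask the promotion rules" rather than forcing float, NumPy's np.result_type applies exactly that machinery and could supply the cast target. A sketch, not the pint-pandas implementation:

```python
import numpy as np

values = np.array([1, 2, 3])            # integer magnitudes
other = np.array([0.5])                 # a float operand
target = np.result_type(values, other)  # promotion rules pick float64
print(target)                           # float64
print(values.astype(target))            # [1. 2. 3.] -- no blanket float cast needed
```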
These two bits from the NumPy documentation offer encouragement for the idea that PintArrays need not be bound quite so tightly to floats. I hold out hope that we will find meaningful quantities of things properly represented by complex (or beyond) numbers, and I hope we can use PintArrays to put those into DataFrames! That will make ufloat-based Quantities seem trivial in comparison.
Looking at Quantity (which is quite ducky!), it appears we could create a uQuantity that inherits from Quantity and adds the methods for uncertainties, thus ducking the issue of modifying base Pint functionality. People who need them could access nominal_value and std_dev from uQuantity objects. And maybe create a uPintType to handle NaNs with unp.isnan (instead of np.isnan). Wrong or right track?
Floats are used because they can store NaN, while int and some other dtypes can't. This was causing issues where the magnitude would be cast to another dtype, so it was easier to convert to float.
Good to know the reason! What immediately comes to mind is that once upon a time NaN was a handy stand-in for NULL in the floating-point world, and there was no such thing in the integer world. But since pandas 1.0.0 there have been nullable integers (https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html). So perhaps there's an opportunity to make nullability part of the duck-check and drop the reliance on floats. Having taken a quick look at the usual suspects, things are much more duck-like than float-like throughout (and Monty Python fans will note that ducks do float).
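For reference, the nullable integer dtype mentioned above holds pd.NA in an integer-backed column without falling back to float:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")  # capital I: nullable extension dtype
print(s.dtype)   # Int64, not float64
print(s.isna())  # the None becomes pd.NA, with no float coercion
```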
So the big question becomes whether it is better to add uncertainties via subclassing (and whether the derived class can be fairly trivial, just supporting nominal_value, std_dev, and unp.isnan).
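To make the question concrete, here is a minimal sketch (not the proposed implementation) of how far plain ducks already go: a stock pint Quantity wrapping an uncertainties uarray, with the three accessors a trivial uQuantity subclass would expose:

```python
import pint
from uncertainties import unumpy as unp

ureg = pint.UnitRegistry()
# Object-dtype magnitude array of uncertainties Variables, with one NaN.
q = ureg.Quantity(unp.uarray([1.0, 2.0, float("nan")], [0.1, 0.2, 0.0]), "m")

print(unp.nominal_values(q.magnitude))  # the nominal_value accessor
print(unp.std_devs(q.magnitude))        # the std_dev accessor
print(unp.isnan(q.magnitude))           # unp.isnan, where np.isnan would fail
```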
Alas, https://github.com/pandas-dev/pandas/pull/48971 shows that we are not yet dealing with settled law. If ever there were a place to deal properly with pd.NA in Pint, it should be Pint-Pandas, no? And if so, I should change all my code to work with PintType rather than Quantity (the latter of which will still do the heavy lifting under the hood).
It is worth noting that Measurement predates a lot of the newest Pint features, including the pandas integration, and very much needs an overhaul that takes them into account. I would be very happy if we can push this forward.
Based on my research, Measurements could/should be obsoleted by fully supporting a duck-friendly concept of uncertainties. I'll break a few eggs when I get the chance.
True, and that would be great. Still, parsing and formatting will need to be supported by Pint.
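For context, Pint's existing Measurement type (built on the uncertainties package) already formats in the `(nominal +/- std) unit` shape that the parser work below tries to round-trip:

```python
import pint  # requires the `uncertainties` package for Measurement support

ureg = pint.UnitRegistry()
m = ureg.Measurement(2.0, 0.05, "m")  # nominal value, standard deviation, units
print(m)          # something like: (2.00 +/- 0.05) meter
print(f"{m:~P}")  # compact/pretty form, e.g. (2.00±0.05) m
```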
Thus far I'm extremely happy with the results I get from this:
diff --git a/pint_pandas/pint_array.py b/pint_pandas/pint_array.py
index e5e0e9b..6f73fe2 100644
--- a/pint_pandas/pint_array.py
+++ b/pint_pandas/pint_array.py
@@ -216,13 +216,14 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
         if not isinstance(values, np.ndarray):
             values = np.array(values, copy=copy)
             copy = False
-        if not np.issubdtype(values.dtype, np.floating):
-            warnings.warn(
-                f"pint-pandas does not support magnitudes of {values.dtype}. Converting magnitudes to float.",
-                category=RuntimeWarning,
-            )
-            values = values.astype(float)
-            copy = False
+        # Remove this ancient constraint...Ducks Unlimited!!
+        # if not np.issubdtype(values.dtype, np.floating):
+        #     warnings.warn(
+        #         f"pint-pandas does not support magnitudes of {values.dtype}. Converting magnitudes to float.",
+        #         category=RuntimeWarning,
+        #     )
+        #     values = values.astype(float)
+        #     copy = False
         if copy:
             values = values.copy()
         self._data = values
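With that constraint removed, the idea (sketched here against the author's ducks-unlimited branch, not released pint-pandas) is that an object-dtype array of ufloats can back a PintArray directly:

```python
import numpy as np
import pint_pandas
from uncertainties import ufloat

# Object-dtype magnitudes; stock pint-pandas would warn and cast these to float.
values = np.array([ufloat(1.0, 0.1), ufloat(2.0, 0.2)], dtype=object)
pa = pint_pandas.PintArray(values, dtype="pint[m]")
print(pa)  # magnitudes stay as uncertainties objects
```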
Here's my currently ugly code for handling the addition of uncertainties to the Pint parser (please ignore the calls to breakpoint() I've inadvertently left in...):
diff --git a/pint/pint_eval.py b/pint/pint_eval.py
index 2054260..c48f4f6 100644
--- a/pint/pint_eval.py
+++ b/pint/pint_eval.py
@@ -12,11 +12,13 @@ from __future__ import annotations
 import operator
 import token as tokenlib
 import tokenize
+from uncertainties import ufloat
 
 from .errors import DefinitionSyntaxError
 
 # For controlling order of operations
 _OP_PRIORITY = {
+    "+/-": 4,
     "**": 3,
     "^": 3,
     "unary": 2,
@@ -30,6 +32,10 @@ _OP_PRIORITY = {
 }
 
 
+def _ufloat(left, right):
+    return ufloat(left, right)
+
+
 def _power(left, right):
     from . import Quantity
     from .compat import is_duck_array
@@ -46,6 +52,7 @@ def _power(left, right):
 
 _BINARY_OPERATOR_MAP = {
+    "+/-": _ufloat,
     "**": _power,
     "*": operator.mul,
     "": operator.mul,  # operator for implicit ops
@@ -117,6 +124,7 @@ class EvalTreeNode:
             # unary operator
             op_text = self.operator[1]
             if op_text not in un_op:
+                breakpoint()
                 raise DefinitionSyntaxError('missing unary operator "%s"' % op_text)
             return un_op[op_text](self.left.evaluate(define_op, bin_op, un_op))
         else:
@@ -163,6 +171,12 @@ def build_eval_tree(
         tokens = list(tokens)
 
     result = None
+
+    def _number_or_nan(token):
+        if (token.type == tokenlib.NUMBER
+                or (token.type == tokenlib.NAME and token.string == 'nan')):
+            return True
+        return False
 
     while True:
         current_token = tokens[index]
@@ -182,18 +196,72 @@ def build_eval_tree(
                 # parenthetical group ending, but we need to close sub-operations within group
                 return result, index - 1
             elif token_text == "(":
-                # gather parenthetical group
-                right, index = build_eval_tree(
-                    tokens, op_priority, index + 1, 0, token_text
-                )
-                if not tokens[index][1] == ")":
-                    raise DefinitionSyntaxError("weird exit from parentheses")
-                if result:
-                    # implicit op with a parenthetical group, i.e. "3 (kg ** 2)"
-                    result = EvalTreeNode(left=result, right=right)
+                # a ufloat is of the form `( nominal_value + / - std ) possible_e_notation` and parses as a NUMBER
+                # alas, we cannot simply consume the nominal_value and then see the +/- operator, because naive
+                # parsing of the nominal_value thinks it needs to eval the + as part of the nominal_value.
+                if (index + 6 < len(tokens)
+                        and _number_or_nan(tokens[index + 1])
+                        and tokens[index + 2].string == '+'
+                        and tokens[index + 3].string == '/'
+                        and tokens[index + 4].string == '-'
+                        and _number_or_nan(tokens[index + 5])):
+                    # breakpoint()
+                    # get nominal_value
+                    left, _ = build_eval_tree(
+                        # This should feed the parser only a single token--the number representing the nominal_value
+                        [tokens[index + 1], tokens[-1]], op_priority, 0, 0, tokens[index + 1].string
+                    )
+                    plus_minus_line = tokens[index].line[tokens[index].start[1]:tokens[index + 6].end[1]]
+                    plus_minus_start = tokens[index + 2].start
+                    plus_minus_end = tokens[index + 4].end
+                    plus_minus_operator = tokenize.TokenInfo(type=tokenlib.OP, string='+/-', start=plus_minus_start, end=plus_minus_end, line=plus_minus_line)
+                    remaining_line = tokens[index].line[tokens[index + 6].end[1]:]
+
+                    right, _ = build_eval_tree(
+                        [tokens[index + 5], tokens[-1]], op_priority, 0, 0, tokens[index + 5].string
+                    )
+                    if tokens[index + 6].string == ')':
+                        # consume the uncertainty number seen thus far
+                        index += 6
+                    else:
+                        raise DefinitionSyntaxError("weird exit from ufloat construction")
+                    # now look for possible scientific e-notation
+                    if (index + 4 < len(tokens)
+                            and tokens[index + 1].string == 'e'
+                            and tokens[index + 2].string in ['+', '-']
+                            and tokens[index + 3].type == tokenlib.NUMBER):
+                        # There may be more NUMBERs that follow because the tokenizer is lost.
+                        # So pick them all up
+                        for exp_number_end in range(index + 4, len(tokens)):
+                            if tokens[exp_number_end].type != tokenlib.NUMBER:
+                                break
+                        e_notation_line = remaining_line[:tokens[exp_number_end].start[1] - tokens[index + 1].start[1]]
+                        exp_number = '1.0e' + ''.join([digit.string for digit in tokens[index + 3:exp_number_end]])
+                        exp_number_token = tokenize.TokenInfo(type=tokenlib.NUMBER, string=exp_number, start=(1, 0), end=(1, len(exp_number)), line=exp_number)
+                        e_notation_operator = tokenize.TokenInfo(type=tokenlib.OP, string='*', start=(1, 0), end=(1, 1), line='*')
+                        e_notation_scale, _ = build_eval_tree([exp_number_token, tokens[-1]], op_priority, 0, 0, tokens[exp_number_end].string)
+                        scaled_left = EvalTreeNode(left, e_notation_operator, e_notation_scale)
+                        scaled_right = EvalTreeNode(right, e_notation_operator, e_notation_scale)
+                        result = EvalTreeNode(scaled_left, plus_minus_operator, scaled_right)
+                        index = exp_number_end
+                        # We know we are not at an ENDMARKER here
+                        continue
+                    else:
+                        result = EvalTreeNode(left, plus_minus_operator, right)
+                        # We can fall through...index += 1 operation will consume ')'
                 else:
-                    # get first token
-                    result = right
+                    # gather parenthetical group
+                    right, index = build_eval_tree(
+                        tokens, op_priority, index + 1, 0, token_text
+                    )
+                    if not tokens[index][1] == ")":
+                        raise DefinitionSyntaxError("weird exit from parentheses")
+                    if result:
+                        # implicit op with a parenthetical group, i.e. "3 (kg ** 2)"
+                        result = EvalTreeNode(left=result, right=right)
+                    else:
+                        # get first token
+                        result = right
             elif token_text in op_priority:
                 if result:
                     # equal-priority operators are grouped in a left-to-right order,
@@ -221,7 +289,33 @@ def build_eval_tree(
                 )
                 result = EvalTreeNode(left=right, operator=current_token)
         elif token_type == tokenlib.NUMBER or token_type == tokenlib.NAME:
-            if result:
+            # a ufloat could be naked, meaning `nominal_value + / - std`, and parses as a NUMBER
+            # alas, we cannot simply consume the nominal_value and then see the +/- operator, because naive
+            # parsing of the nominal_value thinks it needs to eval the + as part of the nominal_value.
+            if (index + 4 < len(tokens)
+                    and _number_or_nan(tokens[index])
+                    and tokens[index + 1].string == '+'
+                    and tokens[index + 2].string == '/'
+                    and tokens[index + 3].string == '-'
+                    and _number_or_nan(tokens[index + 4])):
+                # The +/- operator binds tightest, so we don't need to end a previous binop
+                if tokens[index + 5].type == tokenlib.NUMBER:
+                    breakpoint()
+                # get nominal_value
+                left = EvalTreeNode(left=current_token)
+                plus_minus_line = tokens[index].line[tokens[index].start[1]:tokens[index + 4].end[1]]
+                plus_minus_start = tokens[index + 1].start
+                plus_minus_end = tokens[index + 3].end
+                plus_minus_operator = tokenize.TokenInfo(type=tokenlib.OP, string='+/-', start=plus_minus_start, end=plus_minus_end, line=plus_minus_line)
+                remaining_line = tokens[index].line[tokens[index + 4].end[1]:]
+
+                right, _ = build_eval_tree(
+                    [tokens[index + 4], tokens[-1]], op_priority, 0, 0, tokens[index + 4].string
+                )
+                result = EvalTreeNode(left, plus_minus_operator, right)
+                index += 4
+                continue
+            elif result:
                 # tokens with an implicit operation i.e. "1 kg"
                 if op_priority[""] <= op_priority.get(prev_op, -1):
                     # previous operator is higher priority than implicit, so end
I wonder whether a better approach would be to get a PEP moving that accepts the +/- formatting of uncertainties so they could be parsed more naturally. Or should I customize things at the tokenize library level?
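To see why the standard tokenizer forces the gymnastics in the diff above, note how `+/-` arrives as three separate OP tokens:

```python
import io
import tokenize

for tok in tokenize.generate_tokens(io.StringIO("(2.0 +/- 0.1) m").readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# The '+', '/', '-' come through as three OP tokens, so the eval-tree
# builder has to recognize and fuse them into a synthetic '+/-' token.
```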
Here's my pint fork with branch for parsing uncertainties: https://github.com/MichaelTiemannOSC/pint/tree/parse-uncertainties
And my pint-pandas fork that makes things more ducky: https://github.com/MichaelTiemannOSC/pint-pandas/tree/ducks-unlimited
I commented on the parsing of uncertainties. I would leave the pint-pandas comments to @andrewgsavage.
So I tried something similar. I wanted to use auto-differentiation so you could propagate uncertainties through arbitrary functions, a little like how Pint propagates units through most (all?) NumPy operations. I was only partially successful, but I think the basic idea should work, especially if you are better at Python than I am! My attempt is here: https://github.com/varchasgopalaswamy/AutoUncertainties
Discussed in https://github.com/hgrecco/pint/discussions/1604
Here's some sample code I'm playing with. I've commented out lines that throw errors.