nansencenter / DAPPER

Data Assimilation with Python: a Package for Experimental Research
https://nansencenter.github.io/DAPPER
MIT License

UncertainQtty string formatting, feature or bug? #16

Closed: yumengch closed this issue 3 years ago

yumengch commented 3 years ago

Sorry for my recent silence.

The function mean_with_conf seems to rank quite high on the TODO list. One part of the code is:

if (not np.isfinite(mu)) or N <= 5:
    uq = UncertainQtty(mu, np.nan)

This means that for short time series the confidence interval is NaN, which is understandable.
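
For context, here is a heavily simplified stand-in (not DAPPER's actual mean_with_conf, which estimates the confidence more carefully) that reproduces the short-series branch quoted above:

import numpy as np

def mean_with_conf_sketch(xx):
    """Toy illustration: mean plus a naive standard-error 'confidence'."""
    N = len(xx)
    mu = np.mean(xx)
    # Short (or non-finite) series: no meaningful confidence estimate.
    if (not np.isfinite(mu)) or N <= 5:
        return mu, np.nan
    return mu, np.sqrt(np.var(xx, ddof=1) / N)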

However, a quick test:

>>> np.random.random(4)
array([0.83974936, 0.80939852, 0.9260896 , 0.93455182])
>>> xx = np.random.random(4)
>>> mean_with_conf(xx)
UncertainQtty(val=0.475499999999999978239628717346931807696819305419921875000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000, conf=nan)

The reason is that, in __str__ of class UncertainQtty, n = log10int(c) is -300 since c is NaN. This leads to a %0.300f format for the mean value. I wonder whether this is by design or a bug?
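
To illustrate the mechanism in isolation (a minimal sketch, not DAPPER code; the -300 fallback value of log10int is taken from the observation above):

# If log10int(nan) falls back to -300, __str__ builds "%0.300f",
# which prints the mean with 300 decimal places.
n = -300
frmt = "%.f" if n >= 0 else "%%0.%df" % -n   # -> "%0.300f"
print(frmt % 0.4755)   # 0.47549999... padded to 300 decimals, as above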

Thank you.

patnr commented 3 years ago

Bug, I think 🤔. But where to fix it? As you say, both NaN and 300 kind of make sense.

yumengch commented 3 years ago

The mean value and confidence interval are already rounded.

The problematic part tries to ensure that the number of decimals of the mean value matches that of the confidence interval.

NaN does not have a well-defined number of decimals, so we can avoid this case entirely. For a confidence interval of 0, it is not necessary to show long trailing values either.

I think we can circumvent the problem by keeping such extreme cases out of the formatting.

In the method def round(self, mult=1.0):, we use the following code:

# Extreme cases
if np.isnan(self.conf) or self.conf == 0:
    c = self.conf
    v = round2sigfig(self.val, rc.sigfig)

In the method def __str__(self):, we add a check for the extreme cases:

def __str__(self):
    # Round nicely
    v, c = self.round()
    # Ensure we get 1.30 ±0.01, NOT 1.3 ±0.01.
    n = log10int(c)
    if abs(n) == 300:
        # prevent long format for extreme cases 
        return "{} ±{}".format(v, c)
    frmt = "%.f" if n >= 0 else "%%0.%df" % -n
    v = frmt % v
    return "{} ±{}".format(v, c)

This should lead to reasonable representations; some examples:

>>> import numpy as np
>>> from dapper.tools.series import mean_with_conf, UncertainQtty
>>> UncertainQtty(val = 1.3, conf = 0.01)
UncertainQtty(val=1.30, conf=0.01)
>>> xx = np.random.random(4)
>>> mean_with_conf(xx)
UncertainQtty(val=0.2421, conf=nan)
>>> xx = np.random.random(100)*1e-6 + 2.
>>> mean_with_conf(xx)
UncertainQtty(val=2.0, conf=0)
>>> xx = np.random.random(100) + 3.
>>> mean_with_conf(xx)
UncertainQtty(val=3.51, conf=0.03)
patnr commented 3 years ago

Sorry I'm taking so long to reply, and thank you for the suggestions. I haven't been able to study your code suggestions fully yet. After a brief review, some questions:

Also, do you know if it's possible to use diffs in filing issues like these? You know, the kind that appears when reviewing pull requests. Just because I find it a little hard to read the suggested changes without diffs.

BTW, I'm in the process of implementing docstring tests (example). There are already a few tests in place around the code, but I have yet to include them explicitly in the test suite. Anyways, henceforth we should try to include them if possible.

yumengch commented 3 years ago

Also, do you know if it's possible to use diffs in filing issues like these? You know, the kind that appears when reviewing pull requests. Just because I find it a little hard to read the suggested changes without diffs.

Sorry, I hadn't realised this was an issue. I am not aware of any way of doing exactly what you want at the moment (probably I should do more googling). One alternative might be pulling from my changed code. But I can provide my diff history here:

diff --git a/dapper/tools/series.py b/dapper/tools/series.py
index 681bdba..6c52c8b 100644
--- a/dapper/tools/series.py
+++ b/dapper/tools/series.py
@@ -111,10 +111,10 @@ class UncertainQtty():
             - fallback: rc.sigfig
         """
         # Extreme cases
-        if np.isnan(self.conf):
+        if np.isnan(self.conf) or self.conf == 0:
             c = self.conf
             v = round2sigfig(self.val, rc.sigfig)
         else:
             c = round2sigfig(self.conf, 1)
             v = round2(self.val, mult*self.conf)
@@ -125,6 +125,7 @@ class UncertainQtty():
         v, c = self.round()
         # Ensure we get 1.30 ±0.01, NOT 1.3 ±0.01.
         n = log10int(c)
+        if abs(n) == 300: return "{} ±{}".format(v, c)
         frmt = "%.f" if n >= 0 else "%%0.%df" % -n
         v = frmt % v
         return "{} ±{}".format(v, c)

BTW, I'm in the process of implementing docstring tests (example). There are already a few tests in place around the code, but I have yet to include them explicitly in the test suite. Anyways, henceforth we should try to include them if possible.

This is good. I believe it might be a great way for unit testing in the future.

patnr commented 3 years ago

Sorry again for the delay! Thanks for the clarifications. The diff looks great btw :-)

I agree conf==0 is a special case. However, to my mind it is already well handled in .round:

>>> UncertainQtty(1.2, 0).round()
(1.2, 0)

Now .round might have other purposes outside of __str__, so I think changes should be made only to __str__.

Now, as you say, np.nan/np.inf/0 should be printed directly. This is also another reason not to rely on 300 as a magic number. So, indeed, we should instead check for those special cases explicitly.

The case conf==0 should actually not fall back to rc.sigfig because that is intended for NaNs, whereas conf==0 is its own special case, which merits more (but not 300!) decimals.

I've made a set of examples/doctests that __str__ should satisfy. You can run them in the terminal using pytest --doctest-modules dapper/tools/series.py. Here they are:

@dataclass
class UncertainQtty():
    """Data container associating uncertainty (confidence) to a quantity.

    Includes intelligent rounding and printing functionality.

    Examples:
    >>> for c in [.01, .1, .2, .9, 1]:
    ...    print(UncertainQtty(1.2345, c))
    1.23 ±0.01
    1.2 ±0.1
    1.2 ±0.2
    1.2 ±0.9
    1 ±1

    >>> for c in [.01, 1e-10, 1e-17, 0]:
    ...    print(UncertainQtty(1.2, c))
    1.20 ±0.01
    1.2000000000 ±1e-10
    1.19999999999999996 ±1e-17
    1.2000000000 ±0

    Note that in the case of a confidence of exactly 0,
    it defaults to 10 decimal places.
    Meanwhile, a NaN confidence yields printing using `rc.sigfig`:

    >>> print(UncertainQtty(1.234567, np.nan))
    1.235 ±nan

    Also note the effect of large uncertainty:

    >>> for c in [1, 9, 10, 11, 20, 100, np.inf]:
    ...    print(UncertainQtty(12, c))
    12 ±1
    12 ±9
    10 ±10
    10 ±10
    10 ±20
    0 ±100
    0 ±inf
    """

    val: float
    conf: float

    def round(self, mult=1.0):  # noqa
        """Round intelligently.

        - `conf` to 1 sig. fig.
        - `val`:
            - to precision: `mult*conf`
            - fallback: `rc.sigfig`
        """
        if np.isnan(self.conf):
            # Fallback to rc.sigfig
            c = self.conf
            v = round2sigfig(self.val, rc.sigfig)
        else:
            # Normal/general case
            c = round2sigfig(self.conf, 1)
            v = round2(self.val, mult*self.conf)
        return v, c

    def __str__(self):
        """Returns 'val ±conf'.

        The value string contains the number of decimals required by the confidence.

        Note: the special cases require processing on top of `round`.
        """
        v, c = self.round()
        if np.isnan(c):
            # Rounding to fallback (rc.sigfig) already took place
            return f"{v} ±{c}"
        if c == 0:
            # 0 (i.e. not 1e-300) never arises "naturally" => Treat it "magically"
            # by truncating to a default. Also see https://stackoverflow.com/a/25899600
            n = -10
        else:
            # Normal/general case.
            n = log10int(c)
        # Ensure we get 1.30 ±0.01, NOT 1.3 ±0.01:
        frmt = "%.f"
        if n < 0:
            frmt = "%%0.%df" % -n
        v = frmt % v
        return f"{v} ±{c}"

    def __repr__(self):
        """Essentially the same as `__str__`."""
        v, c = str(self).split(" ±")
        return self.__class__.__name__ + f"(val={v}, conf={c})"

I'm not 100% satisfied with the example producing 1.19999999999999996 ±1e-17 but I think that's just something one has to live with in the case of decimal representation of floats (fixing it would probably require much hacking).
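
For the record, a minimal illustration of why that happens (not part of the doctests): 1.2 has no exact binary representation, so printing it with the 17 decimals demanded by a conf of 1e-17 exposes the rounding error.

# 1.2 is stored as the nearest binary float; 17 decimals reveal the error.
print("%0.17f" % 1.2)   # 1.19999999999999996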

Finally, I think it might be reasonable to move this class to dapper/tools/rounding.py because it is now mostly about rounding. What do you think?

patnr commented 3 years ago

Probably, also, conf should be renamed prec because one might think that bigger conf means bigger "confidence", whereas the code treats it as "confidence interval" (i.e. smaller is better).

yumengch commented 3 years ago

Thanks Patrick.

I made a few simple tests and I think it is reasonable. The compromise of 10 digits is also acceptable in my opinion.

Finally, I think it might be reasonable to move this class to dapper/tools/rounding.py because it is now mostly about rounding. What do you think?

I think we can move all rounding-related functions/classes into a single file.

Probably, also, conf should be renamed prec because one might think that bigger conf means bigger "confidence", whereas the code treats it as "confidence interval" (i.e. smaller is better).

Great idea to reduce confusion.