os-climate / ITR

This Python module implements the ITR methodology.
Apache License 2.0

Add uncertainties to ITR tool #156

Open MichaelTiemannOSC opened 1 year ago

MichaelTiemannOSC commented 1 year ago

There are two ways we may find uncertainties in data: (1) it is reported to us with an explicit uncertainty, or (2) it is estimated in a way that introduces uncertainties.

In the first instance, a 2021 BCG report found that only 9% of reporting companies can estimate their emissions comprehensively, and even among those, the reported numbers were estimated to carry roughly 40% uncertainty.

In the second instance, we may be given an aggregate number (such as Scope 3 emissions) that we cannot properly work with as such. But we can use sectoral averages to break it into constituent elements that then have validity with respect to various benchmarks, and attach an uncertainty to reflect that our allocation into components is itself an estimate. (Those uncertainties are far more manageable than a whole-cloth number, more than half of which should be discarded lest it lead to certainly wrong double-counting.)
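
Here is a minimal sketch of that allocation idea, assuming the `uncertainties` package; the category names, shares, and 25% uncertainty level are made-up numbers for illustration only:

```python
from uncertainties import ufloat

scope3_total = 10_000.0          # reported aggregate, e.g. tonnes CO2e (illustrative)
sector_shares = {"upstream": 0.6, "downstream": 0.4}   # hypothetical sectoral averages
allocation_uncertainty = 0.25    # assumed 25% relative uncertainty in the split

# Allocate the aggregate into constituents, each carrying an uncertainty that
# reflects that the split is an estimate rather than reported data.
scope3_split = {
    category: ufloat(scope3_total * share,
                     scope3_total * share * allocation_uncertainty)
    for category, share in sector_shares.items()
}
# e.g. {'upstream': 6000.0+/-1500.0, 'downstream': 4000.0+/-1000.0}
```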

Presently we use float inputs to Pint to create unit-aware data. This issue will be addressed by feeding float inputs to Python's uncertainties package, which we should then be able to wrap with units.
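
A minimal sketch of what that looks like, assuming `pint` and `uncertainties` are installed (the quantities are illustrative):

```python
from pint import UnitRegistry
from uncertainties import ufloat

ureg = UnitRegistry()

# 1000 t of Scope 1 emissions reported with +/- 400 t uncertainty (made-up numbers)
ghg_s1 = ureg.Quantity(ufloat(1000.0, 400.0), "t")

# Unit conversion and arithmetic propagate both the units and the uncertainty
print(ghg_s1.to("kg"))                               # roughly (1000000+/-400000) kilogram
intensity = ghg_s1 / ureg.Quantity(2000.0, "MWh")
print(intensity)                                     # roughly (0.5+/-0.2) metric_ton / megawatt_hour
```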

Presently we store unitized data in Trino by dequantifying the data into a magnitude and a unit. For example, to store GHG Scope 1 emissions in a table, the value ghg_s1 becomes two columns: ghg_s1 (a floating-point value) and ghg_s1_unit (a string that can be parsed by the Pint registry). We will want to enhance this to create three columns: a magnitude (ghg_s1), a unit (ghg_s1_unit), and an uncertainty (ghg_s1_unc).
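
A hypothetical sketch of that three-column split; the function name and row layout are illustrative, not the tool's actual schema:

```python
from pint import UnitRegistry
from uncertainties import UFloat, ufloat

ureg = UnitRegistry()

def dequantify_with_uncertainty(q):
    """Return (magnitude, unit string, uncertainty) for storage as three columns."""
    mag = q.magnitude
    unit = str(q.units)
    if isinstance(mag, UFloat):
        return mag.nominal_value, unit, mag.std_dev
    return float(mag), unit, 0.0

ghg_s1 = ureg.Quantity(ufloat(1000.0, 400.0), "t")
row = dict(zip(("ghg_s1", "ghg_s1_unit", "ghg_s1_unc"),
               dequantify_with_uncertainty(ghg_s1)))
# row == {'ghg_s1': 1000.0, 'ghg_s1_unit': 'metric_ton', 'ghg_s1_unc': 400.0}
```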

Unit tests for uncertainties should also be created, covering at least:

The pint people are not yet convinced the following are generally possible, but it looks to me like they are, based on how well duck arrays work today. We should create test cases to prove that they are.

I have run the Pandas (200K unit tests), Pint, and Pint-Pandas test suites, and submitted some test cases to Pint and Pint-Pandas. But we should have our own unit tests that check the sanity of what we are doing in our own realm, even where those checks would not make good test cases in the other projects. The Pandas people rejected a test case I submitted to them because they don't have uncertainties in their CI/CD system (and are not yet ready to add it).
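
As a sketch of the kind of sanity check meant here (not part of any existing suite; names and tolerances are illustrative), a test could verify that uncertainty survives unit-aware arithmetic:

```python
import pytest
from pint import UnitRegistry
from uncertainties import ufloat

ureg = UnitRegistry()

def test_uncertainty_propagates_through_division():
    emissions = ureg.Quantity(ufloat(1000.0, 100.0), "t")
    production = ureg.Quantity(2000.0, "MWh")
    intensity = emissions / production
    # Units combine as expected and the relative uncertainty carries through
    assert intensity.check("[mass] / [energy]")
    assert intensity.magnitude.nominal_value == pytest.approx(0.5)
    assert intensity.magnitude.std_dev == pytest.approx(0.05)
```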

@LeylaJavadova @ImkeHorten

kmarinushkin commented 1 year ago

@MichaelTiemannOSC since we currently only work with ghg_s1s2, it makes sense for me to start with ghg_s1s2_unc, right? In our existing default sample xls, what uncertainty would you suggest assigning to existing records? Since we didn't track it before, and its uncertainty is most probably not zero, what should our starting value be?

MichaelTiemannOSC commented 1 year ago

Yes. ghg_s1 was just an example identifier name. I could have used foo, foo_unit, and foo_unc.

What I'd like to do is start by applying only the uncertainties we believe we are introducing, based on a particular inference/imputation. Later we may find it acceptable to report with an uncertainty value of X, or to report that one is assuring to a level of Y, but that's a bit of a ways out. The first thing is to see what happens to our math when we change from unitized quantities to unitized quantities with uncertainties.

MichaelTiemannOSC commented 8 months ago

@FannySternfeld The pint people not only accepted my uncertainties changes, but they made me a collaborator in the project. So that's good news. I got the impression from some of your colleagues that being able to graph uncertainties (especially for Scope 3) is a very nice feature of the tool. To make it more official, the tool needs some unit tests (as described above).

If this is something you definitely don't want to do, I'll unassign it from you. If it is something you are OK adding to the queue, it can probably wait until the end of the year.