nci / scores

Metrics for the verification, evaluation and optimisation of forecasts, predictions or models.
https://scores.readthedocs.io/
Apache License 2.0
34 stars 11 forks source link

PIT #142

Open nicholasloveday opened 5 months ago

nicholasloveday commented 5 months ago

Migrate the PIT function across from Jive.

nikeethr commented 2 months ago

Looks like the code from JIVE has been copied in which is useful.

I just noticed a couple of quick things, that maybe are non-issues but just dropping them here for future reference:

nicholasloveday commented 2 months ago

Thanks @nikeethr.

@rob-taggart also has some ideas on how the implementation can be improved before it's merged onto develop.

nikeethr commented 2 months ago

Extracting references:

nikeethr commented 2 months ago

Wikipedia definition

Just going to dump my notes to sound out my understanding, and add/correct them in further comments as my knowledge improves. So don't assume these as fact:

Assuming that the wikipedia definition is accurate (without direct access to its references), here are some foundations:

Side note: A parallel to the above method that I've used in the past in quantization algorithms is to have a bijection inv(F_x): Y -> X, where the "optimal" quantizer is chosen uniformly spaced in the interval [0, 1] and extracting the corresponding quantization intervals in X ~ F_x, the true distribution of the underlying samples.

nikeethr commented 2 months ago

The integrals discussed in https://github.com/nci/scores/issues/142#issuecomment-2106520187 are actually more about approximating continuous integration with computations over finite bins. The PIT while a general concept, in our application actually pertains to "measurable sets", and the domain applied over "measurable functions".

As such, alternatives to the Riemann integral (the one used in calculus and I was intending approximations for in the linked post) are: Jordan, Borel, W. H. Young, Lebesgue, and potentially more - that can be applied to measurable sets.

I'm not sure how much of this the original code uses or needs knowledge of this (since the final equations maybe easily representable as basic sums - again more reading required). Regardless, the Lebesgue measure forms some of foundation and derivation of PIT in various linear/non-linear strategies: https://doi.org/10.1214/13-EJS823

nikeethr commented 2 months ago

Intuitive analysis i.

Probability is a measure of likelihood of an event happening, its measurements are in the range [0, 1]. The domain in which it is defined (supported) for a random process is the measurable space. Integrals in measurable spaces have a bit more relaxed requirements when compared to Riemann integrals due to some additive properties, to do with measure functions over unions of open sets. A measurable function (f) is defined if for all real a the set { x | f(x) > a} is measurable.

If we look at the definition of a CDF for a extended positive real domain: F(x) = P(X <= x) = integral(from=0, to=x, f=pdf). We note that this isn't too different to the generic definition of a measurable function. The other deduction is that if the pdf (probability distribution function) has "gaps", it does not necessarily mean it isn't measurable. Rather, the likelihood of an event occurring there is 0. So the measurement of any subsets of the real line that are empty w.r.t "measurability" can be simply be forced to zero. Which further implies that P(X <= x) (most likely) continues to be well defined in these "empty" regions, just that it's value doesn't change.

In particular, rainfall intuitively is a composite of at least two measurable functions or distributions f_n : no rain, f_r : rain. For this purpose assuming f_n is a point mass at 0, and f_r is a continuous pdf defined for a > 0, with a being the amount of rain. Since the domains are disjoint we can't have the events of rain and no rain at the same time. Regardless, we still can compose them together as a measurable function. The intuitive way of doing this is shifting up the CDF defined by f_r by the probability of "no-rain" f_n, and scaling f_r - f_n such that the integral over the entire domain still is 1 (by definition of probability). In fact, this rather crude explanation can be thought of the weighted aggregation of the two distributions, for which more formal definitions are present in https://doi.org/10.1214/13-EJS823

Follow ups:

nikeethr commented 2 months ago

Intuitive analysis ii.

Before going into the follow-ups it may be worth considering how an implementation for a "naïve" case would look like.

We first define some primitive constraints:

Measurable: a constraint that a type has the ability to be measured i.e. converted into a real value
(e.g. float) via a measurable function.
    - measure :: m -> Real  -- a mapping that outputs a real number for a given input of the
                               underlying type
    - m is Additive and Ordered
    - empty :: 0  -- the empty measure is always 0
    - is_valid    -- runtime check if a function is measurable
    - measure_range :: m -> m -> Real  -- usually continuous functions have a measure of 0 for a
                                          point measurement (since its domain is not countable). 
                                          And measurement over a range makes more sense.

Bounded: data type has a min and max. Used to define the domain.
    - min, max

WeightedPartition: a collection of sets that span a domain with associated weights
    - partitions :: [Partition]
    - weights :: Partition -> Real

Disjoint: a collection of sets that do not overlap.
    - is_valid  -- runtime check if the collection is disjoint

IntegralMeasure: A measurable that can be integrated
    - interval :: Partition -> Distance   -- the resolution of the integral for a given partition
                                             given by a distance measure.
    - integral :: PartitionBounds -> Real -- method used to summate the underlying measure for a
                                             partition domain
    - is Measurable, has Ordered, Disjoint, and Bounded domain

WeightedProbabilityMeasure: Now we can see what a probability measure (i.e. pdf) would look like
    - measure :: m -> [0, 1]
    - measure_range :: m -> m -> [0, 1]
    - partitions
    - weights
    - is Measurable, has an IntegralMeasure, WeightedPartition and for now assume it is Disjoint in its domain
    - is_valid  -- all associated validity checks. Additionally, measurement over entire domain must be 1 or tend to 1.

Then, naturally we can compose a measurement function of a pdf for rainfall for the contrived example. For rainfall with point mass at x = 0 with point probability p(x) = 0.25, and e.g. the +ve half of the standard normal curve for x > 0 as the associated pdf, the composition might look like:

If we formulate the constraints this way, it allows for extensibility to arbitrary definitions prior to computing the PIT. The example above illustrates the combination of a single pmf and pdf, but is not restricted in its extensibility. We also need to explore a few more concepts, Invertible and partitions that are not Disjoint (i.e. overlapping), and their associated strategies.

nikeethr commented 2 months ago

Aside: A quick look at the current implementation shows that the cdf is pre-computed, and the input arguments indicate/hint at locations of the point mass, effectively acting as the partitioning and weighting logic in https://github.com/nci/scores/issues/142#issuecomment-2118825844.

Additionally, the integrals used in the implementation are mainly for variance calculations in the pit_scores outputs. They are used to make inferences on "dispersion", and are calculated using piece-wise linear approximations between partitions. Intuitively the integrand x * (1 - cdf(x)), x >= 0 should also be measurable, so most of https://github.com/nci/scores/issues/142#issuecomment-2118825844 should still be relevant. So there are overlapping data structures between pit calculations and the cdf calculations.