nicholasloveday opened 8 months ago
Looks like the code from JIVE has been copied in, which is useful.
I just noticed a couple of quick things that may be non-issues, but I'm dropping them here for future reference:

`.where` is usually fast if all the data is loaded in memory (usually the case in numpy). But xarray tends to lazy load things, so using `.where` might actually be slow because the underlying C implementation does not yet know the extent of the data at the point where it is called. A simple `da.load()`, or operating directly on numpy by referencing `da.values`, may resolve this prior to `.where` searches on not-so-small data. Pre-loading may also make memory de-allocation a bit simpler for the garbage collector.
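For illustration, a minimal sketch of the pre-loading idea (the `DataArray` here is hypothetical, and the chunked example assumes dask is installed):

```python
import numpy as np
import xarray as xr

# A lazily evaluated (chunked/dask-backed) DataArray, e.g. from open_dataset(..., chunks=...)
da = xr.DataArray(np.random.rand(1000, 1000), dims=("x", "y")).chunk({"x": 100})

# Option 1: force the data into memory before the .where search
da = da.load()
masked = da.where(da > 0.5)

# Option 2: drop to numpy directly; .values triggers computation for dask-backed arrays
vals = da.values
masked_np = np.where(vals > 0.5, vals, np.nan)
```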
Thanks @nikeethr. @rob-taggart also has some ideas on how the implementation can be improved before it's merged onto develop.
Extracting references:
- Wikipedia definition
Just going to dump my notes to sound out my understanding, and add/correct them in further comments as my knowledge improves. So don't assume these as fact.

Assuming that the wikipedia definition is accurate (without direct access to its references), here are some foundations:

- `Y = F_x(X)` is uniformly distributed, `Y ~ U[0, 1]`, given a CDF `F_x` of a random variable `X`.
- If `X` is continuous and `inv(F_x)` exists, i.e. there is a bijection (one-to-one mapping) `F_x: x -> y` for each `x` in `X` to each `y` in `Y`, then we can easily derive an inverse distribution which we can use to map observations to a uniform random variable. If the transformed observation(s) follow a uniform distribution, then we can say that the observations (or push-forward measure) match the CDF `F_x`.
- If `inv(F_x)` does not exist, e.g. in the case that `F_x` is not bijective or is discontinuous, we can still formulate an alternative to its inverse by finding the greatest lower bound that satisfies `F_x(x) >= y`; depending on the underlying pdf, one might need to take `inf { x : F_x(x) >= y }`, where `inf` is the infimum (see the sketch below the side note).

Side note: A parallel to the above method that I've used in the past in quantization algorithms is to have a bijection `inv(F_x): Y -> X`, where the "optimal" quantizer is chosen uniformly spaced in the interval `[0, 1]` and the corresponding quantization intervals are extracted in `X ~ F_x`, the true distribution of the underlying samples.
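A minimal sketch of that generalized inverse on a tabulated CDF (the function and grid here are illustrative, not from the Jive code):

```python
import numpy as np

def generalized_inverse_cdf(x_grid, cdf_values, y):
    """Approximate inf { x : F_x(x) >= y } on a finite grid.

    cdf_values must be non-decreasing; np.searchsorted finds the first
    index where the CDF reaches y, i.e. the greatest lower bound.
    """
    idx = np.searchsorted(cdf_values, y, side="left")
    return x_grid[np.clip(idx, 0, len(x_grid) - 1)]

# Example: exponential CDF, where the inverse exists in closed form (ln 2 ≈ 0.693)
x = np.linspace(0.0, 10.0, 10001)
print(generalized_inverse_cdf(x, 1.0 - np.exp(-x), 0.5))
```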
The integrals discussed in https://github.com/nci/scores/issues/142#issuecomment-2106520187 are actually more about approximating continuous integration with computations over finite bins. The PIT, while a general concept, in our application actually pertains to "measurable sets", with the domain applied over "measurable functions".

As such, alternatives to the Riemann integral (the one used in calculus, and the one I was intending approximations for in the linked post) are: Jordan, Borel, W. H. Young, Lebesgue, and potentially more, that can be applied to measurable sets.

I'm not sure how much of this the original code uses or needs (since the final equations may be easily representable as basic sums - again, more reading required). Regardless, the Lebesgue measure forms some of the foundation and derivation of PIT in various linear/non-linear strategies: https://doi.org/10.1214/13-EJS823
Intuitive analysis i.

Probability is a measure of the likelihood of an event happening; its measurements are in the range [0, 1]. The domain in which it is defined (supported) for a random process is the measurable space. Integrals in measurable spaces have somewhat more relaxed requirements compared to Riemann integrals, due to additive properties of measure functions over unions of open sets. A measurable function `f` is defined if for all real `a` the set `{ x | f(x) > a }` is measurable.

If we look at the definition of a CDF for an extended positive real domain: `F(x) = P(X <= x) = integral(from=0, to=x, f=pdf)`. We note that this isn't too different from the generic definition of a measurable function. The other deduction is that if the `pdf` (probability density function) has "gaps", it does not necessarily mean it isn't measurable. Rather, the likelihood of an event occurring there is 0. So the measurement of any subsets of the real line that are empty w.r.t. "measurability" can simply be forced to zero. Which further implies that `P(X <= x)` (most likely) continues to be well defined in these "empty" regions, just that its value doesn't change.
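A small illustration of the "gaps" point, assuming a contrived density that is zero on the interval (1, 2):

```python
import numpy as np

# Contrived pdf with a "gap": zero density on (1, 2), uniform on [0, 1] and [2, 3]
def pdf(x):
    x = np.asarray(x, dtype=float)
    return np.where(((x >= 0) & (x <= 1)) | ((x >= 2) & (x <= 3)), 0.5, 0.0)

x = np.linspace(0.0, 3.0, 3001)
cdf = np.cumsum(pdf(x)) * (x[1] - x[0])  # simple left-Riemann approximation
# The CDF is flat on (1, 2): P(X <= x) stays well defined there,
# its value just doesn't change, matching the intuition above.
```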
In particular, rainfall intuitively is a composite of at least two measurable functions or distributions: `f_n` (no rain) and `f_r` (rain). For this purpose assume `f_n` is a point mass at 0, and `f_r` is a continuous `pdf` defined for `a > 0`, with `a` being the amount of rain. Since the domains are disjoint we can't have the events of rain and no rain at the same time. Regardless, we can still compose them together as a measurable function. The intuitive way of doing this is shifting up the CDF defined by `f_r` by the probability of "no rain" `f_n`, and scaling `f_r` by `1 - f_n` such that the integral over the entire domain is still 1 (by definition of probability). In fact, this rather crude explanation can be thought of as the weighted aggregation of the two distributions, for which more formal definitions are present in https://doi.org/10.1214/13-EJS823
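A concrete sketch of that weighted aggregation, assuming (as in the contrived example further down) a point mass of 0.25 at zero and the positive half of a standard normal for the continuous part:

```python
import numpy as np
from scipy import stats

def rain_cdf(x, w=0.25):
    """Mixture CDF: point mass w at 0 ("no rain") plus a continuous part
    F_r shifted up by w and scaled by (1 - w) so the total mass is 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < 0, 0.0, w + (1 - w) * stats.halfnorm.cdf(x))

print(rain_cdf([0.0, 0.5, 2.0]))  # jumps to 0.25 at zero, then rises towards 1
```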
Follow ups:

- Does `pdf_inv ~ uniform` not have support below 0.25?
- If the support is the full `[0, 1]`, and the point mass at 0.25 is spread uniformly across `[0, 0.25]`, does this in effect mean a wider (or different) bin size for the cases where `x = 0`, i.e. no rain, in order to maintain uniformity?
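One common way to realise the "spread uniformly across [0, 0.25]" idea is a randomized PIT, where the jump of the CDF at a point mass is filled with a uniform draw. A minimal sketch, reusing the contrived rain distribution (the half-normal choice and all names are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
w = 0.25  # probability of no rain (point mass at zero)

def F(x):       # mixture CDF (right limit)
    return np.where(x < 0, 0.0, w + (1 - w) * stats.halfnorm.cdf(x))

def F_left(x):  # left limit F(x-), which differs from F only at the point mass
    return np.where(x <= 0, 0.0, w + (1 - w) * stats.halfnorm.cdf(x))

def randomized_pit(obs):
    """Draw uniformly in [F(x-), F(x)] so PIT values stay uniform on [0, 1]."""
    u = rng.uniform(size=np.shape(obs))
    return F_left(obs) + u * (F(obs) - F_left(obs))

obs = np.concatenate([np.zeros(250), stats.halfnorm.rvs(size=750, random_state=1)])
pit_vals = randomized_pit(obs)  # histogram should look flat on [0, 1]
```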
Intuitive analysis ii.

Before going into the follow-ups, it may be worth considering what an implementation for a "naïve" case would look like.
We first define some primitive constraints:
Measurable: a constraint that a type has the ability to be measured, i.e. converted into a real value (e.g. float) via a measurable function.
- measure :: m -> Real -- a mapping that outputs a real number for a given input of the underlying type
- m is Additive and Ordered
- empty :: 0 -- the empty measure is always 0
- is_valid -- runtime check if a function is measurable
- measure_range :: m -> m -> Real -- usually continuous functions have a measure of 0 for a point measurement (since the domain is not countable), so measurement over a range makes more sense

Bounded: data type has a min and max. Used to define the domain.
- min, max

WeightedPartition: a collection of sets that span a domain, with associated weights.
- partitions :: [Partition]
- weights :: Partition -> Real

Disjoint: a collection of sets that do not overlap.
- is_valid -- runtime check if the collection is disjoint

IntegralMeasure: a Measurable that can be integrated.
- interval :: Partition -> Distance -- the resolution of the integral for a given partition, given by a distance measure
- integral :: PartitionBounds -> Real -- method used to sum the underlying measure over a partition domain
- is Measurable; has an Ordered, Disjoint, and Bounded domain

WeightedProbabilityMeasure: now we can see what a probability measure (i.e. pdf) would look like.
- measure :: m -> [0, 1]
- measure_range :: m -> m -> [0, 1]
- partitions
- weights
- is Measurable; has an IntegralMeasure, WeightedPartition, and for now assume it is Disjoint in its domain
- is_valid -- all associated validity checks; additionally, measurement over the entire domain must be 1 or tend to 1
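These constraints could be sketched in Python roughly as follows (a hypothetical rendering using `typing.Protocol`; none of these names exist in the codebase):

```python
from typing import Protocol, Sequence, Tuple

Partition = Tuple[float, float]  # a set in the domain, here an interval

class Measurable(Protocol):
    """Sketch of the Measurable constraint: map values (and ranges) to reals."""
    def measure(self, x: float) -> float: ...
    def measure_range(self, lo: float, hi: float) -> float: ...

class WeightedPartition(Protocol):
    """Sketch of WeightedPartition: sets spanning the domain, with weights."""
    def partitions(self) -> Sequence[Partition]: ...
    def weights(self, partition: Partition) -> float: ...
```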
Then, naturally, we can compose a measurement function of a pdf for rainfall for the contrived example. For rainfall with a point mass at `x = 0` with point probability `p(x) = 0.25`, and e.g. the +ve half of the standard normal curve for `x > 0` as the associated pdf, the composition might look like:

- measures: `[1, pdf]`
- partitions: `[x = 0, x > 0]`
- weights: `[w, (1 - w)]`, where `w` is the measure at `x = 0`, e.g. 0.25
- intervals: `[{0}, {every 0.1}]` - in this case the integral resolution of the `pdf` is 0.1; the `pmf` doesn't have a resolution since it's just a single value

integral 0 -> t, t >= 0:
[
    w * (measure 0) | t == 0,
    w * (measure 0) + (1 - w) * (sum i=0.1->t: (measure_range i i+0.1) * 0.1) | t > 0
]
Note: the integral here is just using the simple sum, though it can be replaced by other methods, e.g. `trapz`.
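For concreteness, a runnable sketch of the composed integral above (the half-normal density, the 0.1 resolution, and all names are assumptions for this contrived case):

```python
import numpy as np

def composed_integral(t, w=0.25, resolution=0.1):
    """Point mass of weight w at 0, plus a continuous part integrated with a
    simple left-Riemann sum at the given resolution (could be swapped for trapz)."""
    halfnorm_pdf = lambda x: np.sqrt(2.0 / np.pi) * np.exp(-x**2 / 2.0)
    if t < 0:
        return 0.0
    total = w * 1.0  # w * (measure 0): the point mass contributes its full weight
    if t > 0:
        grid = np.arange(0.0, t, resolution)
        total += (1 - w) * np.sum(halfnorm_pdf(grid) * resolution)
    return total

print(composed_integral(0.0), composed_integral(5.0))  # 0.25, then ≈ 1 up to discretisation error
```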
If we formulate the constraints this way, it allows for extensibility to arbitrary definitions prior to computing the PIT. The example above illustrates the combination of a single `pmf` and `pdf`, but is not restricted in its extensibility. We also need to explore a few more concepts, namely `Invertible` and partitions that are not `Disjoint` (i.e. overlapping), and their associated strategies.
Aside: A quick look at the current implementation shows that the `cdf` is pre-computed, and the input arguments indicate/hint at locations of the point mass, effectively acting as the partitioning and weighting logic in https://github.com/nci/scores/issues/142#issuecomment-2118825844.

Additionally, the integrals used in the implementation are mainly for variance calculations in the `pit_scores` outputs. They are used to make inferences on "dispersion", and are calculated using piece-wise linear approximations between partitions. Intuitively, the integrand `x * (1 - cdf(x)), x >= 0` should also be measurable, so most of https://github.com/nci/scores/issues/142#issuecomment-2118825844 should still be relevant. So there are overlapping data structures between the PIT calculations and the cdf calculations.
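A piece-wise linear (trapezoidal) approximation of that integrand might look like the following sketch; the exponential CDF is an assumed stand-in, not what the implementation uses:

```python
import numpy as np

x = np.linspace(0.0, 10.0, 1001)   # partition points
cdf = 1.0 - np.exp(-x)             # assumed example CDF (exponential)
integrand = x * (1.0 - cdf)        # the integrand discussed above, x >= 0

# Trapezoidal rule: piece-wise linear between partition points
approx = np.sum((integrand[1:] + integrand[:-1]) / 2.0 * np.diff(x))
print(approx)  # tends to the exact integral of x·e^(-x), i.e. 1, as the upper limit grows
```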
Migrate the PIT function across from Jive.