nicholasloveday opened 8 months ago
Looks like the code from JIVE has been copied in, which is useful.
I just noticed a couple of quick things that may be non-issues, but I'm dropping them here for future reference:

`.where` is usually fast if all the data is loaded in memory (usually the case in numpy). But xarray tends to lazy load things, so using `.where` might actually be slow because the underlying C implementation does not yet know the extent of the data at the point where it is called. A simple `da.load()`, or operating directly on numpy by referencing `da.values`, may resolve this prior to `.where` searches on not-so-small data. Pre-loading may also make memory de-allocation a bit simpler for the garbage collector.
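For illustration, a minimal sketch of the pre-loading idea (the `DataArray` here is hypothetical, and the chunked example assumes dask is installed):

```python
import numpy as np
import xarray as xr

# A lazily evaluated (chunked/dask-backed) DataArray, e.g. from open_dataset(..., chunks=...)
da = xr.DataArray(np.random.rand(1000, 1000), dims=("x", "y")).chunk({"x": 100})

# Option 1: force the data into memory before the .where search
da = da.load()
masked = da.where(da > 0.5)

# Option 2: drop to numpy directly; .values triggers computation for dask-backed arrays
vals = da.values
masked_np = np.where(vals > 0.5, vals, np.nan)
```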
Thanks @nikeethr. @rob-taggart also has some ideas on how the implementation can be improved before it's merged onto develop.
Extracting references:
- Wikipedia definition
Just going to dump my notes to sound out my understanding, and add/correct them in further comments as my knowledge improves. So don't assume these as fact.

Assuming that the wikipedia definition is accurate (without direct access to its references), here are some foundations:

- `Y = F_x(X)` is uniformly distributed, `Y ~ U[0, 1]`, given a CDF `F_x` of a random variable `X`.
- If `X` is continuous and `inv(F_x)` exists, i.e. there is a bijection (one-to-one mapping) `F_x: x -> y` for each `x` in `X` to each `y` in `Y`, then we can easily derive an inverse distribution which we can use to map observations to a uniform random variable. If the transformed observation(s) follow a uniform distribution, then we can say that the observations (or push-forward measure) match the CDF `F_x`.
- If `inv(F_x)` does not exist, e.g. in the case that `F_x` is not bijective or is discontinuous, we can still formulate an alternative to its inverse by finding the greatest lower bound that satisfies `F_x(x) >= y`; depending on the underlying pdf, one might need to take `inf { x : F_x(x) >= y }`, where `inf` is the infimum (see the sketch below the side note).

Side note: A parallel to the above method that I've used in the past in quantization algorithms is to have a bijection `inv(F_x): Y -> X`, where the "optimal" quantizer is chosen uniformly spaced in the interval `[0, 1]` and the corresponding quantization intervals are extracted in `X ~ F_x`, the true distribution of the underlying samples.
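A minimal sketch of that generalized inverse on a tabulated CDF (the function and grid here are illustrative, not from the Jive code):

```python
import numpy as np

def generalized_inverse_cdf(x_grid, cdf_values, y):
    """Approximate inf { x : F_x(x) >= y } on a finite grid.

    cdf_values must be non-decreasing; np.searchsorted finds the first
    index where the CDF reaches y, i.e. the greatest lower bound.
    """
    idx = np.searchsorted(cdf_values, y, side="left")
    return x_grid[np.clip(idx, 0, len(x_grid) - 1)]

# Example: exponential CDF, where the inverse exists in closed form (ln 2 ≈ 0.693)
x = np.linspace(0.0, 10.0, 10001)
print(generalized_inverse_cdf(x, 1.0 - np.exp(-x), 0.5))
```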
The integrals discussed in https://github.com/nci/scores/issues/142#issuecomment-2106520187 are actually more about approximating continuous integration with computations over finite bins. The PIT, while a general concept, in our application actually pertains to "measurable sets", with the domain applied over "measurable functions".

As such, alternatives to the Riemann integral (the one used in calculus, and the one I was intending approximations for in the linked post) are: Jordan, Borel, W. H. Young, Lebesgue, and potentially more, that can be applied to measurable sets.

I'm not sure how much of this the original code uses or needs (since the final equations may be easily representable as basic sums - again, more reading required). Regardless, the Lebesgue measure forms some of the foundation and derivation of PIT in various linear/non-linear strategies: https://doi.org/10.1214/13-EJS823
Intuitive analysis i.

Probability is a measure of the likelihood of an event happening; its measurements are in the range [0, 1]. The domain in which it is defined (supported) for a random process is the measurable space. Integrals in measurable spaces have somewhat more relaxed requirements compared to Riemann integrals, due to additive properties of measure functions over unions of open sets. A measurable function `f` is defined if for all real `a` the set `{ x | f(x) > a }` is measurable.

If we look at the definition of a CDF for an extended positive real domain: `F(x) = P(X <= x) = integral(from=0, to=x, f=pdf)`. We note that this isn't too different from the generic definition of a measurable function. The other deduction is that if the `pdf` (probability density function) has "gaps", it does not necessarily mean it isn't measurable. Rather, the likelihood of an event occurring there is 0. So the measurement of any subsets of the real line that are empty w.r.t. "measurability" can simply be forced to zero. Which further implies that `P(X <= x)` (most likely) continues to be well defined in these "empty" regions, just that its value doesn't change.
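A small illustration of the "gaps" point, assuming a contrived density that is zero on the interval (1, 2):

```python
import numpy as np

# Contrived pdf with a "gap": zero density on (1, 2), uniform on [0, 1] and [2, 3]
def pdf(x):
    x = np.asarray(x, dtype=float)
    return np.where(((x >= 0) & (x <= 1)) | ((x >= 2) & (x <= 3)), 0.5, 0.0)

x = np.linspace(0.0, 3.0, 3001)
cdf = np.cumsum(pdf(x)) * (x[1] - x[0])  # simple left-Riemann approximation
# The CDF is flat on (1, 2): P(X <= x) stays well defined there,
# its value just doesn't change, matching the intuition above.
```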
In particular, rainfall intuitively is a composite of at least two measurable functions or distributions: `f_n` (no rain) and `f_r` (rain). For this purpose assume `f_n` is a point mass at 0, and `f_r` is a continuous `pdf` defined for `a > 0`, with `a` being the amount of rain. Since the domains are disjoint we can't have the events of rain and no rain at the same time. Regardless, we can still compose them together as a measurable function. The intuitive way of doing this is shifting up the CDF defined by `f_r` by the probability of "no rain" `f_n`, and scaling `f_r` by `1 - f_n` such that the integral over the entire domain is still 1 (by definition of probability). In fact, this rather crude explanation can be thought of as the weighted aggregation of the two distributions, for which more formal definitions are present in https://doi.org/10.1214/13-EJS823
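A concrete sketch of that weighted aggregation, assuming (as in the contrived example further down) a point mass of 0.25 at zero and the positive half of a standard normal for the continuous part:

```python
import numpy as np
from scipy import stats

def rain_cdf(x, w=0.25):
    """Mixture CDF: point mass w at 0 ("no rain") plus a continuous part
    F_r shifted up by w and scaled by (1 - w) so the total mass is 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < 0, 0.0, w + (1 - w) * stats.halfnorm.cdf(x))

print(rain_cdf([0.0, 0.5, 2.0]))  # jumps to 0.25 at zero, then rises towards 1
```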
Follow ups:

- Does `pdf_inv ~ uniform` not have support below 0.25?
- If the support is the full `[0, 1]`, and the point mass at 0.25 is spread uniformly across `[0, 0.25]`, does this in effect mean a wider (or different) bin size for the cases where `x = 0`, i.e. no rain, in order to maintain uniformity?
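One common way to realise the "spread uniformly across [0, 0.25]" idea is a randomized PIT, where the jump of the CDF at a point mass is filled with a uniform draw. A minimal sketch, reusing the contrived rain distribution (the half-normal choice and all names are assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
w = 0.25  # probability of no rain (point mass at zero)

def F(x):       # mixture CDF (right limit)
    return np.where(x < 0, 0.0, w + (1 - w) * stats.halfnorm.cdf(x))

def F_left(x):  # left limit F(x-), which differs from F only at the point mass
    return np.where(x <= 0, 0.0, w + (1 - w) * stats.halfnorm.cdf(x))

def randomized_pit(obs):
    """Draw uniformly in [F(x-), F(x)] so PIT values stay uniform on [0, 1]."""
    u = rng.uniform(size=np.shape(obs))
    return F_left(obs) + u * (F(obs) - F_left(obs))

obs = np.concatenate([np.zeros(250), stats.halfnorm.rvs(size=750, random_state=1)])
pit_vals = randomized_pit(obs)  # histogram should look flat on [0, 1]
```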
Intuitive analysis ii.

Before going into the follow-ups, it may be worth considering what an implementation for a "naïve" case would look like.
We first define some primitive constraints:
Measurable: a constraint that a type has the ability to be measured, i.e. converted into a real value (e.g. float) via a measurable function.
- measure :: m -> Real -- a mapping that outputs a real number for a given input of the underlying type
- m is Additive and Ordered
- empty :: 0 -- the empty measure is always 0
- is_valid -- runtime check if a function is measurable
- measure_range :: m -> m -> Real -- usually continuous functions have a measure of 0 for a point measurement (since the domain is not countable), so measurement over a range makes more sense

Bounded: data type has a min and max. Used to define the domain.
- min, max

WeightedPartition: a collection of sets that span a domain, with associated weights.
- partitions :: [Partition]
- weights :: Partition -> Real

Disjoint: a collection of sets that do not overlap.
- is_valid -- runtime check if the collection is disjoint

IntegralMeasure: a Measurable that can be integrated.
- interval :: Partition -> Distance -- the resolution of the integral for a given partition, given by a distance measure
- integral :: PartitionBounds -> Real -- method used to sum the underlying measure over a partition domain
- is Measurable; has an Ordered, Disjoint, and Bounded domain

WeightedProbabilityMeasure: now we can see what a probability measure (i.e. pdf) would look like.
- measure :: m -> [0, 1]
- measure_range :: m -> m -> [0, 1]
- partitions
- weights
- is Measurable; has an IntegralMeasure, WeightedPartition, and for now assume it is Disjoint in its domain
- is_valid -- all associated validity checks; additionally, measurement over the entire domain must be 1 or tend to 1
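These constraints could be sketched in Python roughly as follows (a hypothetical rendering using `typing.Protocol`; none of these names exist in the codebase):

```python
from typing import Protocol, Sequence, Tuple

Partition = Tuple[float, float]  # a set in the domain, here an interval

class Measurable(Protocol):
    """Sketch of the Measurable constraint: map values (and ranges) to reals."""
    def measure(self, x: float) -> float: ...
    def measure_range(self, lo: float, hi: float) -> float: ...

class WeightedPartition(Protocol):
    """Sketch of WeightedPartition: sets spanning the domain, with weights."""
    def partitions(self) -> Sequence[Partition]: ...
    def weights(self, partition: Partition) -> float: ...
```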
Then, naturally, we can compose a measurement function of a pdf for rainfall for the contrived example. For rainfall with a point mass at `x = 0` with point probability `p(x) = 0.25`, and e.g. the +ve half of the standard normal curve for `x > 0` as the associated pdf, the composition might look like:

- measures: `[1, pdf]`
- partitions: `[x = 0, x > 0]`
- weights: `[w, (1 - w)]`, where `w` is the measure at `x = 0`, e.g. 0.25
- intervals: `[{0}, {every 0.1}]` - in this case the integral resolution of the `pdf` is 0.1; the `pmf` doesn't have a resolution since it's just a single value

integral 0 -> t, t >= 0:
[
    w * (measure 0) | t == 0,
    w * (measure 0) + (1 - w) * (sum i=0.1->t: (measure_range i i+0.1) * 0.1) | t > 0
]
Note: the integral here is just using the simple sum, though it can be replaced by other methods, e.g. `trapz`.
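For concreteness, a runnable sketch of the composed integral above (the half-normal density, the 0.1 resolution, and all names are assumptions for this contrived case):

```python
import numpy as np

def composed_integral(t, w=0.25, resolution=0.1):
    """Point mass of weight w at 0, plus a continuous part integrated with a
    simple left-Riemann sum at the given resolution (could be swapped for trapz)."""
    halfnorm_pdf = lambda x: np.sqrt(2.0 / np.pi) * np.exp(-x**2 / 2.0)
    if t < 0:
        return 0.0
    total = w * 1.0  # w * (measure 0): the point mass contributes its full weight
    if t > 0:
        grid = np.arange(0.0, t, resolution)
        total += (1 - w) * np.sum(halfnorm_pdf(grid) * resolution)
    return total

print(composed_integral(0.0), composed_integral(5.0))  # 0.25, then ≈ 1 up to discretisation error
```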
If we formulate the constraints this way, it allows for extensibility to arbitrary definitions prior to computing the PIT. The example above illustrates the combination of a single `pmf` and `pdf`, but is not restricted in its extensibility. We also need to explore a few more concepts, namely `Invertible` and partitions that are not `Disjoint` (i.e. overlapping), and their associated strategies.
Aside: A quick look at the current implementation shows that the `cdf` is pre-computed, and the input arguments indicate/hint at locations of the point mass, effectively acting as the partitioning and weighting logic in https://github.com/nci/scores/issues/142#issuecomment-2118825844.

Additionally, the integrals used in the implementation are mainly for variance calculations in the `pit_scores` outputs. They are used to make inferences on "dispersion", and are calculated using piece-wise linear approximations between partitions. Intuitively, the integrand `x * (1 - cdf(x)), x >= 0` should also be measurable, so most of https://github.com/nci/scores/issues/142#issuecomment-2118825844 should still be relevant. So there are overlapping data structures between the PIT calculations and the cdf calculations.
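A piece-wise linear (trapezoidal) approximation of that integrand might look like the following sketch; the exponential CDF is an assumed stand-in, not what the implementation uses:

```python
import numpy as np

x = np.linspace(0.0, 10.0, 1001)   # partition points
cdf = 1.0 - np.exp(-x)             # assumed example CDF (exponential)
integrand = x * (1.0 - cdf)        # the integrand discussed above, x >= 0

# Trapezoidal rule: piece-wise linear between partition points
approx = np.sum((integrand[1:] + integrand[:-1]) / 2.0 * np.diff(x))
print(approx)  # tends to the exact integral of x·e^(-x), i.e. 1, as the upper limit grows
```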
Migrate the PIT function across from Jive.