Comparing distributions

mitchelloharawild / distributional

Vectorised distributions for R

https://pkg.mitchelloharawild.com/distributional

GNU General Public License v3.0


Comparing distributions #16

Open mitchelloharawild opened 4 years ago

mitchelloharawild commented 4 years ago

Comparison is a reasonable operation to perform on a vector, so distributions need a well-defined way of being compared with numbers and with other distributions.

This is especially problematic when inverting a Box-Cox transformation with lambda < 0, which has an exception for certain values: https://github.com/tidyverts/fabletools/blob/9812804afd6602ed5ddd5dcb262bba9728c8c2de/R/box_cox.R#L28-L30

library(distributional)
fabletools::inv_box_cox(dist_normal(), -0.3)
#> Registered S3 methods overwritten by 'fabletools':
#>   method                  from          
#>   guide_geom.guide_level  distributional
#>   guide_train.level_guide distributional
#> Error: Can't compare lists with `vctrs_compare()`

Created on 2020-04-20 by the reprex package (v0.3.0)

In order to support https://github.com/tidyverts/fabletools/issues/126, we now pass the distribution itself into the transformation function (so x is some dist_normal()). This allows us to make simplifications where possible, so that dist_normal(0,1)+1 gives N(1,1) rather than t(N(0,1)), which resolves #126 and produces simpler data objects.

The issue is how to compare a distribution with a numeric. What should dist_normal() < 0 mean, and how does that compare with what the box_cox() code wants it to return? Some ideas include:

Return a transformed distribution which now has a quantile method that returns TRUE or FALSE depending on whether the base quantile is < 0 or > 0.
Return the probability P(x < 0).
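
The two ideas can be contrasted for a normal distribution using only base R. This is a rough sketch: `prob_less_than()` and `bool_quantile()` are hypothetical helper names and are not part of distributional.

```r
# Sketch with base R only; both helpers are hypothetical illustrations.

# Idea 2: collapse the comparison to a single probability, P(X < threshold).
prob_less_than <- function(threshold, mean = 0, sd = 1) {
  pnorm(threshold, mean = mean, sd = sd)
}

# Idea 1: keep a "boolean distribution" whose quantile at level p is the
# comparison applied to the base quantile at the same level.
bool_quantile <- function(p, threshold, mean = 0, sd = 1) {
  qnorm(p, mean = mean, sd = sd) < threshold
}

prob_less_than(0)              # 0.5 for a standard normal
bool_quantile(c(0.1, 0.9), 0)  # TRUE at the 10% level, FALSE at the 90% level
```

The first helper loses the link to the underlying quantile level, while the second keeps it, which is what a quantile-based transformation would need.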

Following on from this, should the choice here extend to comparing distributions? distA < distB?

mitchelloharawild commented 4 years ago

From discussions with @robjhyndman, it is reasonable to return a transformed distribution when comparing two distributions (or, as he reasonably pointed out, when comparing a distribution with a degenerate distribution).

echasnovski commented 4 years ago

Hi, @mitchelloharawild! Stumbled upon this package when browsing through GitHub suggestions.

I would like to make a self-plug and mention my pdqr package. It implements methods for working with custom distribution functions (like base R's p-, d-, q-, and r-functions). You can create and work with any discrete and continuous distributions (approximated via piecewise-linear density).

Besides many other useful things, it can compare (and even sort) distributions. The output of a comparison is a function for a "boolean" distribution, whose probability of being true is computed directly as the limit of the empirical estimate from simulations (as the sample size grows to infinity). You can read more in this help page.

Example:

library(pdqr)

# These are continuous d-functions (density functions)
d1 <- as_d(dnorm)
d2 <- as_d(dnorm, mean = 1)

# Output of comparison is a boolean pdqr-function: type "discrete" with values 0
# (FALSE) and 1 (TRUE).
d1 <= 0
#> Probability mass function of discrete type
#> Support: [0, 1] (2 elements, probability of 1: 0.5)

# You can compare distributions with one another
d1 <= d2
#> Probability mass function of discrete type
#> Support: [0, 1] (2 elements, probability of 1: ~0.76025)

# Probability can be extracted with corresponding `summ_*()` functions
summ_prob_true(d1 <= d2)
#> [1] 0.760251

Created on 2020-06-23 by the reprex package (v0.3.0)
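
As a cross-check on the number above, the probability for two normal distributions has a closed form if the two are assumed to be independent (an assumption made here for the sketch). The helper `prob_leq_normal()` is a hypothetical name and uses base R only.

```r
# Independent X ~ N(mu_x, sd_x^2) and Y ~ N(mu_y, sd_y^2):
# X - Y is normal with mean mu_x - mu_y and variance sd_x^2 + sd_y^2,
# so P(X <= Y) = pnorm((mu_y - mu_x) / sqrt(sd_x^2 + sd_y^2)).
prob_leq_normal <- function(mu_x, sd_x, mu_y, sd_y) {
  pnorm((mu_y - mu_x) / sqrt(sd_x^2 + sd_y^2))
}

p <- prob_leq_normal(mu_x = 0, sd_x = 1, mu_y = 1, sd_y = 1)
round(p, 4)  # 0.7602
```

This is close to the 0.760251 reported by `summ_prob_true(d1 <= d2)`; the small difference is consistent with pdqr working on a piecewise-linear approximation of the density.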

mitchelloharawild commented 4 years ago

Great, this is exactly what I had in mind for this functionality. Thanks for your comment!

mitchelloharawild commented 4 years ago

The last issue with supporting inv_box_cox would be the use of [. I think it is important that when dist[numeric] is used, it indexes the distribution vector.

However if dist[dist] is used, is it reasonable to return a transformed distribution for this? In a sense it is a probabilistic indexing of the vector, which is a useful interpretation when computing quantiles (as is the common use case for forecasting with inv_box_cox()).

echasnovski commented 4 years ago

If, in the construct dist1[dist2], dist2 is a single discrete distribution with values in 1:length(dist1), then the output can be interpreted as a mixture of the distributions in dist1. For example, if dist2 gives probability 0.5 to each of 1 and 2, then dist1[dist2] is a mixture of dist1[1] and dist1[2] with equal weights.

Following that, if dist2 is a vector of distributions, each of which is discrete with values in 1:length(dist1), the output dist1[dist2] can be thought of as a vector of mixtures.

If dist2 contains a continuous distribution, then I struggle to imagine what a reasonable output should look like.
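
To make the discrete case concrete, here is a minimal base R sketch of the equal-weight mixture from the example. The components N(0, 1) and N(3, 1) are assumed stand-ins for dist1[1] and dist1[2], and `mixture_cdf()` and `mixture_quantile()` are hypothetical helpers, not an existing interface.

```r
# Assumed components standing in for dist1[1] and dist1[2].
means   <- c(0, 3)
sds     <- c(1, 1)
weights <- c(0.5, 0.5)  # probabilities that dist2 assigns to indices 1 and 2

# CDF of the mixture: weighted sum of the component CDFs.
mixture_cdf <- function(q) {
  vapply(q, function(qi) sum(weights * pnorm(qi, means, sds)), numeric(1))
}

# Quantile of the mixture, found by numerically inverting the CDF.
mixture_quantile <- function(p) {
  uniroot(function(q) mixture_cdf(q) - p, interval = c(-10, 13))$root
}

mixture_cdf(1.5)       # 0.5, by symmetry around 1.5
mixture_quantile(0.5)  # approximately 1.5
```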

mitchelloharawild commented 4 years ago

Hmm, I hadn't considered indexing by a single discrete distribution. Your proposal sounds reasonable (although fairly specific to those inputs). I currently have an experimental dist_mixture() for producing mixture distributions, so an alternative interface for this isn't immediately necessary.

In the context of inv_box_cox(y, lambda!=0) and computing quantiles:

x[x > -1/lambda] <- NA
x <- x * lambda + 1
sign(x) * abs(x)^(1/lambda)

I had in mind that, when computing the quantile, x > -1/lambda would return a 'boolean' distribution. The given quantile of that distribution would be TRUE or FALSE, which then indexes the original distribution using typical vector indexing. However, this probability-based indexing isn't length stable, so perhaps it should only be defined for [<-?
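
A rough sketch of that quantile-level behaviour, using base R and a normal distribution on the transformed scale. `inv_box_cox_quantile()` is a hypothetical helper that reuses the three lines above; it is not the fabletools implementation.

```r
# Hypothetical helper: quantile of inv_box_cox(N(mean, sd), lambda) for lambda < 0.
inv_box_cox_quantile <- function(p, lambda, mean = 0, sd = 1) {
  stopifnot(lambda < 0)
  x <- qnorm(p, mean = mean, sd = sd)  # quantile on the transformed scale
  x[x > -1 / lambda] <- NA             # quantile of the 'boolean' distribution picks what to blank
  x <- x * lambda + 1
  sign(x) * abs(x)^(1 / lambda)
}

inv_box_cox_quantile(c(0.5, 0.9999), lambda = -0.3)
# 1 at the median; NA at the 99.99% level, where the base quantile exceeds -1/lambda
```

Because the logical quantile is only used on the left-hand side of an assignment, the result has the same length as `p`, which is the length-stability point raised above.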

echasnovski commented 4 years ago

To be clear, I currently struggle to see the "big picture" behind this package and others in the 'tidyverts' family. So my thoughts are about isolated questions and should be taken with a grain of salt.

About the inverse Box-Cox transformation (or any transformation, really) of a vector of distributions: I believe the objective here should always be to define the transformation of a single distribution; vectorization should then come naturally.

About this particular example: x[x > -1/lambda] <- NA. It mainly states that, in order for this transformation to be meaningful, x (talking about a single item here) can't represent values bigger than -1/lambda. If x is a single number, this is straightforward to verify. But if x is a distribution, there are at least two choices here:

If x can have values bigger than -1/lambda (translated into cdf(x, -1/lambda) < 1), it is "not eligible" for the transformation. This is the most intuitive approach for me at the moment.
x should be made to not have values bigger than -1/lambda (by trimming, winsorizing, etc.). If it only has values bigger than -1/lambda (translated into cdf(x, -1/lambda) == 0), then it is "not eligible" for the transformation.
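
Both choices can be expressed as checks on the CDF at the threshold. The sketch below uses base R with lambda = -0.3 as in the original reprex; the helper names are hypothetical, and trimming is shown as truncation from above for a standard normal only.

```r
lambda    <- -0.3
threshold <- -1 / lambda  # 10/3

# Choice 1: eligible only if no mass lies above the threshold.
eligible_strict <- function(cdf_fun) cdf_fun(threshold) >= 1

# Choice 2: trim at the threshold; ineligible only if no mass is left.
eligible_after_trim <- function(cdf_fun) cdf_fun(threshold) > 0

# Quantile of N(0, 1) truncated from above at the threshold.
trimmed_quantile <- function(p) qnorm(p * pnorm(threshold))

eligible_strict(pnorm)                       # FALSE: N(0, 1) has mass above 10/3
eligible_strict(function(q) punif(q, 0, 3))  # TRUE: U(0, 3) lies entirely below 10/3
eligible_after_trim(pnorm)                   # TRUE
trimmed_quantile(1)                          # the threshold itself, about 3.333
```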

So I guess my point here is that the concern should not be the mechanics of comparing a distribution with a number, but what that comparison represents.