Probabilistic indexing of distributions - Githubissues

mitchelloharawild / distributional

Vectorised distributions for R

https://pkg.mitchelloharawild.com/distributional

GNU General Public License v3.0

94 stars 15 forks source link

Probabilistic indexing of distributions #34

Open mitchelloharawild opened 4 years ago

mitchelloharawild commented 4 years ago

Related to (and from) #16:

The last issue with supporting inv_box_cox would be the use of [. I think it is important if dist[numeric] is used, then it should index the distribution vector.

However if dist[dist] is used, is it reasonable to return a transformed distribution for this? In a sense it is a probabilistic indexing of the vector, which is a useful interpretation when computing quantiles (as is the common use case for forecasting with inv_box_cox()).

mitchelloharawild commented 3 years ago

Also worth considering is the use of [[, discussed in #29. It currently extracts the member distribution class, which cannot be combined with other distributions classes. Perhaps it should instead retain the manager <distribution> class to and indicate a 'single value' extract, much like:

seq_len(10)[1]
#> [1] 1
seq_len(10)[[1]]
#> [1] 1
seq_len(10)[1:2]
#> [1] 1 2
seq_len(10)[[1:2]]
#> Error in seq_len(10)[[1:2]]: attempt to select more than one element in vectorIndex

^{Created on 2020-08-06 by the reprex package (v0.3.0)}

mitchelloharawild commented 3 years ago

I think using the [[.list fallback method for distributions is definitely surprising and needs fixing. This will be even more obvious with changes from #25 that will change the list structure of <distribution>.

library(distributional)
dist <- dist_normal(1:10, sigma = 1:10)
dist[1]
#> <distribution[1]>
#> [1] N(1, 1)
dist[[1]]
#> N(1, 1)
dist[1:2]
#> <distribution[2]>
#> [1] N(1, 1) N(2, 4)
dist[[1:2]]
#> [1] 1

^{Created on 2020-08-06 by the reprex package (v0.3.0)}

This will likely be a breaking change for {ggdist} (@mjskay), as the output for passing multiple values to methods like quantile(dist, c(0.025, 0.975)) will likely differ from qnorm(c(0.025, 0.975), <dist_args>) due to the vectorisation of dist.

I want to avoid how q*() vectorises across both p and <dist_args>:

qnorm(c(0.025, 0.975), mean = 0:1, sd = 1:2)
#> [1] -1.959964  4.919928

^{Created on 2020-08-06 by the reprex package (v0.3.0)}

Opinions encouraged on the appropriate output for multiple vectorisation like this. I am currently thinking of something like this:

library(distributional)
quantile(dist_normal(0:1, 1:2), c(0.025, 0.975))
#> # A tibble: 2 x 2
#>   `0.025` `0.975`
#>     <dbl>   <dbl>
#> 1   -1.96    1.96
#> 2   -2.92    4.92

^{Created on 2020-08-06 by the reprex package (v0.3.0)}

Which allows mutate() to do this:

library(distributional)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
tibble(
  dist = dist_normal(0:1, 1:2)
) %>% 
  mutate(
    quantile(dist, c(0.025, 0.975))
  )
#> # A tibble: 2 x 3
#>      dist `0.025` `0.975`
#>    <dist>   <dbl>   <dbl>
#> 1 N(0, 1)   -1.96    1.96
#> 2 N(1, 4)   -2.92    4.92

^{Created on 2020-08-06 by the reprex package (v0.3.0)}

mjskay commented 3 years ago

Ah yes, the vectorization of quantile/cdf/pdf was another thing I wanted to ask about! :)

The data frame approach is neat, though I wonder how that generalizes down to replicating the interface of base quantile() (or the other functions) when you have a distribution vector of length 1. What about a matrix form (N distributions x Q quantiles) instead of a tibble? That way with a distribution vector of length 1 it would be a 1 x Q matrix, which would be more similar in form to the length-Q vector you would get from base quantile(). Not sure what mutate does with that, but worst-case scenario you could do as.data.frame on that output...

mitchelloharawild commented 3 years ago

Internally, I think that the distribution methods (say quantile.dist_normal()) will return a matrix as you describe above. This should be more efficient than working with data frames. The quantile.distribution() method will then organise those matrices into a structure that is most useful/convenient for the user. I'm not sure what that format should be.

The other complication is what should be returned from multiple quantiles of multiple multivariate distributions. I don't have a good answer for that issue yet.

mjskay commented 3 years ago

Thinking about this a bit more, especially the multivariate question, I would suggest always returning a matrix / array instead of a tibble. Mostly because it is a consistent format you can clearly define an interface for in both the univariate and multivariate cases. Take:

q = quantile(x, probs)

Where q is an array whose shape depends on dim(probs):

if dim(probs) = 1, then dim(q) will equal dim(x) (ie q is the same shape as x) where each entry q[i, j,...] is the probs quantile of x[i, j,...]
if dim(probs) = dim(x), then again dim(q) will be equal to dim(x) (and dim(probs)), but now each entry q[i, j,...] is the probs[i, j,...] quantile of x[i, j,...]
if dim(probs) = c(dim(x), N) for some scalar N (ie probs is the same shape as x but with one extra dimension to hold more than one quantile per distribution), then dim(q) will equal dim(probs) where each entry q[i, j,..., n] is the probs[i, j,..., n] quantile of x[i, j,...]
if dim(probs) has the same shape as in the previous case except that some dimensions have length 1, broadcast those dimensions to the size of the corresponding dimension in x then follow the previous case.

This means your example above would have to yield something a bit different, but I think consistency at higher dimensions would be maintained. Vectors without dimensions assigned would be treated as column vectors, so this:

quantile(dist_normal(0:1, 1), c(0.5, 0.75))

Would yield a column vector with the 0.5 quantile of normal(0,1) and the 0.75 quantile of normal(1,1). I think this is more consistent with the vectorization you should expect in base R, and even when using tidyverse stuff. Eg if you had a column containing a different prob for each row in a table you'd want the above behavior. Take something like this:

df = tibble(dist = dist_normal(0:1, 1), plow = c(.1, .2), phigh = c(.9, .8))

Then you would expect something like this to give the 0.1 quantile of normal(0,1) and the 0.2 quantile of normal(1,1):

df %>% mutate(lower = quantile(dist, plow))

And if you wanted multiple quantiles per dist you would do something like:

df %>% mutate(qs = quantile(dist, cbind(plow, phigh))

Or:

df %>% mutate(quantile(dist, cbind(plow, phigh))

Or if you wanted to directly specify multiple quantiles (all applied to each dist) you would do:

df %>% mutate(qs = quantile(dist, cbind(0.1, 0.9))

Or:

df %>% mutate(qs = quantile(dist, t(c(0.1, 0.9)))

I think this approach provides a consistent interface in the univariate case (rows are distributions, columns in both input and output are probs/quantiles) that generalizes to the multivariate case (last dimension is always probs/quantiles).

I would caution against doing too much magic with trying to figure out the user's desired output type and just go with a consistent interface; when I designed the interface to median_qi and related functions in tidybayes I tried to make it smart about guessing output types (arrays or tibbles) based on input types and realized (years later and too late) that it mostly makes the API harder to use, not easier.