Vectorisation of distributions and operation arguments

mitchelloharawild / distributional

Vectorised distributions for R

https://pkg.mitchelloharawild.com/distributional

GNU General Public License v3.0

94 stars 15 forks source link

Vectorisation of distributions and operation arguments #52

Closed davidtedfordholt closed 2 years ago

davidtedfordholt commented 3 years ago

It would be nice to be able to pull multiple quantiles from a single distribution at the same time.

dist <- dist_sample(list(1:10))
quantile(dist, .05)
dists <- c(dist, dist_normal())
quantile(dists, .05)
quantile(dist, p = c(.05, .95))
quantile(dists, p = c(.05, .95))

My thought is that the expected output of quantile(dist, p = c(.05, .95)) would be a vector of the values, just like the output of quantile(dists, .05).

I know that it's trickier with the output of quantile(dists, p = c(.05, .95)). A list of the distributions with a vector of the quantiles makes sense to me, but that's a bigger API question.

mjskay commented 3 years ago

I had made a slightly overlong suggestion of what a general solution to this might be in a comment here: https://github.com/mitchelloharawild/distributional/issues/34#issuecomment-670360767

Happy to try to clarify. I think that it might be worth thinking about this before going with a simple solution because it might be hard to go backwards once a simpler version is released.

davidtedfordholt commented 3 years ago

I definitely understand the desire for consistency. The reality is that I'm personally leaning in the direction of consistency with stats. Which isn't to imply that it's the right call to be consistent with stats::quantile.default(), just that it is another facet of consistency.

The reality is that, if quantile.distribution() is going to change to always outputting something other than a vector, that's going to be a significant (though potentially advantageous and totally acceptable) change that will probably necessitate a lot of other changes, whether or not anything is done at the moment about allowing multiple probabilities for a single distribution.

mjskay commented 3 years ago

I largely agree with consistency with stats::quantile.default(), and I think my suggestion is mostly compatible with this: i.e. that quantile(dist_normal(0:1, 1), p = c(.6, .7)) would output a vector with the .6 quantile of normal(0,1) and the .7 quantile of normal(1,1). Where I think it would diverge from stats is in recycling behavior: personally I would largely avoid it if possible, only recycling dimensions whose length is 1. Otherwise recycling gets messy when you go to the multidimensional setting.

To summarize my suggestion: the input array of distributions (x) and the array of probabilities (p) would be broadcast to the same shape (say k dimensions), except that p could (optionally) have a k+1th dimension to allow multiple quantiles to be requested from each element of x. Dimensions of x or p of size 1 would be broadcast up so that x and p have the size for every dimension up to k.

In the case where p has an extra dimension, the result would be an array with k+1 dimensions. If p has k or fewer dimensions (instead of k+1), which implies only one quantile per element of x, you get back an array with k dimensions holding one quantile per element of x.

This generalizes the special case of (mostly) stats-like behavior, where if x and p are both vectors, they can be broadcast to the same length and then a vector of that length is returned, where each element is the quantile from the broadcasted version of p of the corresponding element in the broadcasted version of x.

I think the only thing that doesn't cover in stats behavior is recycling of vectors of length other than 1, which in my opinion is usually a mistake anyway.

davidtedfordholt commented 3 years ago

Does vctrs play nicely enough with arrays that the general shift toward vector-based operations won't be tripped up by the fact that not everything treats a one-dimensional array as perfectly equivalent to a vector?

I ask because it sounds like your idea is to always return an array, though often a one-dimensional one, and I don't know what second- and third-order effects that would have.

davidtedfordholt commented 3 years ago

Also, if the answer to the above is "it does just fine," then it seems like allowing an initial method for either multiple distributions or multiple probability values, both of which output a vector the length of whichever is not 1, matches what you would expect to get from a version that, in those cases, is producing one-dimensional arrays.

I recognize that I am ~98% practitioner, and so am probably too willing to provide a partial solution to something, as long as the API isn't expected to change when a more robust solution comes along. It seems like the main difference would be in the dimnames of the array, which provides additional flexibility which can be helpful. I can't see how it materially changes how you would access the information inside it, however, in the case of multiple probability values. It should just potentially enhance the ability to address things later on.

mjskay commented 3 years ago

Also, if the answer to the above is "it does just fine," then it seems like allowing an initial method for either multiple distributions or multiple probability values, both of which output a vector the length of whichever is not 1, matches what you would expect to get from a version that, in those cases, is producing one-dimensional arrays.

Oh yeah that makes sense, seems reasonable.

(FWIW I am just spouting my opinion about what a consistent API for this could look like, I have no idea what @mitchelloharawild thinks, who is the person you actually need to convince :) )

mitchelloharawild commented 3 years ago

Thanks for the PR and discussion! I haven't had the chance to read through (and fully process) the discussion above yet, but I'll be back to thinking about this issue mid next week. I'm currently focused on representing and working with multiple temporal granularities, in which the ideas for https://github.com/mitchelloharawild/distributional/issues/25 are being tested.

I do think that some of the recycling rules in {stats} for distributions needs improvement. Some of my first ideas for this are discussed here: https://github.com/mitchelloharawild/distributional/issues/34#issuecomment-669728038 However I agree that returning a data frame for things that could be a vector isn't nice, I wonder if a matrix would work nicely in a tibble, where a separate column is used for each requested quantile?

Regarding this behaviour:

qnorm(c(0.025, 0.975), mean = 0:1, sd = 1:2)
#> [1] -1.959964  4.919928

I don't know of a reason for which someone would want to compute two different quantiles for two different distributions. Is this a useful feature? If it is, perhaps this could be a non-default option to control the recycling rules? Whenever I've used qnorm() with multiple input distributions and quantiles it was with the distributions replicated to give multiple quantiles for each distribution.

davidtedfordholt commented 3 years ago

I definitely can't think of a time when I would want that qnorm behavior. My expectation of the output of that would be all 8 results. I would accept (though find strange) the idea of 4 results, with two quantiles each for N[0,1] and N[1,2].

mjskay commented 3 years ago

I don't know of a reason for which someone would want to compute two different quantiles for two different distributions. Is this a useful feature? If it is, perhaps this could be a non-default option to control the recycling rules? Whenever I've used qnorm() with multiple input distributions and quantiles it was with the distributions replicated to give multiple quantiles for each distribution.

Say you had a long format data frame of quantiles and distributions for the purposes of constructing something like a quantile dotplot. You might have a table like this:

df = tibble(p = rep(ppoints(20),2), dist = rep(dist_normal(1:2,1), each = 20))
df

# A tibble: 40 x 2
       p    dist
   <dbl>  <dist>
 1 0.025 N(1, 1)
 2 0.075 N(1, 1)
 3 0.125 N(1, 1)
 4 0.175 N(1, 1)
 5 0.225 N(1, 1)
 6 0.275 N(1, 1)
 7 0.325 N(1, 1)
 8 0.375 N(1, 1)
 9 0.425 N(1, 1)
10 0.475 N(1, 1)
11 0.525 N(1, 1)
12 0.575 N(1, 1)
13 0.625 N(1, 1)
14 0.675 N(1, 1)
15 0.725 N(1, 1)
16 0.775 N(1, 1)
17 0.825 N(1, 1)
18 0.875 N(1, 1)
19 0.925 N(1, 1)
20 0.975 N(1, 1)
21 0.025 N(2, 1)
22 0.075 N(2, 1)
23 0.125 N(2, 1)
24 0.175 N(2, 1)
25 0.225 N(2, 1)
26 0.275 N(2, 1)
27 0.325 N(2, 1)
28 0.375 N(2, 1)
29 0.425 N(2, 1)
30 0.475 N(2, 1)
31 0.525 N(2, 1)
32 0.575 N(2, 1)
33 0.625 N(2, 1)
34 0.675 N(2, 1)
35 0.725 N(2, 1)
36 0.775 N(2, 1)
37 0.825 N(2, 1)
38 0.875 N(2, 1)
39 0.925 N(2, 1)
40 0.975 N(2, 1)

I think a user should rightly expect this to create a q column with one quantile per row:

df %>%
  mutate(q = quantile(dist, p))

Which would also be consistent with the stats rules.

More generally I am wary of interfaces that try to anticipate what the user "wants" rather than coming up with an internally consistent (and ideally small) set of rules for determining what makes sense. APIs that try to predict what users want (changing output format depending on it) will initially seem like they work well for every use case you've thought of as the API designer --- but are brittle when users try to do something you haven't explicitly thought of. That's why I am advocating for a strictly defined output type that will generalize to the multivariate case. I don't think that tibbles will be able to do that.

davidtedfordholt commented 3 years ago

@mjskay , I'm lucky in that @mitchelloharawild has no problem setting me straight when I need it. Listening in on these sorts of discussions is how I learn. That and seeing folks re-implement things I've implemented, but better!

mitchelloharawild commented 3 years ago

I think a user should rightly expect this to create a q column with one quantile per row:
df %>%
  mutate(q = quantile(dist, p))
Which would also be consistent with the stats rules.

This is how it would be done if distributional also uses the recycling rules from {stats}.

The issue I have with this is that the distributions (and quantile probabilities) need to be repeated, despite there only being 2 distributions that are being drawn from. Additionally in the above example (and the majority of applications I can think of), the desired quantiles match for each distribution. This interface is more appealing if the requested probabilities differ for each distribution.

If this is useful (even for transitioning to {distributional}), it can be added as an argument. quantile(dist, p, expand_interaction = FALSE). The argument name needs improving, but it could be used for this situation.

More generally I am wary of interfaces that try to anticipate what the user "wants" rather than coming up with an internally consistent (and ideally small) set of rules for determining what makes sense. APIs that try to predict what users want (changing output format depending on it) will initially seem like they work well for every use case you've thought of as the API designer --- but are brittle when users try to do something you haven't explicitly thought of. That's why I am advocating for a strictly defined output type that will generalize to the multivariate case. I don't think that tibbles will be able to do that.

100% agree with the importance of a consistent interface for this. The difficulty is choosing an output which generalises to more complex situations (multivariate, multiple quantiles) whilst preserving the simplicity of common usage (univariate, single quantile).

mjskay commented 3 years ago

This is how it would be done if distributional also uses the recycling rules from {stats}.

True, but it is also how it would be done using broadcasting rules closer to the (now-defunct) {rray}, which I think are better broadcasting rules and could also be applied here. I have been applying a subset of those rules with the experimental rvar interface in {posterior} and they work pretty well. Basically the idea is to broadcast dimensions of size 1 only; any other broadcasting is an error. It is simple and consistent and covers many use cases.

100% agree with the importance of a consistent interface for this. The difficulty is choosing an output which generalises to more complex situations (multivariate, multiple quantiles) whilst preserving the simplicity of common usage (univariate, single quantile).

I am curious what you think of the interface I proposed in https://github.com/mitchelloharawild/distributional/issues/52#issuecomment-713869277 ? I think it maintains the simplicity of the univariate case: quantile(x = normal(0,1), p = c(.66, .95)) would cause x to be broadcast to the same size as p and therefore a vector of length 2 returned. Meanwhile quantile(x = normal(1:2,1), p = c(.66, .95)) would return two quantiles (the .66 for the first and the .95 for the second), but quantile(x = normal(1:2,1), p = cbind(.66, .95)) would return a 2x2 matrix, with one column the .66 quantile of each distribution and the other column the .95 quantile of each distribution. If x were

Thus both desired behaviors (quantiles along the distribution vector or multiple quantiles for each element of the distribution) are supported, it generalizes to multidimensional cases, and it supports a form of the stats behavior (which also makes it consistent with potential vectorized usage in long format data frames). And no additional arguments are needed to do it.

An alternative in a similar vein might be to make the first dimension be the dimension containing multiple quantiles (which might be more compatible with how you are imagining usage?), in which case the returned array might have an extra first dimension to hold the quantiles in the multivariate case? What I find less attractive about that is it would yield surprising (to me) behavior if you attempted to use it where x and p have the same length.

davidtedfordholt commented 3 years ago

What I find bothersome about the idea of quantile(x = normal(1:2,1), p = c(.66, .95)) is that it only seems to make sense for a small number of cases where it's incredibly easy to specify both the distributions and probabilities. If you can't express it with, at most complicated, a call to seq(), you're far more likely to end up with a data frame with a distribution column and a probability column. In that case, the output is simply another column and all is well with the world. I'm struggling to find an instance where it makes sense to use distributional, you would want to get a specific but different quantile from each distribution, they incredibly simple to specify, but you would be problematically increasing complexity to throw those two specifications into columns in a data frame.

(I apologize if I seem flippant!)

mjskay commented 3 years ago

It doesn't sound flippant, it's a hard question.

I think I am more concerned with having consistent semantics than considering every use case. I think that quantile, cdf, and density should generally return vectors in most typical use cases. This helps them be as flexible as possible in being incorporated into higher-level code. My opinion is probably biased here in that it could make my life easier as far as building {ggdist} things on top of {distributional}, but I also think it is a property that could make the API more predictable and more familiar to new users.

As an example of different quantiles with different distributions, say someone wanted to make quantile dotplots with different denominators (different numbers of dots) on different distributions. The number of dots in a quantile dotplot may have to be tuned to the task at hand and/or the distribution being displayed, so this is not an unrealistic example. One very reasonable way to do this would be to expand the distributions and the probabilities into a long-format data frame using something like base::expand.grid() or tidyr::expand() then using quantile with them. In this case, your long format data frame would have both different distributions and different probabilities, and the distributions may not even be of the same class. Certainly you could do this other ways (e.g. with list columns), but not everyone uses list columns in their typical workflows; and besides, allowing this style of vectorization would support both a list column approach and a long format data frame approach. That flexibility is a good sign: one sign of a good API is that it can be adapted easily to new situations without having to add new corner cases or exceptions.

I think the other thing to consider is not only if the semantics make sense for quantile, but also cdf and density. Whatever rule you come up with for quantile should be the same rule for those other functions, otherwise the API will not be self-consistent and will be more confusing to use. quantile and cdf, for example, should probably be inverses.

davidtedfordholt commented 3 years ago

It feels like one framing for this question is where we put the burden of added complexity of use, and perhaps it's even falling into that problematic grey area of whether the complexity comes when you're leaning base or tidy (or perhaps matrix or df-derivative). Is that fair? Since my interaction with the package ends up being almost exclusively through {fable}, that certainly colours my answers as much as {ggdist} might yours!

mjskay commented 3 years ago

Yeah, I think level of complexity is another good consideration. I guess my assumption was that quantile, cdf, density are lower-level functions for the most part. I don't see the vast majority of people as needing them, and the people who I do think need them are going to be the same people who are already familiar with their corresponding functions in base and who are wanting to use these functions in their place. But my assumptions could be incorrect.

For example, the use case of wanting multiple quantiles per distribution in tibble format seems most likely to come up when someone is trying to construct intervals, in which case hilo() is probably what they should be using anyway, not quantile().

mitchelloharawild commented 3 years ago

Finally found (stole :laughing:) some time to work on this, sorry for the wait! Thanks for the valuable discussion here, it was useful to refer back to.

Here is what I've implemented so far, some feedback is appreciated. In short, multiple inputs are provided via a named list. Multiple outputs are provided via a data frame. Use of matrices is reserved for working with multivariate distributions, of which multiple inputs to a multivariate distribution is given via a list of matrices.

These changes should improve performance for things like dist_sample(), where now it is possible to use the same kernel density for multiple inputs. I haven't yet tested the distribution methods to see if they work over multiple inputs, but if they follow the r/p/d/q standard recycling it should work.

Broadcasting rules of quantile(`<dist>`, p)

(and other functions on distributions)

library(tidyverse)
library(distributional)

# <dist> is a distribution vector of length 10
dist <- dist_normal(1:10, 1:10)

# p = 0.5:          apply p across all elements of <dist> (recycling p onto <dist>)
#                   Returns a vector of length 10
quantile(dist, 0.5)
#>  [1]  1  2  3  4  5  6  7  8  9 10

# p = c(0.5, 0.9):  Cannot recycle p (length 2) onto <dist> (length 10)
#                   Returns an error.
quantile(dist, c(0.5, 0.9))
#> Error: Cannot recycle input of size 2 to match the distributions (size 10).

# p = ppoints(10):  apply each p to each element of <dist> (no recycling)
#                   Returns a vector of length 10
quantile(dist, ppoints(10))
#>  [1] -0.5466352714 -0.0009810912  1.0337294843  2.4981529191  4.3870957806
#>  [6]  6.7354850633  9.6282323916 13.2433880419 18.0044149106 25.4663527140

# p = list(0.5):    equivalent to p = 0.5, but returns a list output
#                   Returns a tibble with 1 vector column of length 10
quantile(dist, lst(0.5))
#>    0.5
#> 1    1
#> 2    2
#> 3    3
#> 4    4
#> 5    5
#> 6    6
#> 7    7
#> 8    8
#> 9    9
#> 10  10

# p = list(c(0.5, 0.9)):
#                   Cannot recycle p[[1]] (length 2) onto <dist> (length 10)
#                   Returns an error.
quantile(dist, tibble::lst(c(0.5, 0.9)))
#> Error: Cannot recycle input of size 2 to match the distributions (size 10).

# p = list(p1, p2): equivalent to list(quantile(<dist>, p1), quantile(<dist>, p2)).
#                   Returns a tibble with 2 vector columns of length 10
#                   Names of p are used in output.
quantile(dist, lst(ppoints(10), runif(10)))
#>      ppoints(10) runif(10)
#> 1  -0.5466352714 -1.013120
#> 2  -0.0009810912  1.649167
#> 3   1.0337294843  9.149502
#> 4   2.4981529191  1.108359
#> 5   4.3870957806  5.285741
#> 6   6.7354850633 -5.824289
#> 7   9.6282323916  6.686294
#> 8  13.2433880419 14.274683
#> 9  18.0044149106 15.124354
#> 10 25.4663527140 -3.360279

Returning vectors for simple cases works nicely everywhere, and is the natural output format.

tibble(dist = dist) %>% 
  mutate(quantile(dist, 0.5))
#> # A tibble: 10 x 2
#>          dist `quantile(dist, 0.5)`
#>        <dist>                 <dbl>
#>  1    N(1, 1)                     1
#>  2    N(2, 4)                     2
#>  3    N(3, 9)                     3
#>  4   N(4, 16)                     4
#>  5   N(5, 25)                     5
#>  6   N(6, 36)                     6
#>  7   N(7, 49)                     7
#>  8   N(8, 64)                     8
#>  9   N(9, 81)                     9
#> 10 N(10, 100)                    10

Using a tibble when outputting multiple values works nicely with tibble columns and automatic unpacking in dplyr (>=1.0.0).

# Unnamed mutate() arguments are automatically unpacked
tibble(dist = dist) %>% 
  mutate(
    quantile(dist, list(q50 = 0.5, q90 = 0.9))
  )
#> # A tibble: 10 x 3
#>          dist   q50   q90
#>        <dist> <dbl> <dbl>
#>  1    N(1, 1)     1  2.28
#>  2    N(2, 4)     2  4.56
#>  3    N(3, 9)     3  6.84
#>  4   N(4, 16)     4  9.13
#>  5   N(5, 25)     5 11.4 
#>  6   N(6, 36)     6 13.7 
#>  7   N(7, 49)     7 16.0 
#>  8   N(8, 64)     8 18.3 
#>  9   N(9, 81)     9 20.5 
#> 10 N(10, 100)    10 22.8

# Named mutate() arguments produce named tibble colums
tibble(dist = dist) %>% 
  mutate(
    quantiles = quantile(dist, lst(0.5, 0.9))
  )
#> # A tibble: 10 x 2
#>          dist quantiles$`0.5` $`0.9`
#>        <dist>           <dbl>  <dbl>
#>  1    N(1, 1)               1   2.28
#>  2    N(2, 4)               2   4.56
#>  3    N(3, 9)               3   6.84
#>  4   N(4, 16)               4   9.13
#>  5   N(5, 25)               5  11.4 
#>  6   N(6, 36)               6  13.7 
#>  7   N(7, 49)               7  16.0 
#>  8   N(8, 64)               8  18.3 
#>  9   N(9, 81)               9  20.5 
#> 10 N(10, 100)              10  22.8

Putting it all together with other functions one might use with distributions

tibble(dist = dist) %>% 
  mutate(
    quantiles = quantile(dist, lst(0.5, 0.9, ppoints(10))),
    cdfs = cdf(dist, lst(0.5, 0.9)),
    densities = density(dist, lst(0, 1)),
    mean(dist),
    variance(dist),
    hilo = hilo(dist, lst(50, 95))
  )
#> # A tibble: 10 x 7
#>          dist quantiles$`0.5` $`0.9` $`ppoints(10)` cdfs$`0.5` $`0.9`
#>        <dist>           <dbl>  <dbl>          <dbl>      <dbl>  <dbl>
#>  1    N(1, 1)               1   2.28         -0.547      0.309  0.460
#>  2    N(2, 4)               2   4.56         -1.09       0.227  0.291
#>  3    N(3, 9)               3   6.84         -1.64       0.202  0.242
#>  4   N(4, 16)               4   9.13         -2.19       0.191  0.219
#>  5   N(5, 25)               5  11.4          -2.73       0.184  0.206
#>  6   N(6, 36)               6  13.7          -3.28       0.180  0.198
#>  7   N(7, 49)               7  16.0          -3.83       0.177  0.192
#>  8   N(8, 64)               8  18.3          -4.37       0.174  0.187
#>  9   N(9, 81)               9  20.5          -4.92       0.172  0.184
#> 10 N(10, 100)              10  22.8          -5.47       0.171  0.181
#> # … with 4 more variables: densities <df[,2]>, mean(dist) <dbl>,
#> #   variance(dist) <dbl>, hilo <df[,2]>

Note that using list() for these input arguments conflicts with the previous interface for multivariate distributions. I do however think that inputs to multivariate distributions are better provided with a matrix.

Inputting and outputting matrices helps to disambiguate between multivariate and multiple input results.

dist <- dist_multinomial(size = 4, prob = list(c(0.3, 0.5, 0.2)))
dimnames(dist) <- c("a", "b", "c")
density(dist, cbind(1, 2, 1))
#> [1] 0.18
density(dist, list(caseA=cbind(1, 2, 1), caseB=cbind(2, 1, 1)))
#>   caseA caseB
#> 1  0.18 0.108
density(dist, cbind(c(1,2), c(2,1), c(1,1)))
#> Error: Cannot recycle input of size 2 to match the distributions (size 1).
mean(dist)
#>        a b   c
#> [1,] 1.2 2 0.8

This can also work well in a tibble format

dist <- dist_multivariate_normal(mu = list(c(1,2), c(3,5)), 
                                 sigma = list(matrix(c(4,2,2,3), ncol=2),
                                              matrix(c(5,1,1,4), ncol=2)))
dimnames(dist) <- c("a", "b")
tibble(dist) %>% 
  mutate(
    q50 = quantile(dist, 0.5),
    qmix = quantile(dist, lst(0.5, varied = c(0.8, 0.3))),
    density(dist, cbind(2, 3)),
    mean(dist),
    variance(dist)
  )
#> # A tibble: 2 x 6
#>     dist q50[,"a"] [,"b"] qmix$`0.5`[,"a"] qmix$[,"b"] $varied[,"a"] $[,"b"]
#>   <dist>     <dbl>  <dbl>            <dbl>       <dbl>         <dbl>   <dbl>
#> 1 MVN[2]         1      2                1           2          2.68    3.46
#> 2 MVN[2]         3      5                3           5          4.88    6.68
#> # … with 3 more variables: density(dist, cbind(2, 3)) <dbl>,
#> #   mean(dist) <dbl[,2]>, variance(dist) <list>

^{Created on 2021-07-26 by the reprex package (v2.0.0)}

mjskay commented 3 years ago

Very cool! nice generalization that will work for my purposes anyway :)

davidtedfordholt commented 3 years ago

Certainly meets my original request! I got momentarily bored and did some speed testing on the distribution creation using

tibble(dist = dist_normal(1:10000, 1:10000)) %>% 
    mutate(mu = mean(dist), sigma = sqrt(variance(dist))) %>%
    mutate(dist = dist_normal(mu, sigma)) %>% 
    mutate(mu = mean(dist), sigma = sqrt(variance(dist))) %>%
    mutate(dist = dist_normal(mu, sigma)) %>%  ...

Nothing meaningful there, but got me wondering if there is a compelling reason not to have a std_dev() or the like (when appropriate). I assume it's bad form to just create a generic sd() that defaults to stats::sd(), though that would certainly be my first thought. (feel free to ignore and close, as that's no big deal!)

mjskay commented 2 years ago

Thanks again for this. Sorry, I'm just getting around to taking a closer look now and finding some issues with my typical use case, which is generating a vector of a lots of input points for a single distribution. I'm wondering if I'm missing the "right" way to do this or if the interface could be extended to support this use case more effectively.

Basically the question is: if I have a single distribution and I want a large number of either quantiles, density values, or cdf values from it as a vector, what is the best way to do that?

Neither of the solutions I have come up with feel satisfactory. They were:

1. Convert the input quantiles to a list

The simplest form is something like this, though it generates a bunch of useless names:

quantile(x, as.list(ppoints(100)))
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
##        ...1      ...2      ...3      ...4      ...5      ...6      ...7      ...8
## 1 -4.151659 -3.340181 -2.919928 -2.623821 -2.390795 -2.196386 -2.028204 -1.879063
##        ...9     ...10     ...11     ...12     ...13     ...14     ...15     ...16
## 1 -1.744408 -1.621158 -1.507131 -1.400718 -1.300699 -1.206125 -1.116243 -1.030444
##        ...17      ...18      ...19      ...20      ...21      ...22      ...23
## 1 -0.9482278 -0.8691786 -0.7929467 -0.7192347 -0.6477873 -0.5783833 -0.5108301
##        ...24      ...25      ...26     ...27      ...28     ...29       ...30
## 1 -0.4449581 -0.3806176 -0.3176754 -0.256012 -0.1955203 -0.136103 -0.07767206
##         ...31     ...32      ...33    ...34     ...35     ...36     ...37     ...38
## 1 -0.02014691 0.0365463 0.09247562 0.147704 0.2022899 0.2562878 0.3097489 0.3627213
##       ...39     ...40     ...41     ...42     ...43    ...44     ...45     ...46
## 1 0.4152502 0.4673788 0.5191479 0.5705969 0.6217631 0.672683 0.7233916 0.7739229
##       ...47     ...48     ...49     ...50    ...51    ...52    ...53   ...54    ...55
## 1 0.8243103 0.8745864 0.9247834 0.9749331 1.025067 1.075217 1.125414 1.17569 1.226077
##      ...56    ...57    ...58    ...59    ...60    ...61   ...62    ...63    ...64
## 1 1.276608 1.327317 1.378237 1.429403 1.480852 1.532621 1.58475 1.637279 1.690251
##      ...65   ...66    ...67    ...68    ...69    ...70    ...71    ...72   ...73
## 1 1.743712 1.79771 1.852296 1.907524 1.963454 2.020147 2.077672 2.136103 2.19552
##      ...74    ...75    ...76    ...77   ...78    ...79    ...80    ...81    ...82
## 1 2.256012 2.317675 2.380618 2.444958 2.51083 2.578383 2.647787 2.719235 2.792947
##      ...83    ...84    ...85    ...86    ...87    ...88    ...89    ...90    ...91
## 1 2.869179 2.948228 3.030444 3.116243 3.206125 3.300699 3.400718 3.507131 3.621158
##      ...92    ...93    ...94    ...95    ...96    ...97    ...98    ...99   ...100
## 1 3.744408 3.879063 4.028204 4.196386 4.390795 4.623821 4.919928 5.340181 6.151659

I could set the names of the input list of quantiles to something more sensible on the way in, but that's kind of a pain, especially for what I think should be a relatively simple operation (also because I would just be setting names to avoid the warning only to discard them again later, as I really just need a vector output here).

This is also a few orders of magnitude slower than just calling the corresponding quantile function (not sure why, maybe the internal code doesn't take advantage of the underlying function's vectorization?):

bench::mark(quantile(x, as.list(ppoints(100))))
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
## # A tibble: 1 x 13
##   expression                             min  median `itr/sec` mem_alloc `gc/sec` n_itr
##   <bch:expr>                         <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int>
## 1 quantile(x, as.list(ppoints(100)))  2.54ms  2.73ms      351.     118KB     4.15   169
## # ... with 6 more variables: n_gc <dbl>, total_time <bch:tm>, result <list>,
## #   memory <list>, time <list>, gc <list>

Compare that to this:

bench::mark(qnorm(ppoints(100), 1, 4))
## # A tibble: 1 x 13
##   expression                     min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
##   <bch:expr>                <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
## 1 qnorm(ppoints(100), 1, 4)    5.1us      6us   154509.    2.09KB        0 10000     0
## # ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>,
## #   time <list>, gc <list>

If we're then talking visualizing a bunch of distributions each with several hundred quantiles, or density values, or cdf values, a ~500x slowdown is not great.

2. Manually replicate the distribution vector out to the length of the desired quantiles

Something like this:

quantile(rep(x, 100), ppoints(100))
##   [1] -4.15165861 -3.34018076 -2.91992797 -2.62382135 -2.39079542 -2.19638628
##   [7] -2.02820378 -1.87906294 -1.74440762 -1.62115822 -1.50713088 -1.40071772
##  [13] -1.30069876 -1.20612511 -1.11624324 -1.03044407 -0.94822775 -0.86917858
##  [19] -0.79294673 -0.71923473 -0.64778726 -0.57838331 -0.51083005 -0.44495810
##  [25] -0.38061765 -0.31767539 -0.25601203 -0.19552025 -0.13610300 -0.07767206
##  [31] -0.02014691  0.03654630  0.09247562  0.14770398  0.20228987  0.25628782
##  [37]  0.30974894  0.36272127  0.41525021  0.46737877  0.51914794  0.57059686
##  [43]  0.62176315  0.67268303  0.72339158  0.77392292  0.82431032  0.87458644
##  [49]  0.92478342  0.97493306  1.02506694  1.07521658  1.12541356  1.17568968
##  [55]  1.22607708  1.27660842  1.32731697  1.37823685  1.42940314  1.48085206
##  [61]  1.53262123  1.58474979  1.63727873  1.69025106  1.74371218  1.79771013
##  [67]  1.85229602  1.90752438  1.96345370  2.02014691  2.07767206  2.13610300
##  [73]  2.19552025  2.25601203  2.31767539  2.38061765  2.44495810  2.51083005
##  [79]  2.57838331  2.64778726  2.71923473  2.79294673  2.86917858  2.94822775
##  [85]  3.03044407  3.11624324  3.20612511  3.30069876  3.40071772  3.50713088
##  [91]  3.62115822  3.74440762  3.87906294  4.02820378  4.19638628  4.39079542
##  [97]  4.62382135  4.91992797  5.34018076  6.15165861

This works, gives me a vector, and is faster (on analytical distributions) than the previous attempt (though still ~100x slower than calling the base function directly):

bench::mark(quantile(rep(x, 100), ppoints(100)))
## # A tibble: 1 x 13
##   expression                              min median `itr/sec` mem_alloc `gc/sec` n_itr
##   <bch:expr>                          <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int>
## 1 quantile(rep(x, 100), ppoints(100))   664us  722us     1318.    10.9KB     8.50   620
## # ... with 6 more variables: n_gc <dbl>, total_time <bch:tm>, result <list>,
## #   memory <list>, time <list>, gc <list>

But there are some issues, particularly with sample-based distributions (dist_sample()). For these, it is slower and requires more memory:

x = dist_sample(list(qnorm(ppoints(1000))))

bench::mark(quantile(rep(x, 100), ppoints(100)))
## # A tibble: 1 x 13
##   expression                              min median `itr/sec` mem_alloc `gc/sec` n_itr
##   <bch:expr>                          <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int>
## 1 quantile(rep(x, 100), ppoints(100))  10.5ms 11.2ms      88.2    3.48MB     15.1    35
## # ... with 6 more variables: n_gc <dbl>, total_time <bch:tm>, result <list>,
## #   memory <list>, time <list>, gc <list>

Compared to using quantile() directly on the underlying sample, we're talking a 100x slowdown in this instance:

bench::mark(quantile(qnorm(ppoints(1000)), ppoints(100)))
## # A tibble: 1 x 13
##   expression                                        min   median `itr/sec` mem_alloc
##   <bch:expr>                                   <bch:tm> <bch:tm>     <dbl> <bch:byt>
## 1 quantile(qnorm(ppoints(1000)), ppoints(100))    198us    211us     4397.    63.2KB
## # ... with 8 more variables: gc/sec <dbl>, n_itr <int>, n_gc <dbl>,
## #   total_time <bch:tm>, result <list>, memory <list>, time <list>, gc <list>

Also, while solution 1 also works with the density() and cdf() functions, it does not seem to work with density() on dist_sample() objects specifically (possibly this is just a bug in density.dist_sample() when given only one point?):

x = dist_sample(list(qnorm(ppoints(1000))))
density(rep(x, 100), ppoints(100))
## Error in stats::approx(d$x, d$y, xout = at) : 
##   need at least two non-NA values to interpolate
## In addition: Warning message:
## In regularize.values(x, y, ties, missing(ties), na.rm = na.rm) :
##   collapsing to unique 'x' values

A suggested solution

As a user, I would have expected something like this to work:

x = dist_normal(1, 4)
quantile(x, ppoints(100))
## Error: Cannot recycle input of size 100 to match the distributions (size 1).
## Run `rlang::last_error()` to see where the error occurred.

I think this would be a natural generalization of the broadcasting you are already doing --- when the input distribution vector has size 1, it makes sense to "broadcast" it to length of the quantile vector (I put "broadcast" in quotes because the underlying implementation should probably not actually broadcast it for efficiency's sake, but the output would be as if it were broadcast). Then the output would be a vector the same length as the input quantiles.

mitchelloharawild commented 2 years ago

Thanks for the comment, I can see this being a problem for ggdist.

I believe (without yet profiling), the performance issues are a result of tibble casting and mapping over distributions (#25). There was also an issue I've now fixed specific to how I vectorised quantile(<dist_sample>) - 8ca5c9f285c6fb6e08ea3b8ece74abcc3a59fc65 I've also fixed the issue you pointed out with density(<dist_sample>, at = <numeric[1]>) - 7495d9c877f8e4e0a36fe5075dd0ef6da1661590

I also think that your broadcasting suggestion is reasonable, but is tangential to the issue. I think the main issue here is that a more efficient/compact output is needed for when large numbers of quantiles/stats are computed from a distribution. The wide format is nice for exploration and working with distributions in tibbles, but for programming it is likely nicer to return a list column. I'll think up some potential interfaces for this, one that comes to mind is how {dplyr} expands rows when quantile() is used and returns multiple values.

mitchelloharawild commented 2 years ago

After a bit of thought, this is where I'm stuck with a decision I'd like some feedback on.

Current behaviour

The current recycling of density(x, ppoints(100)) or equivalently, density(x, list(ppoints(100))) is that each value in ppoints(100) is applied to each element of x.

So density(x[1], ppoints(100)[1]), density(x[2], ppoints(100)[2]), ...

This is similar to how stats::ddist() functions work (which isn't necessarily good).

Hence, density(<dist[1]>, at = <double[100]>) gives the error:

Error: Cannot recycle input of size 100 to match the distributions (size 1)

As distributions are never automatically recycled.

Alternate behaviour

The alternative which would allow your suggestion to work would be not spreading the input parameter across multiple distributions. For the same suggestion of density(x, ppoints(100)), you would get: density(x[1], ppoints(100)), density(x[2], ppoints(100), ...

and the output would likely be a list column of numerics, list(<numeric[100]>, <numeric[100]>, ...) of the same length as the number of distributions.

If the change is made, which I think is a nicer use of recycling making for a friendlier i/o, then how would you compute statistics at different points across several distributions? Or is this usage rarely wanted, and we require the user to work harder for it. One approach of obtaining the current behaviour with the alternate recycling suggestion would be something like map2_dbl(dist, at, quantile).

mjskay commented 2 years ago

Ah sorry, I wasn't clear - that's not the change I was suggesting. What I am suggesting should not change the behavior of any cases already supported.

If I understand your current approach correctly, in pseudo-code it would be something like this (where vec_size on data frames is in the vctrs sense: it counts rows):

Given quantile(x, probs) with probs a numeric, a list, or a data frame...

If probs is a list convert it to a data frame
If vec_size(probs) < vec_size(x) broadcast probs to the size of x
If vec_size(x) < vec_size(probs) throw an error
Compute quantiles in parallel along x and the rows of probs, returning an object the same shape as probs (ie a numeric or a data frame)

My suggestion is to replace step 3 with "if vec_size(x) < vec_size(probs) broadcast x to the size of probs". At a very high level this makes the full algorithm be "broadcast x and probs to a common size then compute quantiles", and merely makes what currently returns an error in your algorithm do something else without changing other behavior.

mitchelloharawild commented 2 years ago

The current behaviour is not yet on CRAN, so happy to change it for the better (especially if it is still possible to achieve the current results with other not-too-bad code).

No special handling of list/df is used, however from a design perspective with the output structure, yes.
Almost, it follows vctrs recycling rules (https://vctrs.r-lib.org/reference/vec_recycle.html#recycling-rules). So probs<2> won't recycle to dist<3>.
Yes. The distribution is not recycled (one-sided recycling rules). Similar to what you mention, perhaps this could be recycled if vec_size(dist)==1. However I think this isn't the solution we're looking for as it doesn't fix vec_size(dist)>1 which is the core use-case of the package.
Essentially, yes.

I don't think recycling/broadcasting the distribution to match probs fixes this issue nicely, especially when considering rectangular input/output. Having the output size always match the distribution size is nice for ease of analysis.

mjskay commented 2 years ago

However I think this isn't the solution we're looking for as it doesn't fix vec_size(dist)>1 which is the core use-case of the package.

But this is already handled in step 2, no? When vec_size(dist) > 1 the only valid sizes for probs must be 1 or vec_size(dist) under vctrs recycling rules, which I agree are sensible.

I think I'm not clear on your objection to my proposed modification to step 3, since it is completely compatible with your proposal and the sets of inputs the two cover are disjoint. My other two pitches for it would be it is a further generalization of your existing proposal (two way broadcasting instead of one way) and from a user perspective coming from the base API it is an expected behavior (basically, it's weird that there's no easy equivalent to qnorm(ppoints(100), 0, 1) or quantile(x, ppoints(100)) in this API).

mitchelloharawild commented 2 years ago

But this is already handled in step 2, no? When vec_size(dist) > 1 the only valid sizes for probs must be 1 or vec_size(dist) under vctrs recycling rules, which I agree are sensible.

Yes, it is covered by the recycling rules in 2. However quantile(<dist[3]>, <numeric[100]>) would not be able to recycle and should error. I don't see how adding recycling for <dist> of length 1 would make it easy to do what you want.

basically, it's weird that there's no easy equivalent to qnorm(ppoints(100), 0, 1) or quantile(x, ppoints(100)) in this API

Completely agree and this needs to be fixed. Not only that, but I think this is a common use-case and should be easy to do with this package's interface (at least more commonly used than qnorm(0:1, 0:1, 1) giving c(qnorm(0,0,1), qnorm(1,1,1)). I think the change proposed in 'Alternate behaviour' fixes this, by instead of recycling inputs, they are mapped to each distribution. So quantile(dist_normal(0:1, 1), 0:1) will give list(qnorm(0:1, 0, 1), qnorm(0:1, 1, 1).

mjskay commented 2 years ago

However quantile(<dist[3]>, <numeric[100]>) would not be able to recycle and should error.

Ah I misunderstood - I was assuming it makes sense for this case to throw an error, and you wanted to come up with a way for it not to.

I don't see how adding recycling for of length 1 would make it easy to do what you want.

It solves my problem in ggdist because (for various reasons to do with its internals that are very unlikely to change) the typical way stuff is called is many probs on one distribution at a time, like the examples I gave in https://github.com/mitchelloharawild/distributional/issues/52#issuecomment-939673498 . I don't need many dists x many probs so I wasn't really thinking about that.

However, I see how what you're proposing as an alternative solution would work while dropping support for quantiles varying over distributions (which seems perhaps unlikely to be used). This would certainly work for my use case as well. In fact an approach fairly similar to this (but in matrix form) is what I went for with posterior::rvar (https://mc-stan.org/posterior/reference/rvar-dist.html). The approach there puts the length of probs first (which makes sense for rvars because that's where the draws dimension is but doesn't make sense in distributional), and a matrix format probably wouldn't work here because different elements of the input vector could have different dims. So I see the value of the list-of-numeric approach you propose.

mitchelloharawild commented 2 years ago

It solves my problem in ggdist because (for various reasons to do with its internals that are very unlikely to change) the typical way stuff is called is many probs on one distribution at a time, like the examples I gave in #52 (comment) . I don't need many dists x many probs so I wasn't really thinking about that.

I figured this is how you were planning to use dist recycling. I think the need for this also applies to end users, and they shouldn't need to map over the distributions like you'll be doing in ggdist.

However, I see how what you're proposing as an alternative solution would work while dropping support for quantiles varying over distributions (which seems perhaps unlikely to be used). This would certainly work for my use case as well. In fact an approach fairly similar to this (but in matrix form) is what I went for with posterior::rvar (https://mc-stan.org/posterior/reference/rvar-dist.html). The approach there puts the length of probs first (which makes sense for rvars because that's where the draws dimension is but doesn't make sense in distributional), and a matrix format probably wouldn't work here because different elements of the input vector could have different dims. So I see the value of the list-of-numeric approach you propose.

I'll look into what you've done there, and implement what I'm describing above to play around with more concrete examples.

mitchelloharawild commented 2 years ago

I think I've come up with a good compromise. I like how you described that if the argument is a data frame, then the output is a data frame with 1:1 mapping of the distribution and parameter. So I've kept this behaviour, which also allows you to vary the argument across distributions using the data frame input. This data frame is recycled as before, so if you want to compare a few quantiles in wide format, you can do it with tibble(0.5, 0.9).

If the input is not a data frame, the parameter is mapped across each distribution. The only case that changes here is fn(<dist[m]>, <numeric[m]>): Previously it would compute: fn(<dist>[1], <numeric>[1]), fn(<dist>[2], <numeric>[2]), ... Now it computes: fn(<dist>[1], <numeric[m]>), fn(<dist>[2], <numeric[m]>), ...

I think this makes the computing many positions on densities/quantiles/etc nicer. Let me know if this looks alright to you.

mitchelloharawild commented 2 years ago

library(tidyverse)
library(distributional)

# <dist> is a distribution vector of length 10
dist <- dist_normal(1:10, 1:10)

# p = 0.5:          apply p across all elements of <dist> (recycling p onto <dist>)
#                   Returns a vector of length 10
quantile(dist, 0.5)
#>  [1]  1  2  3  4  5  6  7  8  9 10

# p = c(0.5, 0.9):  maps p (length 2) onto each <dist> (length 10)
quantile(dist, c(0.5, 0.9))
#> [[1]]
#> [1] 1.000000 2.281552
#> 
#> [[2]]
#> [1] 2.000000 4.563103
#> 
#> [[3]]
#> [1] 3.000000 6.844655
#> 
#> [[4]]
#> [1] 4.000000 9.126206
#> 
#> [[5]]
#> [1]  5.00000 11.40776
#> 
#> [[6]]
#> [1]  6.00000 13.68931
#> 
#> [[7]]
#> [1]  7.00000 15.97086
#> 
#> [[8]]
#> [1]  8.00000 18.25241
#> 
#> [[9]]
#> [1]  9.00000 20.53396
#> 
#> [[10]]
#> [1] 10.00000 22.81552

# p = ppoints(10):  maps p to each element of <dist>
quantile(dist, ppoints(10))
#> [[1]]
#>  [1] -0.5466352714 -0.0004905456  0.3445764948  0.6245382298  0.8774191561
#>  [6]  1.1225808439  1.3754617702  1.6554235052  2.0004905456  2.5466352714
#> 
#> [[2]]
#>  [1] -1.0932705428 -0.0009810912  0.6891529895  1.2490764595  1.7548383122
#>  [6]  2.2451616878  2.7509235405  3.3108470105  4.0009810912  5.0932705428
#> 
#> [[3]]
#>  [1] -1.639905814 -0.001471637  1.033729484  1.873614689  2.632257468
#>  [6]  3.367742532  4.126385311  4.966270516  6.001471637  7.639905814
#> 
#> [[4]]
#>  [1] -2.186541086 -0.001962182  1.378305979  2.498152919  3.509676624
#>  [6]  4.490323376  5.501847081  6.621694021  8.001962182 10.186541086
#> 
#> [[5]]
#>  [1] -2.733176357 -0.002452728  1.722882474  3.122691149  4.387095781
#>  [6]  5.612904219  6.877308851  8.277117526 10.002452728 12.733176357
#> 
#> [[6]]
#>  [1] -3.279811628 -0.002943274  2.067458969  3.747229379  5.264514937
#>  [6]  6.735485063  8.252770621  9.932541031 12.002943274 15.279811628
#> 
#> [[7]]
#>  [1] -3.826446900 -0.003433819  2.412035463  4.371767608  6.141934093
#>  [6]  7.858065907  9.628232392 11.587964537 14.003433819 17.826446900
#> 
#> [[8]]
#>  [1] -4.373082171 -0.003924365  2.756611958  4.996305838  7.019353249
#>  [6]  8.980646751 11.003694162 13.243388042 16.003924365 20.373082171
#> 
#> [[9]]
#>  [1] -4.919717443 -0.004414911  3.101188453  5.620844068  7.896772405
#>  [6] 10.103227595 12.379155932 14.898811547 18.004414911 22.919717443
#> 
#> [[10]]
#>  [1] -5.466352714 -0.004905456  3.445764948  6.245382298  8.774191561
#>  [6] 11.225808439 13.754617702 16.554235052 20.004905456 25.466352714

# p = tibble(0.5):  equivalent to p = 0.5, but returns a data frame output
#                   Returns a tibble with 1 vector column of length 10
quantile(dist, tibble(0.5))
#>    0.5
#> 1    1
#> 2    2
#> 3    3
#> 4    4
#> 5    5
#> 6    6
#> 7    7
#> 8    8
#> 9    9
#> 10  10

# p = list(c(0.5, 0.9)):
#                   Cannot recycle p (size 2) onto <dist> (length 10)
#                   Returns an error.
quantile(dist, tibble::tibble(c(0.5, 0.9)))
#> Error: Cannot recycle input of size 2 to match the distributions (size 10).

# p = list(p1, p2): 1:1 calculation of inputs to distributions.
#                   Returns a tibble with 2 vector columns and 10 rows, with 
#                   values corresponding to the parameter in the input tibble
#                   Names of p are used in output.
quantile(dist, tibble(ppoints(10), runif(10)))
#>      ppoints(10)  runif(10)
#> 1  -0.5466352714  0.5003449
#> 2  -0.0009810912  3.5584573
#> 3   1.0337294843  4.1129117
#> 4   2.4981529191  4.4268746
#> 5   4.3870957806  6.7826469
#> 6   6.7354850633 -2.5078333
#> 7   9.6282323916  4.3891666
#> 8  13.2433880419 14.9857666
#> 9  18.0044149106 -5.3955708
#> 10 25.4663527140 15.3750286

In a data frame workflow

tibble(dist = dist) %>% 
  mutate(quantile(dist, 0.5))
#> # A tibble: 10 × 2
#>          dist `quantile(dist, 0.5)`
#>        <dist>                 <dbl>
#>  1    N(1, 1)                     1
#>  2    N(2, 4)                     2
#>  3    N(3, 9)                     3
#>  4   N(4, 16)                     4
#>  5   N(5, 25)                     5
#>  6   N(6, 36)                     6
#>  7   N(7, 49)                     7
#>  8   N(8, 64)                     8
#>  9   N(9, 81)                     9
#> 10 N(10, 100)                    10

# Unnamed mutate() arguments are automatically unpacked
tibble(dist = dist) %>% 
  mutate(
    quantile(dist, tibble(q50 = 0.5, q90 = 0.9))
  )
#> # A tibble: 10 × 3
#>          dist   q50   q90
#>        <dist> <dbl> <dbl>
#>  1    N(1, 1)     1  2.28
#>  2    N(2, 4)     2  4.56
#>  3    N(3, 9)     3  6.84
#>  4   N(4, 16)     4  9.13
#>  5   N(5, 25)     5 11.4 
#>  6   N(6, 36)     6 13.7 
#>  7   N(7, 49)     7 16.0 
#>  8   N(8, 64)     8 18.3 
#>  9   N(9, 81)     9 20.5 
#> 10 N(10, 100)    10 22.8

# Named mutate() arguments produce named tibble colums
tibble(dist = dist) %>% 
  mutate(
    quantiles = quantile(dist, tibble(0.5, 0.9))
  )
#> # A tibble: 10 × 2
#>          dist quantiles$`0.5` $`0.9`
#>        <dist>           <dbl>  <dbl>
#>  1    N(1, 1)               1   2.28
#>  2    N(2, 4)               2   4.56
#>  3    N(3, 9)               3   6.84
#>  4   N(4, 16)               4   9.13
#>  5   N(5, 25)               5  11.4 
#>  6   N(6, 36)               6  13.7 
#>  7   N(7, 49)               7  16.0 
#>  8   N(8, 64)               8  18.3 
#>  9   N(9, 81)               9  20.5 
#> 10 N(10, 100)              10  22.8

tibble(dist = dist) %>% 
  mutate(
    quantiles_lst = quantile(dist, c(0.5, 0.9)),
    quantiles_tbl = quantile(dist, tibble(0.5, 0.9, ppoints(10))),
    cdfs = cdf(dist, tibble(0.5, 0.9)),
    densities = density(dist, tibble(0, 1)),
    mean(dist),
    variance(dist),
    hilo = hilo(dist, tibble(50, 95))
  )
#> # A tibble: 10 × 8
#>          dist quantiles_lst quantiles_tbl$`0.5` $`0.9` $`ppoints(10)` cdfs$`0.5`
#>        <dist> <list>                      <dbl>  <dbl>          <dbl>      <dbl>
#>  1    N(1, 1) <dbl [2]>                       1   2.28      -0.547         0.309
#>  2    N(2, 4) <dbl [2]>                       2   4.56      -0.000981      0.227
#>  3    N(3, 9) <dbl [2]>                       3   6.84       1.03          0.202
#>  4   N(4, 16) <dbl [2]>                       4   9.13       2.50          0.191
#>  5   N(5, 25) <dbl [2]>                       5  11.4        4.39          0.184
#>  6   N(6, 36) <dbl [2]>                       6  13.7        6.74          0.180
#>  7   N(7, 49) <dbl [2]>                       7  16.0        9.63          0.177
#>  8   N(8, 64) <dbl [2]>                       8  18.3       13.2           0.174
#>  9   N(9, 81) <dbl [2]>                       9  20.5       18.0           0.172
#> 10 N(10, 100) <dbl [2]>                      10  22.8       25.5           0.171
#> # … with 4 more variables: densities <df[,2]>, mean(dist) <dbl>,
#> #   variance(dist) <dbl>, hilo <df[,2]>

More complex input parameters (multivariate)

dist <- dist_multinomial(size = 4, prob = list(c(0.3, 0.5, 0.2)))
dimnames(dist) <- c("a", "b", "c")
density(dist, cbind(1, 2, 1))
#> [1] 0.18
density(dist, list(caseA=cbind(1, 2, 1), caseB=cbind(2, 1, 1)))
#>   caseA caseB
#> 1  0.18 0.108
density(dist, cbind(c(1,2), c(2,1), c(1,1)))
#> Error in FUN(X[[i]], ...): attempt to set 'colnames' on an object with less than two dimensions
mean(dist)
#>        a b   c
#> [1,] 1.2 2 0.8

dist <- dist_multivariate_normal(mu = list(c(1,2), c(3,5)), 
                                 sigma = list(matrix(c(4,2,2,3), ncol=2),
                                              matrix(c(5,1,1,4), ncol=2)))
dimnames(dist) <- c("a", "b")
tibble(dist) %>% 
  mutate(
    q50 = quantile(dist, c(0.5)),
    q5080 = quantile(dist, c(0.5, 0.8)),
    qmix = quantile(dist, tibble(0.5, varied = c(0.8, 0.3))),
    density(dist, cbind(2, 3)),
    mean(dist),
    variance(dist)
  )
#> # A tibble: 2 × 7
#>     dist q50[,"a"] [,"b"] q5080         qmix$`0.5`[,"a"] qmix$[,"b"] $varied[,"a"]
#>   <dist>     <dbl>  <dbl> <list>                   <dbl>       <dbl>         <dbl>
#> 1 MVN[2]         1      2 <dbl [2 × 2]>                1           2          2.68
#> 2 MVN[2]         3      5 <dbl [2 × 2]>                3           5          1.83
#> # … with 3 more variables: density(dist, cbind(2, 3)) <dbl>,
#> #   mean(dist) <dbl[,2]>, variance(dist) <list>

^{Created on 2021-10-11 by the reprex package (v2.0.0)}

mjskay commented 2 years ago

I like this, this is a great compromise! Nicely separates the two workflows. I could see extending posterior::rvar to support data frame input in the same way too.

Thanks for working this out and being so accommodating!

mjskay commented 2 years ago

FYI I've updated the github version of {ggdist} to support this in a future-proof way so that the CRAN and github versions of {distributional} should both work with the github version of {ggdist} now.

If you like I can submit ggdist to CRAN sometime this week so you don't run into revdep problems when you submit, unless there are other changes you think I should wait on to test?

mitchelloharawild commented 2 years ago

Great! Let's wait for a bit because I'd like to migrate the graphics in fable to use ggdist for this release, and it'll be a good test for the changes.

mjskay commented 2 years ago

Sounds good, I'm in no rush.

njtierney commented 2 years ago

Just wanted to chime in to say I really like this interface!

mitchelloharawild commented 2 years ago

Closing as this has been added.

mitchelloharawild / distributional

Vectorisation of distributions and operation arguments #52

Broadcasting rules of quantile(<dist>, p)

1. Convert the input quantiles to a list

2. Manually replicate the distribution vector out to the length of the desired quantiles

A suggested solution

Current behaviour

Alternate behaviour

In a data frame workflow

More complex input parameters (multivariate)

Broadcasting rules of quantile(`<dist>`, p)