Consider ggdist integration for distribution visualisation

mitchelloharawild commented 4 years ago

{ggdist} provides an impressive variety of plot styles for visualising distributions, and in many cases produces similar / better results to what I plan on adding for distributional (#6, #18). I also like having the graphics for distributions in a separate package to reduce bloat in this package..

Currently (from brief observation), {ggdist} accepts character vectors to identify distribution class, and numerics for parameters. If the dist aesthetic were extended to support distribution classes (and if provided, not require arg* aesthetics), then it could be possible to retain existing functionality whilst adding support for vectorised distributions.

@mjskay - what do you think of adding support for passing this package's distribution vectors to the dist aesthetic, and is there anything additional required from this package to make it work?

mjskay commented 4 years ago

Funny, I just ran across your package while peeking at the CRAN incoming dashboard and was contemplating asking you something similar. Relatedly, is this a fork off the fable dist_() functions or are those different?

I think it should be straightforward to support these in the stat_dist family. I would hesitate to make another whole subfamily of geoms, so the most sensible thing might be to allow dist vectors passed to the dist aesthetic and just go from there. I think it would make for a great pairing!

mitchelloharawild commented 4 years ago

Funny, I just ran across your package while peeking at the CRAN incoming dashboard and was contemplating asking you something similar.

More funny is that I found your package the same way just now!

Relatedly, is this a fork off the fable dist_() functions or are those different?

It's a re-implementation of the limited fabletools::dist_*() functions to support vctrs and be more general purpose. The old classes were more of a prototype for vectorised distributions before vctrs, and only supported quantiles (as that was all we needed).

I think it should be straightforward to support these in the stat_dist family. I would hesitate to make another whole subfamily of geoms, so the most sensible thing might be to allow dist vectors passed to the dist aesthetic and just go from there. I think it would make for a great pairing!

Great, this was my line of thinking too. Please keep me in the loop for this, if you'd like help with the integration or have any questions, let me know.

mjskay commented 4 years ago

Will do!

Relatedly, something I've been prototyping for the posterior package is a multivariate random variable vctrs datatype (see https://github.com/jgabry/posterior/issues/8). I am slowly getting back around to it and planned to finish it this summer, but I don't want to end up duplicating a bunch of functionality from here. I think the design requirements might be a bit different --- there we are building a type backed by an n-dimensional array of draws from a distribution, with the goal of representing posterior distributions (and posterior predictives) and making operations on those efficient. For that representation, I don't think it would make sense to be able to combine vectors of draws with vectors from analytical distributions, as here, into the same vector. But maybe the interfaces should be similar or compatible between the two?

I guess part of this is I don't know what your plans are for multidimensional array in this package / if you have any.

mitchelloharawild commented 4 years ago

For creating a distribution of sampled values, I currently provide dist_sample(). It is currently univariate only, but it can be easily extended for multivariate use by allowing a list of matrices to be provided instead of a vector.

Here's a reproduction of the first example in the rv vignette

library(distributional)
dist <- dist_normal(mu = 1:5, sigma = 1)
rdist <- dist_sample(generate(dist, 4000))
mean(rdist)
#> [1] 1.003288 2.012888 2.995406 4.009923 4.991616
variance(rdist)
#> [1] 1.0323948 0.9953754 1.0031009 1.0279763 1.0094595
quantile(rdist, 0.025)
#> [1] -0.99105783  0.05758795  1.02219390  2.03594099  3.02509331
median(rdist)
#> [1] 1.000293 2.002336 3.001546 4.033604 5.001511
# max(rdist) # not supported, but will be possible when I add methods for distribution's domains in #8

^{Created on 2020-06-10 by the reprex package (v0.3.0)}

I certainly don't know enough about the rv package and its applications, so I'm happy for guidance in how this can be added here. With the current design of distributional, I expect that working with distributions will not use samples unless it is necessary (or unless the user uses a dist_sample()). So most of the functionality I see in the rv vignette would be achieved by storing the component distributions, and then sampling at runtime when a value is requested.

Support for multivariate random variables is currently limited and provided for interface design purposes (and as required for fable::VAR()), but generally they are implemented using lists of matrices rather than arrays. This is partly an implementation detail, but also a design choice. However I would like distribution methods to be vectorised themselves (primarily for performance), which is issue #25.

mjskay commented 4 years ago

Cool!

I looked at rv and prototyped a list of array implementation and found it inadequate performance wise. The nice thing about the full array form is you can do things like turn multiplication of random matrices into vector products on 3d arrays and do those quickly (the prototype rvar type in the corresponding branch of posterior does this). It also works well as a wrapper around n-d arrays since those are used in other algorithms that make use of posteriors, and we want a way to expose those to users easily in case they need them.

If the list representation is necessary in dist, it's possible the use cases differ enough that two implementations of a similar idea is sensible. One option would be for me to make the rvar interface conform to dist for compatibility and ensure operators do appropriate casting.

mitchelloharawild commented 4 years ago

Sounds useful.

To check my understanding: by "list of array" implementation, did you have an array for each random variable, which was then stored in a list? Then for performance reasons it is better to store the random variables together in a 'full array'? If so, this is what #25 is trying to achieve.

I don't want the list representation to be necessary for distribution methods, rather have each distribution use the most appropriate data structure for their needs. I haven't considered multiplication of distributions yet, but I can see this being added.

mjskay commented 4 years ago

Sure, to clarify --- say you want to store a 2x2 random matrix consisting of 4000 draws. Some options might be:

1 list of length 4000, where each element is a 2x2 matrix 2 list of length 4, with dim = c(2,2), where each element is a vector of length 4000 3 array of dim c(4000,2,2) or c(2,2,4000)

I first tried out 1 and rv uses 2 (if I remember correctly). Both have their pros and cons (1 is nice in that you can create distributions over arbitrary data structures). The current prototype for rvar uses 3, which tends to be better than 1 or 2 performance wise I think. However, when I first started implementing it, it was an incredible pain in the ass to get working properly with vctrs, but I think things may have stabilized in the vctrs API since then.

mitchelloharawild commented 4 years ago

Great, this is what I had thought (although I didn't know about the (2) option).

Currently distributional only supports (1), but when #25 is done (which is fairly high importance, but complex) both (2) and (3) will be possible. I agree that (3) is probably most performant of those options.

mjskay commented 4 years ago

Very cool! Are you interested in help / feedback on that? Now that the tidybayes/ggdist split is underway my next task was to resurrect my rvar prototype and finish off the last bits to get it working with vctrs. It's currently on a stale branch in posterior.

If the dist datatype is able to work well for posterior I think that would be very cool and might also save me some work (heh :) ). I'm pinging @jgabry and @paul-buerkner on this too since it's really down to what they want --- Jonah / Paul, sounds like @mitchelloharawild's package might end up implementing something like the rvar datatype alongside a more general distribution datatype that also supports analytical distributions. It is a spinoff from {fable} and company. I'm planning to support it in the new {ggdist} package (spinoff of geoms from tidybayes), but if it gets an array-of-draws implementation it could be perfect for what we need.

mitchelloharawild commented 4 years ago

Always happy for feedback and contributions. #25 is probably best for me to work through, but once this is done we can look into how rv functionality can be added.

mjskay commented 4 years ago

Sounds great. I'll continue working on rvar on my side then and ping back around as I go and we can see how things could work together. :)

mitchelloharawild commented 4 years ago

Closing as {ggdist} is amazing and now works great with this package. Other discussion is referenced in related issues.

mitchelloharawild / distributional

Consider ggdist integration for distribution visualisation #24