stan-dev / posterior

The posterior R package
https://mc-stan.org/posterior/
Other
167 stars 23 forks source link

Feature request: summarise specific variables with specific functions in summarise_draws #105

Open n-kall opened 3 years ago

n-kall commented 3 years ago

Currently summarise_draws applies the summary functions to all variables. In some cases it could be useful to summarise specific variables with specific functions, akin to dplyr:summarise. This would also allow for summary functions to take multiple variables as functions.

A simple example usecase might be calculating the KL-divergence between the distributions of two parameters in the draws object, e.g.:

# draws has variables 'x' and 'y'

# currently doesn't work because it tries to apply the summary function to all variables
summarise_draws(draws, kl_xy = kl_div(x, y))
paul-buerkner commented 3 years ago

I agree, this is something we need. From how I see it now, we have two options to make that possible:

1) Add an option to summarise_draws to change modes between the "all variables and univariate measures" (current= and "selected variables applied selectively to multivariate measures" mode (new). 2) add a new function with a different name for the new mode. Of course, we should also discuss which of the two modes should eventually have the name summarise_draws.

mjskay commented 3 years ago

I get the need for some way to calculate multivariate measures, but it occurs to me the selection part of this question is maybe overloading summarise_draws to do more than it needs to --- I'm always suspicious of functions that start accumulating a lot of functionality and options. Usually it's a sign the API needs to be decomposed in some way. Perhaps variable selection is better handled by doing subset_draws prior to summarise_draws? Unless I'm mistaken, in most cases where different variables are summarised differently the current format of output tables in summarise_draws won't conform so you wouldn't easily be able to automatically output them in a single table anyway.

paul-buerkner commented 3 years ago

I agree with @mjskay that we should add another function (option 2) to not mess up the cleanness of the API.

I am a little unsure about the rest of your comments. The point is not necessarily to perform selection on the fly and I agree this is better handled by adding a subset_draws to the pipe.

What I envision the new interface to handle are multivariate input (draw vectors for multiple variables) and scalar output, resempling the interface of dplyr::summarise. Common examples could be correlating variables, but also more special applications such as the one @n-kall mentioned above. Of course people could also use it for simple things such as mean(x) if they are used to the dplyr::summarise interface. Not sure if these applications justify being added to posterior in the first place, but my current understanding is that there are some relevant applications.

mjskay commented 3 years ago

I am a little unsure about the rest of your comments. The point is not necessarily to perform selection on the fly and I agree this is better handled by adding a subset_draws to the pipe.

The point I was trying to make is that there's some spectrum between a maximalist API that tries to incorporate every feature into big functions and one that is structured in such a way that users can come up with combinations of simple operations that accomplish the tasks they are trying to accomplish, as well as tasks you haven't thought of in designing the API.

What I envision the new interface to handle are multivariate input (draw vectors for multiple variables) and scalar output, resempling the interface of dplyr::summarise. Common examples could be correlating variables, but also more special applications such as the one @n-kall mentioned above. Of course people could also use it for simple things such as mean(x) if they are used to the dplyr::summarise interface. Not sure if these applications justify being added to posterior in the first place, but my current understanding is that there are some relevant applications.

Makes sense. My gut reaction was to think that this kind of summarization is pretty easy to do with existing APIs, hence my reticence at re-inventing it. E.g. an as_draws_df() followed by a summarise() would do it, and if you did a subset_draws() first to the relevant variables it would likely not be slow (and for folks who don't like dplyr you could use aggregate() from base instead).

On the other hand I can see an argument for a consistent API across draws data types for doing this kind of thing (perhaps if the operations are very common?). I suppose I don't have a strong opinion either way, but wanted to raise a point of discussion about how "big" the posterior API is intended to be.

paul-buerkner commented 3 years ago

Yes, you are right, one could move to dplyr for this purpose. Hmm, I will also have to think of this more whether adding such funcationality for all formats would be worthwile and how it should look.

jsocolar commented 3 years ago

This is a throwaway comment since it seems like you're all on the same page anyway, but just to mention that in addition to the other concerns, attempting to shoehorn this inside summarise_draws would break the parallelized version if relevant variables ended up on different chunks.