mjskay / tidybayes

Bayesian analysis + tidy data + geoms (R package)
http://mjskay.github.io/tidybayes
GNU General Public License v3.0
718 stars 59 forks source link

Non-standard use of non-standard evaluation #199

Open lionel- opened 5 years ago

lionel- commented 5 years ago

I think capturing expressions in functions like spread_draws() obscures the UI in tidybayes. For instance in:

m %>% spread_draws(condition_mean[condition], response_sd)

It makes sense that we are selecting condition_mean from the set of existing model parameters. But while condition_mean selects a parameter, condition on the other hand defines a new column.

In general, it is best keep NSE purely for data-masking, so that all NSE interfaces share a consistent feel and it is easy for users to pick up new packages. That's why all the new functions in tidyr take a string to define new columns, rather than unquoted column names.

lionel- commented 5 years ago

It took me a while to find about tidy_draws(). Perhaps it'd make sense for spread_draws() without argument to select everything?

It feels like a tidyselect backend would make sense for selecting parameters. Combined with the suggestion above, we'd have

fit %>%
  spread_draws(
    starts_with("sd_"),
    indices = c(sd_alpha = "i", sd_beta = c("j", "k"))
  )

This would select all parameters starting with sd_, with an aggregation spec corresponding to sd_alpha[i], sd_beta[j, k] with the existing UI.

mjskay commented 5 years ago

I like the idea of spread_draws() without arguments giving tidy_draws()-like output. Made an issue here: #200

I see the potential value of a more tidyverse-style NSE in spread_draws/gather_draws, but there's a tension here in that there are two competing mental models I am trying to support with tidybayes: the tidyverse and probabilistic programming languages like Stan/JAGS/etc. I generally assume users have some mental model that includes each paradigm. I see spread_draws/gather_draws as the interface between those two worlds: I want everything else in the tidybayes API to follow the tidyverse model, but the little bit of code inside spread_draws( ... ) is intended to match more closely with a user's mental model when thinking about Stan/JAGS/etc model specifications. I think this makes it easier to use: sd_alpha[i] is exactly the sort of thing you would write in Stan (for example), and spread_draws() does the work of figuring out how to tidy that up so that you can write everything else outside of it in tidyverse style. Then, the rest of the tidybayes API follows (or at least should follow) tidyverse style.

That said, I think adding alternate syntax to spread/gather_draws that uses tidyselect (maybe something like spread_draws_at() to parallel mutate_at()/etc) would be useful. Sometimes you have models with weirder parameter name output that requires some finagling with spread_draws which might be made easier with tidyselect syntax.

Re: variables versus indices, I could see changing the index syntax to use character vectors for index names to better match the distinction between existing variables / created ones.

lionel- commented 5 years ago

I think the intention makes sense. However, in a Stan model, the [ notation is for using indices, not declaring them. Indices are declared separately via the for loop notation.

About _at variants, note that there are long term plans to make tidyselections and predicate selections available inside the main verbs, in order to avoid the multiplication of functions in dplyr. Instead of mutate_at(vars(starts_with("foo")), fn), you'd do mutate(mapping(starts_with("foo"), fn)). This doesn't mean that _at variants will be irrelevant in all cases, but that's something to think about.

Note that in principle you could support both tidyselect and the current notations.

mjskay commented 5 years ago

I think the intention makes sense. However, in a Stan model, the [ notation is for using indices, not declaring them. Indices are declared separately via the for loop notation.

Right, the intention is to mimic the usage syntax, not the declaration syntax (frankly, Stan's declaration syntax is a bit convoluted owing to its C++ heritage; modern programming languages seem to avoid the terrible inside-out index declaration order that you get in C/C++, yet Stan adopted it). The usage syntax, which is sensible, is also the same syntax for variable usage in JAGS, and is the same syntax you see in the column names themselves in posterior matrices and data frames (and the output you get from get_variables()). The idea is to capitalize on that consistent interface using that same usage syntax as a quick way to get data from that world into the tidyverse. A number of users have told me spread_draws is one of the best parts of the package, which I think is a result of the fact that its interface is able to capitalize on that existing domain knowledge.

About _at variants, note that there are long term plans to make tidyselections and predicate selections available inside the main verbs, in order to avoid the multiplication of functions in dplyr. Instead of mutate_at(vars(starts_with("foo")), fn), you'd do mutate(mapping(starts_with("foo"), fn)). This doesn't mean that _at variants will be irrelevant in all cases, but that's something to think about.

Note that in principle you could support both tidyselect and the current notations.

I like the idea of supporting a tidyselect interface with a function that could used in place of the symbolic specification for a variable. That would be an easy path forward that would also require less re-architecting of the thing, since variable spec parsing is already separate from all the rest of the logic of spread_draws/gather_draws. Is there an issue or something about the future plans for replacing the _at functions so I can track the plans for those changes in order to be consistent with them?