mjskay / tidybayes

Bayesian analysis + tidy data + geoms (R package)
http://mjskay.github.io/tidybayes
GNU General Public License v3.0

Custom reserved variables #323

Closed jfsalzmann closed 2 months ago

jfsalzmann commented 2 months ago

I noticed that I cannot modify reserved variables.

I'm looking for a way to pass information on whether a draw comes from the warmup or the sampling phase through to the gather_draws / spread_draws output, so far without success. I tried to base this on @mjskay's suggestion in #236 to use posterior's draws formats.

Here is how far I have come (experimental approach, just so I can understand what limits this on the way):

reserved_df_variables = function() 
{
  c(".chain", ".iteration", ".draw",".warmup")
}
assignInNamespace("reserved_df_variables",reserved_df_variables,"posterior")

and in my custom draws extraction pipe, in the end I do

posterior = [...] %>%
  mutate(.warmup = ifelse(warmup_saved & inc_warmup & .iteration<sampling_draws, 1, 0)) %>%
  setattr("class", posterior:::class_draws_df())

so that I get a draws_df object that correctly "understands" (or just prints?) .warmup to be the respective indicator:

> posterior
# A draws_df: 1500 iterations, 4 chains, and 1 variables
     mu_y
1   1.610
2   1.610
3   1.610
4   1.611
5   1.612
6   0.251
7   0.274
8  -0.271
9  -0.078
10 -0.027
# ... with 5990 more draws
# ... hidden reserved variables {'.chain', '.iteration', '.draw', '.warmup'}

Both gather_draws and spread_draws drop the column. Having looked at the source code, I see no straightforward workaround, as the reserved names seem to be hardcoded in many different places. They are even inconsistent: gather_variables, for instance, also reserves .rows by default, which initially made me hope I could just use that name, but apparently .rows is not reserved in other places. Interestingly, though, when using .rows, gather_variables returns the expected result.

> posterior %>% gather_draws(mu_y)
# A tibble: 6,000 × 5
# Groups:   .variable [1]
   .chain .iteration .draw .variable  .value
    <int>      <int> <int> <chr>       <dbl>
 1      1          1     1 mu_y       1.61  
 2      1          2     2 mu_y       1.61  
 3      1          3     3 mu_y       1.61  
 4      1          4     4 mu_y       1.61  
 5      1          5     5 mu_y       1.61  
 6      1          6     6 mu_y       0.251 
 7      1          7     7 mu_y       0.274 
 8      1          8     8 mu_y      -0.271 
 9      1          9     9 mu_y      -0.0778
10      1         10    10 mu_y      -0.0268
# ℹ 5,990 more rows
# ℹ Use `print(n = ...)` to see more rows

Expected behavior: .warmup as another column.

Can somebody think of a workaround that would still allow me to use the tidy select interface of gather_draws / spread_draws?

As a side note, I also noticed I get an error when coding .warmup as boolean and passing this to gather_draws/spread_draws.

mjskay commented 2 months ago

Reserved variables are not really used by {tidybayes}, as {posterior} came later and its notion of reserved variables is a bit different from what {tidybayes} is doing with .chain / .iteration / .draw. {tidybayes} is just using those variables for the purposes of indexing.

If the draws are already uniquely identified by the combination of .chain, .iteration, and .draw, you don't need .warmup to be treated as an index variable in {tidybayes}; you can just treat it like any other variable by including it in the spread_draws / gather_draws spec. For example:

library(posterior)
library(tidybayes)

example_draws() |> 
  as_draws_df() |> 
  dplyr::mutate(.warmup = rep(c(TRUE, FALSE), each = 200)) |> 
  spread_draws(theta[i], .warmup)
#> # A tibble: 3,200 × 6
#> # Groups:   i [8]
#>        i  theta .chain .iteration .draw .warmup
#>    <int>  <dbl>  <int>      <int> <int> <lgl>  
#>  1     1  3.96       1          1     1 TRUE   
#>  2     1  0.124      1          2     2 TRUE   
#>  3     1 21.3        1          3     3 TRUE   
#>  4     1 14.7        1          4     4 TRUE   
#>  5     1  5.96       1          5     5 TRUE   
#>  6     1  5.76       1          6     6 TRUE   
#>  7     1  4.03       1          7     7 TRUE   
#>  8     1 -0.278      1          8     8 TRUE   
#>  9     1  1.81       1          9     9 TRUE   
#> 10     1  6.08       1         10    10 TRUE   
#> # ℹ 3,190 more rows

Created on 2024-04-22 with reprex v2.1.0

Now, if you do need .warmup to uniquely identify draws, that's another story. Probably these functions should have a draw_indices argument like unspread_draws and ungather_draws do.

jfsalzmann commented 2 months ago

Thanks @mjskay. Indeed, in my case .warmup is not required to uniquely identify draws. However, I mostly use gather_draws, and there, when .warmup is included just like any other variable, it ends up in the .variable column, which I believe is the crux of the problem.

posterior %>% gather_draws(mu_y,.warmup) %>% arrange(.variable)
# A tibble: 12,000 × 5
# Groups:   .variable [2]
   .chain .iteration .draw .variable .value
    <int>      <int> <int> <chr>      <dbl>
 1      1          1     1 .warmup        1
 2      1          2     2 .warmup        1
 3      1          3     3 .warmup        1
 4      1          4     4 .warmup        1
 5      1          5     5 .warmup        1
 6      1          6     6 .warmup        1
 7      1          7     7 .warmup        1
 8      1          8     8 .warmup        1
 9      1          9     9 .warmup        1
10      1         10    10 .warmup        1
# ℹ 11,990 more rows
# ℹ Use `print(n = ...)` to see more rows

I think a draw_indices argument for gather_draws, and possibly spread_draws, would help a lot.
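Without such an argument, one possible workaround is to gather only the model variables and then join the flag back on by the draw indices. The following is a minimal sketch rather than anything confirmed in this thread: it uses example_draws() from {posterior} in place of the real model, and the 50-iteration warmup cutoff and the names draws, long, warmup_lookup, and long_flagged are assumptions made purely for illustration.

```r
library(posterior)
library(tidybayes)

# Illustrative data: example_draws() has 4 chains x 100 iterations.
# The warmup cutoff of 50 iterations is an assumption for this sketch.
draws <- example_draws() |>
  as_draws_df() |>
  dplyr::mutate(.warmup = as.integer(.iteration <= 50))

# Gather only the model variables, so .warmup stays out of .variable.
long <- gather_draws(draws, mu, tau)

# Lookup table with one row per draw, carrying the warmup flag.
warmup_lookup <- tibble::tibble(
  .chain     = draws$.chain,
  .iteration = draws$.iteration,
  .draw      = draws$.draw,
  .warmup    = draws$.warmup
)

# Attach the flag to every gathered row via the draw indices.
long_flagged <- dplyr::left_join(
  long, warmup_lookup,
  by = c(".chain", ".iteration", ".draw")
)
```

This keeps the tidy select interface of gather_draws, at the cost of one extra join per extraction.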

mjskay commented 2 months ago

The GitHub version now has a draw_indices parameter for gather_draws and spread_draws. Let me know if it doesn't do what you need.
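As a rough sketch of how this might be used (assuming an installed {tidybayes} version that includes draw_indices, and using example_draws() with an illustrative 50-iteration warmup cutoff rather than the model from this issue), .warmup can be listed alongside the default index columns:

```r
library(posterior)
library(tidybayes)

# Illustrative data; the .warmup flag and its cutoff are assumptions.
draws <- example_draws() |>
  as_draws_df() |>
  dplyr::mutate(.warmup = as.integer(.iteration <= 50))

# Treat .warmup as a draw index so it is carried along as its own column
# instead of being dropped or gathered into .variable.
long <- draws |>
  gather_draws(
    mu, tau,
    draw_indices = c(".chain", ".iteration", ".draw", ".warmup")
  )
```

The result should then carry .warmup as a separate column next to .chain, .iteration, and .draw, with only mu and tau appearing in .variable.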

jfsalzmann commented 2 months ago

Amazing, thank you very much! I tested it and it seems to work fine. Initially I had a feeling that gather_draws and spread_draws are a bit slower now, but I have also changed my code in the meantime, so I assume that's on my side.

mjskay commented 2 months ago

Yeah, I wouldn't expect this to affect the speed of those functions in any appreciable way. Glad it helps!