Expected behaviour for grouped data + sample_n_keys

emitanaka commented 4 years ago

I was hoping to sample 1 key per group as below, but the output seems a bit random: sometimes the result is correct, but other times a group gets 2 keys instead of 1, and so on.

library(tsibble)
library(brolgar)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

set.seed(1)
out <- ChickWeight %>% 
  as_tsibble(key = Chick, index = Time) %>% 
  group_by(Diet) %>% # 4 diets
  sample_n_keys(1) # expecting 1 chick per diet

# shows as 50 chicks - is it a tsibble thing?
out
#> # A tsibble: 48 x 4 [1]
#> # Key:       Chick [50]
#> # Groups:    Diet [4]
#>    weight  Time Chick Diet 
#>     <dbl> <dbl> <ord> <fct>
#>  1     41     0 13    1    
#>  2     48     2 13    1    
#>  3     53     4 13    1    
#>  4     60     6 13    1    
#>  5     65     8 13    1    
#>  6     67    10 13    1    
#>  7     71    12 13    1    
#>  8     70    14 13    1    
#>  9     71    16 13    1    
#> 10     81    18 13    1    
#> # … with 38 more rows

# actual number of chicks sampled
# the number sampled seems random. Sometimes it is correct, sometimes it is like below.
out %>% 
  distinct(Chick, Diet)
#> # A tibble: 7 x 2
#> # Groups:   Diet [4]
#>   Chick Diet 
#>   <ord> <fct>
#> 1 13    1    
#> 2 30    2    
#> 3 22    2    
#> 4 37    3    
#> 5 36    3    
#> 6 45    4    
#> 7 43    4

Created on 2020-11-01 by the reprex package (v0.3.0.9001)

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.0.1 (2020-06-06)
#>  os       macOS Catalina 10.15.7
#>  system   x86_64, darwin17.0
#>  ui       X11
#>  language (EN)
#>  collate  en_AU.UTF-8
#>  ctype    en_AU.UTF-8
#>  tz       Australia/Melbourne
#>  date     2020-11-01
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package        * version    date       lib source
#>  anytime          0.3.9      2020-08-27 [1] CRAN (R 4.0.2)
#>  assertthat       0.2.1      2019-03-21 [2] CRAN (R 4.0.0)
#>  backports        1.1.10     2020-09-15 [1] CRAN (R 4.0.2)
#>  brolgar        * 0.0.6.9100 2020-10-30 [1] Github (njtierney/brolgar@28e95bb)
#>  cli              2.1.0      2020-10-12 [1] CRAN (R 4.0.2)
#>  colorspace       1.4-1      2019-03-18 [1] CRAN (R 4.0.2)
#>  crayon           1.3.4      2017-09-16 [2] CRAN (R 4.0.0)
#>  digest           0.6.27     2020-10-24 [1] CRAN (R 4.0.2)
#>  distributional   0.2.1      2020-10-06 [1] CRAN (R 4.0.2)
#>  dplyr          * 1.0.1      2020-07-26 [1] Github (tidyverse/dplyr@16647fc)
#>  ellipsis         0.3.1      2020-05-15 [2] CRAN (R 4.0.0)
#>  evaluate         0.14       2019-05-28 [2] CRAN (R 4.0.0)
#>  fabletools       0.2.1      2020-09-03 [1] CRAN (R 4.0.2)
#>  fansi            0.4.1      2020-01-08 [2] CRAN (R 4.0.0)
#>  farver           2.0.3.9000 2020-07-24 [1] Github (thomasp85/farver@f1bcb56)
#>  fs               1.5.0      2020-07-31 [1] CRAN (R 4.0.2)
#>  generics         0.0.2      2018-11-29 [2] CRAN (R 4.0.0)
#>  ggplot2          3.3.2      2020-06-19 [1] CRAN (R 4.0.2)
#>  glue             1.4.2      2020-08-27 [1] CRAN (R 4.0.2)
#>  gtable           0.3.0      2019-03-25 [2] CRAN (R 4.0.0)
#>  highr            0.8        2019-03-20 [2] CRAN (R 4.0.0)
#>  htmltools        0.5.0      2020-06-16 [1] CRAN (R 4.0.2)
#>  knitr            1.29       2020-06-23 [1] CRAN (R 4.0.2)
#>  lifecycle        0.2.0      2020-03-06 [1] CRAN (R 4.0.0)
#>  lubridate        1.7.9      2020-06-08 [2] CRAN (R 4.0.1)
#>  magrittr         1.5        2014-11-22 [2] CRAN (R 4.0.0)
#>  munsell          0.5.0      2018-06-12 [2] CRAN (R 4.0.0)
#>  pillar           1.4.6      2020-07-10 [1] CRAN (R 4.0.1)
#>  pkgconfig        2.0.3      2019-09-22 [2] CRAN (R 4.0.0)
#>  purrr            0.3.4      2020-04-17 [2] CRAN (R 4.0.0)
#>  R6               2.5.0      2020-10-28 [1] CRAN (R 4.0.2)
#>  Rcpp             1.0.5      2020-07-06 [1] CRAN (R 4.0.0)
#>  reprex           0.3.0.9001 2020-08-08 [1] Github (tidyverse/reprex@9594ee9)
#>  rlang            0.4.8      2020-10-08 [1] CRAN (R 4.0.2)
#>  rmarkdown        2.3        2020-06-18 [1] CRAN (R 4.0.2)
#>  rstudioapi       0.11       2020-02-07 [2] CRAN (R 4.0.0)
#>  scales           1.1.1      2020-05-11 [2] CRAN (R 4.0.0)
#>  sessioninfo      1.1.1      2018-11-05 [2] CRAN (R 4.0.0)
#>  stringi          1.4.6      2020-02-17 [2] CRAN (R 4.0.0)
#>  stringr          1.4.0      2019-02-10 [2] CRAN (R 4.0.0)
#>  styler           1.3.2      2020-02-23 [1] CRAN (R 4.0.1)
#>  tibble           3.0.4      2020-10-12 [1] CRAN (R 4.0.2)
#>  tidyr            1.1.2      2020-08-27 [1] CRAN (R 4.0.2)
#>  tidyselect       1.1.0      2020-05-11 [2] CRAN (R 4.0.0)
#>  tsibble        * 0.9.3.9000 2020-11-01 [1] Github (tidyverts/tsibble@e749eb6)
#>  utf8             1.1.4      2018-05-24 [2] CRAN (R 4.0.0)
#>  vctrs            0.3.2.9000 2020-07-26 [1] Github (r-lib/vctrs@df8a659)
#>  withr            2.3.0      2020-09-22 [1] CRAN (R 4.0.2)
#>  xfun             0.16       2020-07-24 [1] CRAN (R 4.0.2)
#>  yaml             2.2.1      2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /Users/etan0038/Library/R/4.0/library
#> [2] /Library/Frameworks/R.framework/Versions/4.0/Resources/library
```

dicook commented 4 years ago

I think facet_strata() might do it for plotting the result. But you’d like the sampling to behave similarly, I think.


dicook commented 4 years ago

Oh, I think it is simply that brolgar assumes unique keys. Similarly, tsibble is also assuming unique keys.

You need to use tibble/dplyr: group_by(your_character_variable) %>% sample_n()

Think of your variable as a grouping variable rather than an id variable. It would be interesting to think about handling this as an extension of tsibble and brolgar. It's a tsibble with replicates %^)

dicook commented 4 years ago

Oh, nope, it's something not working in sample_n_keys(), e.g.

wages %>% group_by(black) %>% sample_n_keys(2)

ignores the group_by
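While sample_n_keys() does not respect the grouping, one possible workaround is to sample the keys themselves with plain dplyr and then keep every row belonging to the sampled keys. The sketch below is only an illustration, not brolgar's implementation: it works on a plain tibble made from ChickWeight, the names `chicks` and `sampled_keys` are made up for the example, and it assumes each key belongs to exactly one group (each chick has one diet).

```r
library(dplyr)

# Plain tibble copy of the data; the key is Chick and the group is Diet.
chicks <- as_tibble(ChickWeight)

set.seed(1)

# Step 1: reduce to one row per key/group pair, then sample keys
# (not observations) within each group.
sampled_keys <- chicks %>%
  distinct(Chick, Diet) %>%
  group_by(Diet) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Step 2: keep all observations of the sampled keys.
out <- chicks %>%
  semi_join(sampled_keys, by = c("Chick", "Diet"))

# Check: number of distinct chicks per diet in the result.
out %>%
  distinct(Chick, Diet) %>%
  count(Diet)
```

If a tsibble is needed afterwards, `out` can be passed to `as_tsibble(key = Chick, index = Time)` as in the original reprex.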

njtierney commented 4 years ago

Thanks for posting the issue, @emitanaka!

This seems like a bug; I'll fix it before submitting to CRAN.

emitanaka commented 4 years ago

@dicook actually the behaviour is random. If you repeat your command, occasionally it shows some rows. The bug could be that it still thinks the number of keys is the same as in the full data (not sure).

njtierney / brolgar

Expected behaviour for grouped data + sample_n_keys #104
