tidyverse / tidyr

Tidy Messy Data
https://tidyr.tidyverse.org/

problems with nesting by a factor when some levels are not present #554

Closed mkoohafkan closed 5 years ago

mkoohafkan commented 5 years ago

This works fine:

library(tidyverse)
mtcars %>% as_tibble() %>% mutate(cyl2 = factor(cyl, c(4L, 6L, 8L))) %>% nest(-cyl2)
## # A tibble: 3 x 2
##   cyl2  data              
##   <fct> <list>            
## 1 6     <tibble [7 x 11]> 
## 2 4     <tibble [11 x 11]>
## 3 8     <tibble [14 x 11]>

However, this does not:

mtcars %>% as_tibble() %>% mutate(cyl2 = factor(cyl, c(2L, 4L, 6L, 8L))) %>% nest(-cyl2) 
## # A tibble: 3 x 2
##   cyl2  data              
##   <fct> <list>            
## 1 6     <tibble [14 x 11]>
## 2 4     <tibble [7 x 11]> 
## 3 8     <NULL>  

When there are additional levels in a factor for which there is no data, attempting to nest by this factor introduces NULL values. Even worse, the nesting IDs do not match what actually gets nested:

mtcars %>% as_tibble() %>% mutate(cyl2 = factor(cyl, c(2L, 4L, 6L, 8L))) %>% nest(-cyl2) %>% slice(1) %>% unnest(data)
## # A tibble: 14 x 12
##    cyl2    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##    <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 6      18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  2 6      14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  3 6      16.4     8  276.   180  3.07  4.07  17.4     0     0     3     3
##  4 6      17.3     8  276.   180  3.07  3.73  17.6     0     0     3     3

Note that cyl and cyl2 do not match in the second case.
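
For readers who want to check whether their own factor is in the state described above, here is a minimal base R sketch that lists levels with no data behind them. The names `declared`, `unused`, and `cyl2_clean` are illustrative only. Dropping the unused levels before nesting is an assumed workaround, not a confirmed fix: a later comment in this thread reports NULL entries with gapminder as well.

```r
# Base R only: rebuild the factor from the report, including the unused level "2".
cyl2 <- factor(mtcars$cyl, levels = c(2L, 4L, 6L, 8L))

# Levels declared on the factor vs. values that actually occur in the data.
declared <- levels(cyl2)
unused <- setdiff(declared, unique(as.character(cyl2)))
print(unused)

# table() makes the empty level visible as a zero count.
print(table(cyl2))

# droplevels() returns a factor that keeps only the levels present in the data.
cyl2_clean <- droplevels(cyl2)
print(levels(cyl2_clean))
```

If `unused` is non-empty, the factor has the same shape as the failing `cyl2` example above.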

mkoohafkan commented 5 years ago

This seems to be the same issue as #542, but it is not fixed with the development version of dplyr. Session info below:

R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 
[2] LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] forcats_0.4.0    stringr_1.4.0    dplyr_0.8.0.9000 purrr_0.3.0     
[5] readr_1.3.1      tidyr_0.8.2      tibble_2.0.1     ggplot2_3.1.0   
[9] tidyverse_1.2.1 

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5  remotes_2.0.2     haven_2.1.0       lattice_0.20-38  
 [5] colorspace_1.4-0  generics_0.0.2    testthat_2.0.1    usethis_1.4.0    
 [9] utf8_1.1.4        rlang_0.3.1       pkgbuild_1.0.2    pillar_1.3.1     
[13] glue_1.3.0        withr_2.1.2       modelr_0.1.4      sessioninfo_1.1.1
[17] readxl_1.3.0      plyr_1.8.4        munsell_0.5.0     gtable_0.2.0     
[21] cellranger_1.1.0  rvest_0.3.2       devtools_2.0.1    memoise_1.1.0    
[25] callr_3.1.1       ps_1.3.0          curl_3.3          fansi_0.4.0      
[29] broom_0.5.1       Rcpp_1.0.0        backports_1.1.3   scales_1.0.0     
[33] desc_1.2.0        pkgload_1.0.2     jsonlite_1.6      fs_1.2.6         
[37] hms_0.4.2         digest_0.6.18     stringi_1.3.1     processx_3.2.1   
[41] grid_3.5.2        rprojroot_1.3-2   cli_1.0.1         tools_3.5.2      
[45] magrittr_1.5      lazyeval_0.2.1    crayon_1.3.4      pkgconfig_2.0.2  
[49] xml2_1.2.0        prettyunits_1.0.2 lubridate_1.7.4   rstudioapi_0.9.0 
[53] assertthat_0.2.0  httr_1.4.0        R6_2.4.0          nlme_3.1-137     
[57] compiler_3.5.2 

batpigandme commented 5 years ago

Note that this does not happen with forcats::as_factor() in place of the base factor() you're using above (the warning appears only because I left the factor-level arguments in):

library(tidyverse)
mtcars %>% as_tibble() %>% mutate(cyl2 = as_factor(cyl, c(4L, 6L, 8L))) %>% nest(-cyl2)
#> Warning: Some components of ... were not used: ..1
#> # A tibble: 3 x 2
#>   cyl2  data              
#>   <fct> <list>            
#> 1 6     <tibble [7 × 11]> 
#> 2 4     <tibble [11 × 11]>
#> 3 8     <tibble [14 × 11]>
mtcars %>% as_tibble() %>% mutate(cyl2 = as_factor(cyl, c(2L, 4L, 6L, 8L))) %>% nest(-cyl2)
#> Warning: Some components of ... were not used: ..1
#> # A tibble: 3 x 2
#>   cyl2  data              
#>   <fct> <list>            
#> 1 6     <tibble [7 × 11]> 
#> 2 4     <tibble [11 × 11]>
#> 3 8     <tibble [14 × 11]>
mtcars %>% as_tibble() %>% mutate(cyl2 = as_factor(cyl, c(2L, 4L, 6L, 8L))) %>% nest(-cyl2) %>% slice(1) 
#> Warning: Some components of ... were not used: ..1
#> # A tibble: 1 x 2
#>   cyl2  data             
#>   <fct> <list>           
#> 1 6     <tibble [7 × 11]>
mtcars %>% as_tibble() %>% mutate(cyl2 = as_factor(cyl, c(2L, 4L, 6L, 8L))) %>% nest(-cyl2) %>% slice(1) %>% unnest(data)
#> Warning: Some components of ... were not used: ..1
#> # A tibble: 7 x 12
#>   cyl2    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 6      21       6  160    110  3.9   2.62  16.5     0     1     4     4
#> 2 6      21       6  160    110  3.9   2.88  17.0     0     1     4     4
#> 3 6      21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#> 4 6      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#> 5 6      19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> 6 6      17.8     6  168.   123  3.92  3.44  18.9     1     0     4     4
#> 7 6      19.7     6  145    175  3.62  2.77  15.5     0     1     5     6

Created on 2019-02-19 by the reprex package (v0.2.1.9000)

pkq commented 5 years ago

However, applying forcats::as_factor() to an existing factor variable doesn't seem to fix the issue. Using a larger dataset (e.g., gapminder), the first NULL value doesn't show up until row 30...

library(tidyverse)
library(gapminder)

str(gapminder)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
#>  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
#>  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
#>  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
#>  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
#>  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
#>  $ gdpPercap: num  779 821 853 836 740 ...

gapminder %>%
  mutate_at(vars(country, continent),
            list(name = ~as_factor(.))) %>% 
  nest(-country, -continent) %>% 
  slice(25:34)
#> # A tibble: 10 x 3
#>    country          continent data             
#>    <fct>            <fct>     <list>           
#>  1 China            Asia      <tibble [12 × 6]>
#>  2 Colombia         Americas  <tibble [12 × 6]>
#>  3 Comoros          Africa    <tibble [12 × 6]>
#>  4 Congo, Dem. Rep. Africa    <tibble [12 × 6]>
#>  5 Congo, Rep.      Africa    <tibble [12 × 6]>
#>  6 Costa Rica       Americas  <NULL>           
#>  7 Cote d'Ivoire    Africa    <NULL>           
#>  8 Croatia          Europe    <NULL>           
#>  9 Cuba             Americas  <NULL>           
#> 10 Czech Republic   Europe    <NULL>

Created on 2019-02-19 by the reprex package (v0.2.1)

Session info

``` r
devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 3.5.2 (2018-12-20)
#>  os       macOS Mojave 10.14.3
#>  system   x86_64, darwin15.6.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Los_Angeles
#>  date     2019-02-19
#>
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package     * version date       lib source
#>  assertthat    0.2.0   2017-04-11 [1] CRAN (R 3.5.0)
#>  backports     1.1.3   2018-12-14 [1] CRAN (R 3.5.0)
#>  broom         0.5.1   2018-12-05 [1] CRAN (R 3.5.0)
#>  callr         3.1.1   2018-12-21 [1] CRAN (R 3.5.0)
#>  cellranger    1.1.0   2016-07-27 [1] CRAN (R 3.5.0)
#>  cli           1.0.1   2018-09-25 [1] CRAN (R 3.5.0)
#>  colorspace    1.4-0   2019-01-13 [1] CRAN (R 3.5.2)
#>  crayon        1.3.4   2017-09-16 [1] CRAN (R 3.5.0)
#>  desc          1.2.0   2018-05-01 [1] CRAN (R 3.5.0)
#>  devtools      2.0.1   2018-10-26 [1] CRAN (R 3.5.2)
#>  digest        0.6.18  2018-10-10 [1] CRAN (R 3.5.0)
#>  dplyr       * 0.8.0.1 2019-02-15 [1] CRAN (R 3.5.2)
#>  ellipsis      0.1.0   2019-02-19 [1] CRAN (R 3.5.2)
#>  evaluate      0.13    2019-02-12 [1] CRAN (R 3.5.2)
#>  fansi         0.4.0   2018-10-05 [1] CRAN (R 3.5.0)
#>  forcats     * 0.4.0   2019-02-17 [1] CRAN (R 3.5.2)
#>  fs            1.2.6   2018-08-23 [1] CRAN (R 3.5.0)
#>  gapminder   * 0.3.0   2017-10-31 [1] CRAN (R 3.5.0)
#>  generics      0.0.2   2018-11-29 [1] CRAN (R 3.5.0)
#>  ggplot2     * 3.1.0   2018-10-25 [1] CRAN (R 3.5.0)
#>  glue          1.3.0   2018-07-17 [1] CRAN (R 3.5.0)
#>  gtable        0.2.0   2016-02-26 [1] CRAN (R 3.5.0)
#>  haven         2.1.0   2019-02-19 [1] CRAN (R 3.5.2)
#>  highr         0.7     2018-06-09 [1] CRAN (R 3.5.0)
#>  hms           0.4.2   2018-03-10 [1] CRAN (R 3.5.0)
#>  htmltools     0.3.6   2017-04-28 [1] CRAN (R 3.5.0)
#>  httr          1.4.0   2018-12-11 [1] CRAN (R 3.5.0)
#>  jsonlite      1.6     2018-12-07 [1] CRAN (R 3.5.0)
#>  knitr         1.21    2018-12-10 [1] CRAN (R 3.5.2)
#>  lattice       0.20-38 2018-11-04 [2] CRAN (R 3.5.2)
#>  lazyeval      0.2.1   2017-10-29 [1] CRAN (R 3.5.0)
#>  lubridate     1.7.4   2018-04-11 [1] CRAN (R 3.5.0)
#>  magrittr      1.5     2014-11-22 [1] CRAN (R 3.5.0)
#>  memoise       1.1.0   2017-04-21 [1] CRAN (R 3.5.0)
#>  modelr        0.1.4   2019-02-18 [1] CRAN (R 3.5.2)
#>  munsell       0.5.0   2018-06-12 [1] CRAN (R 3.5.0)
#>  nlme          3.1-137 2018-04-07 [2] CRAN (R 3.5.2)
#>  pillar        1.3.1   2018-12-15 [1] CRAN (R 3.5.0)
#>  pkgbuild      1.0.2   2018-10-16 [1] CRAN (R 3.5.0)
#>  pkgconfig     2.0.2   2018-08-16 [1] CRAN (R 3.5.0)
#>  pkgload       1.0.2   2018-10-29 [1] CRAN (R 3.5.0)
#>  plyr          1.8.4   2016-06-08 [1] CRAN (R 3.5.0)
#>  prettyunits   1.0.2   2015-07-13 [1] CRAN (R 3.5.0)
#>  processx      3.2.1   2018-12-05 [1] CRAN (R 3.5.0)
#>  ps            1.3.0   2018-12-21 [1] CRAN (R 3.5.0)
#>  purrr       * 0.3.0   2019-01-27 [1] CRAN (R 3.5.2)
#>  R6            2.4.0   2019-02-14 [1] CRAN (R 3.5.2)
#>  Rcpp          1.0.0   2018-11-07 [1] CRAN (R 3.5.0)
#>  readr       * 1.3.1   2018-12-21 [1] CRAN (R 3.5.0)
#>  readxl        1.3.0   2019-02-15 [1] CRAN (R 3.5.2)
#>  remotes       2.0.2   2018-10-30 [1] CRAN (R 3.5.0)
#>  rlang         0.3.1   2019-01-08 [1] CRAN (R 3.5.2)
#>  rmarkdown     1.11    2018-12-08 [1] CRAN (R 3.5.0)
#>  rprojroot     1.3-2   2018-01-03 [1] CRAN (R 3.5.0)
#>  rvest         0.3.2   2016-06-17 [1] CRAN (R 3.5.0)
#>  scales        1.0.0   2018-08-09 [1] CRAN (R 3.5.0)
#>  sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 3.5.0)
#>  stringi       1.3.1   2019-02-13 [1] CRAN (R 3.5.2)
#>  stringr     * 1.4.0   2019-02-10 [1] CRAN (R 3.5.2)
#>  tibble      * 2.0.1   2019-01-12 [1] CRAN (R 3.5.2)
#>  tidyr       * 0.8.2   2018-10-28 [1] CRAN (R 3.5.0)
#>  tidyselect    0.2.5   2018-10-11 [1] CRAN (R 3.5.0)
#>  tidyverse   * 1.2.1   2017-11-14 [1] CRAN (R 3.5.0)
#>  usethis       1.4.0   2018-08-14 [1] CRAN (R 3.5.0)
#>  utf8          1.1.4   2018-05-24 [1] CRAN (R 3.5.0)
#>  withr         2.1.2   2018-03-15 [1] CRAN (R 3.5.0)
#>  xfun          0.4     2018-10-23 [1] CRAN (R 3.5.0)
#>  xml2          1.2.0   2018-01-24 [1] CRAN (R 3.5.0)
#>  yaml          2.2.0   2018-07-25 [1] CRAN (R 3.5.0)
#>
#> [1] /Users/ppaine/Library/R/3.5/library
#> [2] /Library/Frameworks/R.framework/Versions/3.5/Resources/library
```

batpigandme commented 5 years ago

Everything works for me with https://github.com/tidyverse/tidyr/pull/511 installed, if you want to try it out.

remotes::install_github("tidyverse/tidyr#511")
library(tidyverse)
mtcars %>% as_tibble() %>% mutate(cyl2 = factor(cyl, c(2L, 4L, 6L, 8L))) %>% nest(-cyl2)
#> # A tibble: 3 x 2
#>   cyl2  data              
#>   <fct> <list>            
#> 1 4     <tibble [11 × 11]>
#> 2 6     <tibble [7 × 11]> 
#> 3 8     <tibble [14 × 11]>
mtcars %>% as_tibble() %>% mutate(cyl2 = factor(cyl, c(2L, 4L, 6L, 8L))) %>% nest(-cyl2) %>% slice(1) %>% unnest(data)
#> # A tibble: 11 x 12
#>    cyl2    mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 4      22.8     4 108      93  3.85  2.32  18.6     1     1     4     1
#>  2 4      24.4     4 147.     62  3.69  3.19  20       1     0     4     2
#>  3 4      22.8     4 141.     95  3.92  3.15  22.9     1     0     4     2
#>  4 4      32.4     4  78.7    66  4.08  2.2   19.5     1     1     4     1
#>  5 4      30.4     4  75.7    52  4.93  1.62  18.5     1     1     4     2
#>  6 4      33.9     4  71.1    65  4.22  1.84  19.9     1     1     4     1
#>  7 4      21.5     4 120.     97  3.7   2.46  20.0     1     0     3     1
#>  8 4      27.3     4  79      66  4.08  1.94  18.9     1     1     4     1
#>  9 4      26       4 120.     91  4.43  2.14  16.7     0     1     5     2
#> 10 4      30.4     4  95.1   113  3.77  1.51  16.9     1     1     5     2
#> 11 4      21.4     4 121     109  4.11  2.78  18.6     1     1     4     2

Created on 2019-02-21 by the reprex package (v0.2.1.9000)
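
The mismatch reported at the top of this issue (rows with cyl 8 stored under key 6) can also be checked programmatically rather than by eye. The following is a rough sketch, not part of tidyr: `check_nest_keys()` is a hypothetical helper, and the nested frame is built with base `split()` so the example does not depend on any particular tidyr version. In principle the same helper could be pointed at the output of `nest()`, on the assumption that the list-column is named `data` as in the outputs above.

```r
# Hypothetical helper: TRUE only if every nested data frame is non-NULL and
# all of its `inner` values equal the key stored alongside it.
check_nest_keys <- function(nested, key, inner, data_col = "data") {
  ok <- mapply(
    function(k, d) is.data.frame(d) && all(as.character(d[[inner]]) == k),
    as.character(nested[[key]]),
    nested[[data_col]]
  )
  all(ok)
}

# Build a nested frame in base R from the factor with an unused level "2".
cyl2 <- factor(mtcars$cyl, levels = c(2L, 4L, 6L, 8L))
groups <- split(mtcars, cyl2, drop = TRUE)  # drop = TRUE skips the empty level
nested <- data.frame(cyl2 = factor(names(groups), levels = levels(cyl2)))
nested$data <- unname(groups)               # list-column of data frames

# Simulate the reported bug by pairing keys with the wrong groups.
nested_bad <- nested
nested_bad$data <- rev(nested$data)

print(check_nest_keys(nested, "cyl2", "cyl"))
print(check_nest_keys(nested_bad, "cyl2", "cyl"))
```

A NULL entry in the list-column, as in the failing output above, also makes the helper return FALSE because of the `is.data.frame()` guard.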