tidyverse / readxl

Read excel files (.xls and .xlsx) into R 🖇
https://readxl.tidyverse.org
Other
729 stars 194 forks source link

Error with duplicate names if some packages are loaded + numbering of duplicates #482

Closed gregleleu closed 6 years ago

gregleleu commented 6 years ago

Hi,

I'm having recurrent issues while reading an excel file with duplicate column names after some package are loaded, e.g. ggmosaic (see below). ggmosaic is one package I have identified but it happens with others I haven't been able to pinpoint.

Also, I've tried using the latest github version of readxl (instead of the CRAN one), but that breaks my code as duplicate names stop being numbered colname_2, colname_3 etc. but by their position in the original file eg. colname_18, colname_34 etc... Is that intentional? Is there a way to chose between both methods?

Thanks

> readxl::read_excel("test.xlsx")
# A tibble: 2 x 3
   colA  colB colB__1
  <dbl> <dbl>   <dbl>
1     1     2       3
2     2     3       4

> library(ggmosaic)
Loading required package: ggplot2
Loading required package: tibble
Loading required package: productplots

Attaching package: ‘ggmosaic’

The following objects are masked from ‘package:productplots’:

    ddecker, hspine, mosaic, prodcalc, spine, vspine

> readxl::read_excel("test.xlsx")
Error: Column `colB` must have a unique name

sessionInfo before loading ggmosaic (but after reading the excel file)

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] compiler_3.5.0   assertthat_0.2.0 cli_1.0.0        tools_3.5.0      pillar_1.2.2     rstudioapi_0.7   tibble_1.4.2     crayon_1.3.4    
 [9] Rcpp_0.12.16     utf8_1.1.3       cellranger_1.1.0 readxl_1.1.0     rlang_0.2.0.9001

... and after

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggmosaic_0.1.2.9000 productplots_0.1.1  tibble_1.4.2        ggplot2_2.2.1.9000 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16      cellranger_1.1.0  pillar_1.2.2      compiler_3.5.0    plyr_1.8.4        bindr_0.1.1       tools_3.5.0       digest_0.6.15    
 [9] jsonlite_1.5      gtable_0.2.0      viridisLite_0.3.0 pkgconfig_2.0.1   rlang_0.2.0.9001  cli_1.0.0         rstudioapi_0.7    bindrcpp_0.2.2   
[17] withr_2.1.2       dplyr_0.7.4       httr_1.3.1        htmlwidgets_1.2   grid_3.5.0        glue_1.2.0        data.table_1.11.2 R6_2.2.2         
[25] plotly_4.7.1.9000 readxl_1.1.0      purrr_0.2.4       tidyr_0.8.0       magrittr_1.5      scales_0.5.0.9000 htmltools_0.3.6   assertthat_0.2.0 
[33] colorspace_1.3-2  utf8_1.1.3        lazyeval_0.2.1    munsell_0.4.3     crayon_1.3.4     
jennybc commented 6 years ago

I will try to reproduce what you report with CRAN readxl then ggmosaic.

As for this:

Also, I've tried using the latest github version of readxl (instead of the CRAN one), but that breaks my code as duplicate names stop being numbered colname_2, colname_3 etc. but by their position in the original file eg. colname_18, colname_34 etc... Is that intentional? Is there a way to chose between both methods?

It is very intentional. I explain somewhat in the blog post announcing v1.1.0 (look at the "Future outlook" section near the end) and also in the NEWS items that will be part of the next release: https://github.com/tidyverse/readxl/blame/master/NEWS.md#L3-L9. That contains a link to the issue where the tidyverse team discusses name repair.

batpigandme commented 6 years ago

I'm not able to reproduce the issue with the dev. version of readxl and the CRAN version of ggmosaic.

head(readxl::read_excel("nba_teams.xlsx"))
#> # A tibble: 6 x 7
#>   id                    location  name    nickname slug  team_lab logo_lab
#>   <chr>                 <chr>     <chr>   <chr>    <chr> <chr>    <chr>   
#> 1 20901970-53a0-417c-b… Atlanta   Atlanta Hawks    nba-… atl      ATL     
#> 2 1c65bbb6-bd10-4ef6-8… Boston    Boston  Celtics  nba-… bos      BOS     
#> 3 84eb19ca-1e66-416f-9… Brooklyn  Brookl… Nets     nba-… bk       BKN     
#> 4 68b04d26-12c3-4e06-8… Charlotte Charlo… Hornets  nba-… cha      CHA     
#> 5 7e670063-ef8d-4356-9… Chicago   Chicago Bulls    nba-… chi      CHI     
#> 6 a9abb922-3a47-4d37-9… Cleveland Clevel… Cavalie… nba-… cle      CLE

library(ggmosaic)
#> Loading required package: ggplot2
#> Loading required package: productplots
#> 
#> Attaching package: 'ggmosaic'
#> The following objects are masked from 'package:productplots':
#> 
#>     ddecker, hspine, mosaic, prodcalc, spine, vspine

head(readxl::read_excel("nba_teams.xlsx"))
#> # A tibble: 6 x 7
#>   id                    location  name    nickname slug  team_lab logo_lab
#>   <chr>                 <chr>     <chr>   <chr>    <chr> <chr>    <chr>   
#> 1 20901970-53a0-417c-b… Atlanta   Atlanta Hawks    nba-… atl      ATL     
#> 2 1c65bbb6-bd10-4ef6-8… Boston    Boston  Celtics  nba-… bos      BOS     
#> 3 84eb19ca-1e66-416f-9… Brooklyn  Brookl… Nets     nba-… bk       BKN     
#> 4 68b04d26-12c3-4e06-8… Charlotte Charlo… Hornets  nba-… cha      CHA     
#> 5 7e670063-ef8d-4356-9… Chicago   Chicago Bulls    nba-… chi      CHI     
#> 6 a9abb922-3a47-4d37-9… Cleveland Clevel… Cavalie… nba-… cle      CLE

Created on 2018-05-17 by the reprex package (v0.2.0).

Here's a link to the spreadsheet I used: https://www.dropbox.com/s/2h3o3xvq0rqzfnv/nba_teams.xlsx?dl=0

jennybc commented 6 years ago

I believe the report is re: the CRAN version of readxl + ggmosaic and, presumably, a spreadsheet with duplicate names (though sheet not provided by OP).

jennybc commented 6 years ago

I cannot reproduce the problem with v1.1.0 of readxl. I have a test sheet that requires lots of name repair, so I've used it as my subject.

packageVersion("readxl")
#> [1] '1.1.0'
readxl::read_excel("~/rrr/readxl/tests/testthat/sheets/unnamed-duplicated-columns.xlsx")
#> # A tibble: 2 x 4
#>    X__1 var2   X__2 var2__1
#>   <dbl> <chr> <dbl> <chr>  
#> 1     1 a       1.1 aa     
#> 2     2 b       2.1 bb
library(ggmosaic)
#> Loading required package: ggplot2
#> Loading required package: productplots
#> 
#> Attaching package: 'ggmosaic'
#> The following objects are masked from 'package:productplots':
#> 
#>     ddecker, hspine, mosaic, prodcalc, spine, vspine
readxl::read_excel("~/rrr/readxl/tests/testthat/sheets/unnamed-duplicated-columns.xlsx")
#> # A tibble: 2 x 4
#>    X__1 var2   X__2 var2__1
#>   <dbl> <chr> <dbl> <chr>  
#> 1     1 a       1.1 aa     
#> 2     2 b       2.1 bb

Created on 2018-05-17 by the reprex package (v0.2.0).

Session info ``` r devtools::session_info() #> ─ Session info ────────────────────────────────────────────────────────── #> setting value #> version R version 3.4.3 (2017-11-30) #> os macOS Sierra 10.12.6 #> system x86_64, darwin15.6.0 #> ui X11 #> language (EN) #> collate en_CA.UTF-8 #> tz America/Vancouver #> date 2018-05-17 #> #> ─ Packages ────────────────────────────────────────────────────────────── #> package * version date #> ansistrings 1.0.0.9000 2018-04-17 #> assertthat 0.2.0 2017-04-11 #> backports 1.1.2 2018-04-14 #> bindr 0.1.1 2018-03-13 #> bindrcpp 0.2.2 2018-03-29 #> callr 2.0.3 2018-04-25 #> cellranger 1.1.0.9000 2018-04-03 #> cli 1.0.0.9002 2018-04-17 #> clisymbols 1.2.0 2017-05-21 #> colorspace 1.3-2 2016-12-14 #> crayon 1.3.4 2018-04-02 #> data.table 1.10.4-3 2017-10-27 #> debugme 1.1.0.9000 2018-01-29 #> desc 1.2.0 2018-05-01 #> devtools 1.13.5.9000 2018-05-04 #> digest 0.6.15 2018-01-28 #> dplyr 0.7.4.9000 2018-03-09 #> evaluate 0.10.1 2017-06-24 #> ggmosaic * 0.1.2 2017-02-09 #> ggplot2 * 2.2.1 2016-12-30 #> glue 1.2.0.9000 2018-05-07 #> gtable 0.2.0 2016-02-26 #> hms 0.4.2 2018-04-16 #> htmltools 0.3.6 2017-04-28 #> htmlwidgets 1.0 2018-01-20 #> httr 1.3.1 2017-08-20 #> jsonlite 1.5 2017-06-01 #> knitr 1.20 2018-02-20 #> lazyeval 0.2.1 2017-10-29 #> magrittr 1.5 2014-11-22 #> memoise 1.1.0 2018-01-15 #> munsell 0.4.3 2016-02-13 #> pillar 1.2.2 2018-04-26 #> pkgbuild 1.0.0 2018-05-10 #> pkgconfig 2.0.1 2017-03-21 #> pkgload 1.0.0 2018-05-10 #> plotly 4.7.1 2017-07-29 #> plyr 1.8.4 2016-06-08 #> prettyunits 1.0.2 2015-07-13 #> productplots * 0.1.1 2016-07-02 #> progress 1.1.2.9003 2018-04-17 #> purrr 0.2.4.9000 2018-03-29 #> R6 2.2.2 2017-06-17 #> Rcpp 0.12.16 2018-03-13 #> readxl 1.1.0 2018-04-20 #> rlang 0.2.0.9001 2018-04-25 #> rmarkdown 1.9.8 2018-03-31 #> rprojroot 1.3-2 2018-01-03 #> scales 0.5.0 2017-08-24 #> selectr 0.4-1 2018-04-06 #> sessioninfo 1.0.1.9000 2017-10-26 #> stringi 1.2.2 2018-05-02 #> stringr 1.3.0 2018-02-19 #> testthat 2.0.0 2017-12-13 #> tibble 1.4.2.9001 2018-04-25 #> tidyr 0.8.0 2018-01-29 #> tidyselect 0.2.4 2018-02-26 #> usethis 1.3.0.9000 2018-05-08 #> utf8 1.1.3 2018-01-03 #> viridisLite 0.2.0 2017-03-24 #> withr 2.1.2 2018-03-18 #> xml2 1.2.0 2018-01-24 #> yaml 2.1.19 2018-05-01 #> source #> Github (r-lib/ansistrings@4e4d309) #> CRAN (R 3.4.0) #> Github (r-lib/backports@cee9348) #> CRAN (R 3.4.4) #> CRAN (R 3.4.4) #> Github (r-lib/callr@3e433ba) #> local #> Github (r-lib/cli@c408924) #> CRAN (R 3.4.0) #> CRAN (R 3.4.0) #> Github (gaborcsardi/crayon@95b3eae) #> CRAN (R 3.4.2) #> Github (r-lib/debugme@f4a116a) #> CRAN (R 3.4.3) #> Github (r-lib/devtools@13ee56b) #> CRAN (R 3.4.3) #> Github (tidyverse/dplyr@9f707d2) #> CRAN (R 3.4.1) #> CRAN (R 3.4.0) #> CRAN (R 3.4.0) #> Github (tidyverse/glue@7230ed2) #> CRAN (R 3.4.0) #> Github (tidyverse/hms@c0cfc01) #> CRAN (R 3.4.0) #> CRAN (R 3.4.3) #> CRAN (R 3.4.1) #> CRAN (R 3.4.0) #> CRAN (R 3.4.3) #> CRAN (R 3.4.2) #> CRAN (R 3.4.0) #> Github (hadley/memoise@611cfad) #> CRAN (R 3.4.0) #> CRAN (R 3.4.3) #> Github (r-lib/pkgbuild@0457039) #> CRAN (R 3.4.0) #> Github (r-lib/pkgload@35efedd) #> CRAN (R 3.4.1) #> CRAN (R 3.4.0) #> CRAN (R 3.4.0) #> CRAN (R 3.4.0) #> Github (gaborcsardi/progress@e525de4) #> Github (tidyverse/purrr@84ce1ad) #> CRAN (R 3.4.0) #> CRAN (R 3.4.4) #> CRAN (R 3.4.4) #> Github (tidyverse/rlang@82b2727) #> Github (rstudio/rmarkdown@8cfc878) #> CRAN (R 3.4.2) #> CRAN (R 3.4.1) #> CRAN (R 3.4.4) #> Github (r-lib/sessioninfo@c871d01) #> CRAN (R 3.4.3) #> CRAN (R 3.4.3) #> CRAN (R 3.4.3) #> Github (tidyverse/tibble@b90a4fd) #> CRAN (R 3.4.3) #> CRAN (R 3.4.3) #> local #> CRAN (R 3.4.2) #> CRAN (R 3.4.0) #> Github (r-lib/withr@79d7b0d) #> CRAN (R 3.4.3) #> CRAN (R 3.4.3) ```
jennybc commented 6 years ago

@gregleleu I note that several of the package potentially involved are at dev versions for you.

Specifically, ggmosaic, and ggplot2 catch my eye. Can you reproduce the problem with CRAN versions of one or both of those?

jennybc commented 6 years ago

I think it's very likely this is a problem in ggmosiac, which is exporting its own method for as_tibble.list().

https://github.com/haleyjeppson/ggmosaic/blob/9c88d627b8b50432e00918ccd6a3405477257f83/R/scale-product.R#L83

cc @haleyjeppson

jennybc commented 6 years ago

Issue filed in ggmosaic, which is where I believe this problem originates: https://github.com/haleyjeppson/ggmosaic/issues/21