tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

read_delim fails on non-UTF-8 charset when delim is NULL with R 4.3.1 #1508

Open nbc opened 1 year ago

nbc commented 1 year ago

When called on https://raw.githubusercontent.com/tidyverse/readr/main/tests/testthat/enc-iso-8859-1.txt with delim = NULL, read_delim should fail with the error:

Error: Could not guess the delimiter.

This works as expected with R 4.2, but on R 4.3.1 it fails with a different error:

Error in gsub("\"[^\"]*\"", "", lines) : input string 1 is invalid
In addition: Warning message:
In gsub("\"[^\"]*\"", "", lines) :
  unable to translate 'fran<e7>ais' to a wide string
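
The warning suggests that the raw ISO-8859-1 bytes reach gsub() before they have been re-encoded. Below is a minimal base R sketch of that situation, assuming this is indeed the cause. It is not readr's code, and the variable names (latin1_bytes, line, line_utf8, stripped) are hypothetical. It builds a quoted ISO-8859-1 field by hand, shows that it is not valid UTF-8, and converts it with iconv() before applying the same regular expression as in the error message.

```r
# Bytes for the line  "français",x  encoded in ISO-8859-1.
# 0xE7 is "ç" in ISO-8859-1 and is not a valid UTF-8 sequence on its own.
latin1_bytes <- as.raw(c(
  0x22, 0x66, 0x72, 0x61, 0x6e, 0xe7, 0x61, 0x69, 0x73, 0x22, 0x2c, 0x78
))
line <- rawToChar(latin1_bytes)  # no encoding mark, so R treats it as native

validUTF8(line)
#> [1] FALSE

# Re-encode to UTF-8 before running any regular expression on it.
line_utf8 <- iconv(line, from = "ISO-8859-1", to = "UTF-8")

validUTF8(line_utf8)
#> [1] TRUE

# Same pattern as in the error message: strip quoted fields.
stripped <- gsub("\"[^\"]*\"", "", line_utf8)
stripped
#> [1] ",x"
```

Calling gsub() directly on line instead of line_utf8 is the step that would be expected to fail in a UTF-8 locale on R 4.3.x. It is left out here so that the sketch runs on any R version.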

Complete reprex:

library(readr)

readr::read_delim(
  "https://raw.githubusercontent.com/tidyverse/readr/main/tests/testthat/enc-iso-8859-1.txt",
  delim = NULL,
  locale = readr::locale(encoding = "ISO-8859-1")
)

This is my sessionInfo():

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-serial/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-serial/libopenblas-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] didoscalim_0.1.3.9000 testthat_3.1.10       devtools_2.4.5        usethis_2.2.2        

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.2 remotes_2.4.2.1   processx_3.8.2    callr_3.7.3       tzdb_0.4.0        vctrs_0.6.3       tools_4.3.1       ps_1.7.5          generics_0.1.3   
[10] curl_5.0.2        parallel_4.3.1    tibble_3.2.1      fansi_1.0.4       pkgconfig_2.0.3   desc_1.4.2        lifecycle_1.0.3   compiler_4.3.1    stringr_1.5.0    
[19] brio_1.1.3        progress_1.2.2    httpuv_1.6.11     htmltools_0.5.6   later_1.3.1       pillar_1.9.0      crayon_1.5.2      urlchecker_1.0.1  tidyr_1.3.0      
[28] ellipsis_0.3.2    cachem_1.0.8      sessioninfo_1.2.2 mime_0.12         tidyselect_1.2.0  digest_0.6.33     stringi_1.7.12    dplyr_1.1.2       diffobj_0.3.5    
[37] purrr_1.0.2       rematch2_2.1.2    rprojroot_2.0.3   fastmap_1.1.1     cli_3.6.1         magrittr_2.0.3    pkgbuild_1.4.2    utf8_1.2.3        readr_2.1.4      
[46] withr_2.5.0       prettyunits_1.1.1 waldo_0.5.1       promises_1.2.1    bit64_4.0.5       lubridate_1.9.2   timechange_0.2.0  httr_1.4.6        bit_4.0.5        
[55] hms_1.1.3         memoise_2.0.1     shiny_1.7.5       miniUI_0.1.1.1    profvis_0.3.8     rlang_1.1.1       Rcpp_1.0.11       xtable_1.8-4      glue_1.6.2       
[64] pkgload_1.3.2.1   rstudioapi_0.15.0 vroom_1.6.3       jsonlite_1.8.7    R6_2.5.1          fs_1.6.3         

ramiromagno commented 1 year ago

Same here:

library(readr)

readr::read_delim(
  "https://raw.githubusercontent.com/tidyverse/readr/main/tests/testthat/enc-iso-8859-1.txt",
  delim = NULL,
  locale = readr::locale(encoding = "ISO-8859-1")
)
#> Warning in gsub("\"[^\"]*\"", "", lines): unable to translate 'fran<e7>ais' to
#> a wide string
#> Error in gsub("\"[^\"]*\"", "", lines): input string 1 is invalid

sessionInfo()
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Arch Linux
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/libblas.so.3.11.0 
#> LAPACK: /usr/lib/liblapack.so.3.11.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Europe/Lisbon
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] readr_2.1.4
#> 
#> loaded via a namespace (and not attached):
#>  [1] crayon_1.5.2      vctrs_0.6.3       cli_3.6.1         knitr_1.44       
#>  [5] rlang_1.1.1       xfun_0.40         purrr_1.0.2       styler_1.10.2    
#>  [9] bit_4.0.5         glue_1.6.2        htmltools_0.5.6   hms_1.1.3        
#> [13] fansi_1.0.4       rmarkdown_2.25    R.cache_0.16.0    evaluate_0.21    
#> [17] tibble_3.2.1      tzdb_0.4.0        fastmap_1.1.1     yaml_2.3.7       
#> [21] lifecycle_1.0.3   compiler_4.3.1    fs_1.6.3          pkgconfig_2.0.3  
#> [25] rstudioapi_0.15.0 R.oo_1.25.0       R.utils_2.12.2    digest_0.6.33    
#> [29] R6_2.5.1          tidyselect_1.2.0  utf8_1.2.3        reprex_2.0.2     
#> [33] curl_5.0.2        parallel_4.3.1    vroom_1.6.3       pillar_1.9.0     
#> [37] magrittr_2.0.3    R.methodsS3_1.8.2 bit64_4.0.5       tools_4.3.1      
#> [41] withr_2.5.1

Created on 2023-09-29 with reprex v2.0.2