tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org
Other
1.01k stars 284 forks source link

Improve handling of multiline strings #1491

Open rmgpanw opened 1 year ago

rmgpanw commented 1 year ago

Hi, I was recently reading a table into R using the readr package but discovered that several rows were missing as some cells included a single double quotation mark.

I managed to resolve this by setting quote = "", however I wonder if the read_tsv()/related functions could be updated to at least raise a warning if improper quoting is discovered in the data? The data.table package both raises a warning and deals with this automatically, as per the following reprex.

Many thanks for considering!

```R library(tibble) library(readr) library(data.table) # create example file df <- tribble( ~code, ~description, "1", '"Single quotes', "2", "No quotes", "3", 'Single quotes"', "4", '"Pair of quotes"', "5", "No quotes" ) file_path <- tempfile(pattern = "df_with_quotes", fileext = ".tsv") write_tsv(df, file_path, quote = "none", escape = "none" ) # `readr::read_tsv()` reads inappropriately without raising an error with # default settings read_tsv(file_path, quote = "\"" ) #> Rows: 3 Columns: 2 #> ── Column specification ──────────────────────────────────────────────────────── #> Delimiter: "\t" #> chr (1): description #> dbl (1): code #> #> ℹ Use `spec()` to retrieve the full column specification for this data. #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. #> # A tibble: 3 × 2 #> code description #> #> 1 1 "Single quotes\n2\tNo quotes\n3\tSingle quotes" #> 2 4 "Pair of quotes" #> 3 5 "No quotes" # To read correctly use the `quote` argument read_tsv(file_path, quote = "" ) #> Rows: 5 Columns: 2 #> ── Column specification ──────────────────────────────────────────────────────── #> Delimiter: "\t" #> chr (1): description #> dbl (1): code #> #> ℹ Use `spec()` to retrieve the full column specification for this data. #> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. #> # A tibble: 5 × 2 #> code description #> #> 1 1 "\"Single quotes" #> 2 2 "No quotes" #> 3 3 "Single quotes\"" #> 4 4 "\"Pair of quotes\"" #> 5 5 "No quotes" # `data.table::fread()` raises a warning but reads the table correctly with # default settings fread(file_path, quote = "\"" ) #> Warning in fread(file_path, quote = "\""): Found and resolved improper quoting #> in first 100 rows. If the fields are not quoted (e.g. field separator does not #> appear within any field), try quote="" to avoid this warning. #> code description #> 1: 1 "Single quotes #> 2: 2 No quotes #> 3: 3 Single quotes" #> 4: 4 Pair of quotes #> 5: 5 No quotes # setting `quote=""` produces (almost) the same result, now without a warning fread(file_path, quote = "" ) #> code description #> 1: 1 "Single quotes #> 2: 2 No quotes #> 3: 3 Single quotes" #> 4: 4 "Pair of quotes" #> 5: 5 No quotes sessionInfo() #> R version 4.2.0 (2022-04-22) #> Platform: x86_64-apple-darwin17.0 (64-bit) #> Running under: macOS Big Sur/Monterey 10.16 #> #> Matrix products: default #> BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib #> LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib #> #> locale: #> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 #> #> attached base packages: #> [1] stats graphics grDevices utils datasets methods base #> #> other attached packages: #> [1] data.table_1.14.6 readr_2.1.4 tibble_3.1.8 #> #> loaded via a namespace (and not attached): #> [1] pillar_1.8.1 compiler_4.2.0 R.methodsS3_1.8.2 R.utils_2.12.2 #> [5] tools_4.2.0 digest_0.6.31 bit_4.0.5 evaluate_0.20 #> [9] lifecycle_1.0.3 R.cache_0.16.0 pkgconfig_2.0.3 rlang_1.0.6 #> [13] reprex_2.0.2 cli_3.6.0 rstudioapi_0.14 parallel_4.2.0 #> [17] yaml_2.3.7 xfun_0.37 fastmap_1.1.0 withr_2.5.0 #> [21] styler_1.9.1 knitr_1.42 fs_1.6.1 vctrs_0.5.2 #> [25] hms_1.1.2 tidyselect_1.2.0 bit64_4.0.5 glue_1.6.2 #> [29] R6_2.5.1 fansi_1.0.4 vroom_1.6.1 rmarkdown_2.20 #> [33] purrr_1.0.1 tzdb_0.3.0 magrittr_2.0.3 htmltools_0.5.4 #> [37] ellipsis_0.3.2 utf8_1.2.3 crayon_1.5.2 R.oo_1.25.0 ```
daranzolin commented 1 year ago

I experienced the same issue with read_csv(). A single double quote preceding one of the values omitted the remaining records. Only the warning from fread() saved me:

image

hadley commented 1 year ago

Here's a somewhat more minimal reprex:

library(readr)

lines <- 'code,description
1,"x
2,-
3,y"'
path <- tempfile()
writeLines(lines, path)

# Allows quoted string to span multiple lines
read_csv(path, col_types = list())
#> # A tibble: 1 × 2
#>    code description  
#>   <dbl> <chr>        
#> 1     1 "x\n2,-\n3,y"

# OK: explicit quote works
read_csv(path, quote = "", col_types = list())
#> # A tibble: 3 × 2
#>    code description
#>   <dbl> <chr>      
#> 1     1 "\"x"      
#> 2     2 "-"        
#> 3     3 "y\""

# warns & treats lines as quoted
data.table::fread(path, quote = "\"")
#> Warning in data.table::fread(path, quote = "\""): Found and resolved improper
#> quoting in first 100 rows. If the fields are not quoted (e.g. field separator
#> does not appear within any field), try quote="" to avoid this warning.
#>    code description
#> 1:    1          "x
#> 2:    2           -
#> 3:    3          y"

Created on 2023-07-31 with reprex v2.0.2