tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org

‘vroom’ silently truncates CSV files with mismatched quotes #484

Open klmr opened 1 year ago

klmr commented 1 year ago

See the blog post and the discussion it triggered.

Reprex:

> vroom::vroom('x.csv', show_col_types = FALSE)
# A tibble: 0 × 4
# … with 4 variables: A <chr>, B <chr>, C <chr>, D <chr>
# ℹ Use `colnames()` to see all variable names

> data.table::fread('x.csv')
     A   B                   C  D
1: foo bar "ba\nz,bat\n1,2,3,4 NA
Warning message:
In data.table::fread("x.csv") :
  Detected 4 column names but the data has 3 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this warning.

> cat(paste(readLines('x.csv'), collapse = '\n'))
A,B,C,D
foo,bar,"ba
z,bat
1,2,3,4

Of course, CSV encompasses various formats, but even so it’s not clear why ‘vroom’ treats the file as valid yet containing no data rows (despite having more than one line). I therefore assume this is unintentional: I can’t think of a situation where silently dropping rows would be the expected behaviour.

From a user perspective, there are two likely scenarios:

  1. a broken file which is missing the closing quotation mark;
  2. a file without quotes, and quote = "" should have been passed.

(2) is a user error and can thus be ignored here (in fact, passing quote = "" leads to warnings on the above file, which is expected). (1) should ideally generate a warning or even an error.
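
Until scenario (1) is reported by the reader itself, a caller could guard against it with a pre-check along these lines. This is only a rough sketch in base R, not anything ‘vroom’ provides: `has_unbalanced_quotes()` is a hypothetical helper, and it assumes RFC 4180-style quoting, where embedded quotes are doubled and a well-formed file therefore contains an even number of quote characters.

```r
# Hypothetical helper (not part of vroom): TRUE when the file contains an odd
# number of quote characters, which suggests a missing closing quote.
# Assumes embedded quotes are escaped by doubling them.
has_unbalanced_quotes <- function(path, quote = '"') {
  lines <- readLines(path, warn = FALSE)
  # Count the occurrences of the quote character on each line.
  n_quotes <- lengths(regmatches(lines, gregexpr(quote, lines, fixed = TRUE)))
  sum(n_quotes) %% 2L == 1L
}

# Same content as the file in the report above.
path <- tempfile(fileext = '.csv')
writeLines(c('A,B,C,D', 'foo,bar,"ba', 'z,bat', '1,2,3,4'), path)

if (has_unbalanced_quotes(path)) {
  warning('Odd number of quote characters in ', path,
          '; rows may be dropped when the file is read.')
}
```

The check is a heuristic: it cannot tell which field is broken, and it will also flag files of type (2), where the quote character is ordinary data and quote = "" should have been passed.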

hadley commented 1 year ago

Would you mind making this a self-contained reprex?

klmr commented 1 year ago

Apologies, I’ve no idea why I didn’t initially post it as one.

writeLines(
  c('A,B,C,D', 'foo,bar,"ba', 'z,bat'),
  'x.csv'
)

vroom::vroom('x.csv', show_col_types = FALSE)
#> # A tibble: 0 × 4
#> # ℹ 4 variables: A <chr>, B <chr>, C <chr>, D <chr>

data.table::fread('x.csv')
#> Warning in data.table::fread("x.csv"): Detected 4 column names but the data has
#> 3 columns. Filling rows automatically. Set fill=TRUE explicitly to avoid this
#> warning.
#>      A   B          C  D
#> 1: foo bar "ba\nz,bat NA

hadley commented 1 year ago

Thanks! A little simpler and more focused on the specific problem:

vroom::vroom(
  I(c('A,B,C', 'd,e,"f', 'g,h,i')),
  quote = '"',
  show_col_types = FALSE
)
#> # A tibble: 0 × 3
#> # ℹ 3 variables: A <chr>, B <chr>, C <chr>

Created on 2023-08-02 with reprex v2.0.2