tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org

Vroom ignores col_names when dealing with imperfect data #522

Open D3SL opened 10 months ago

D3SL commented 10 months ago

Real-world data is almost never perfect. Minor raggedness in a CSV can be caused by any number of things, ranging from missing quotes around a string that contains the delimiter to simple typos. One of R's greatest strengths is just how good it is at dealing with situations like this. For example, the trivial solution used to be defining placeholder columns in col_names (or equivalent). This would allow you to read the data and then clean it inside R:

with_edition(1,
read_csv(
  col_names = c("testrow","name","region","region2","test"),
  skip=1,
I("testrow,name,region,test\n
1,jim,footown,06\n
2,bob,footown,41\n
3,tom,footown, bobstreet,99\n
4,steve,footown, bobstreet,47\n
5,george,footown, bobstreet,62\n"))
)

# A tibble: 5 × 5
  testrow name   region  region2    test
    <dbl> <chr>  <chr>   <chr>     <dbl>
1       1 jim    footown 06           NA
2       2 bob    footown 41           NA
3       3 tom    footown bobstreet    99
4       4 steve  footown bobstreet    47
5       5 george footown bobstreet    62

In vroom, and now in new versions of readr, this is impossible. Even with col_names explicitly defined, there is no way to force readr/vroom to do the right thing:

vroom::vroom(
  col_names = c("testrow","name","region","region2","test"),
  skip=1,
I("testrow,name,region,test\n
1,jim,footown,06\n
2,bob,footown,41\n
3,tom,footown, bobstreet,99\n
4,steve,footown, bobstreet,47\n
5,george,footown, bobstreet,62\n"))

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 5 × 4
  testrow name   region  region2     
    <dbl> <chr>  <chr>   <chr>       
1       1 jim    footown 06          
2       2 bob    footown 41          
3       3 tom    footown bobstreet,99
4       4 steve  footown bobstreet,47
5       5 george footown bobstreet,62
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)
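For anyone hitting the same output, a quick way to see which lines are ragged before handing the file to any reader is to count fields per line. This is a minimal sketch in base R only, not part of vroom or readr; the variable names (csv_lines, n_fields, ragged) are made up for illustration, and the data is the example from this issue.

```r
# Example data from this issue, one element per line of the file.
csv_lines <- c(
  "testrow,name,region,test",
  "1,jim,footown,06",
  "2,bob,footown,41",
  "3,tom,footown, bobstreet,99",
  "4,steve,footown, bobstreet,47",
  "5,george,footown, bobstreet,62"
)

# count.fields() is in the utils package, attached by default.
# A text connection is already open, so it must be closed by hand.
con <- textConnection(csv_lines)
n_fields <- count.fields(con, sep = ",")
close(con)

# Line numbers whose field count differs from the header's.
ragged <- which(n_fields != n_fields[1])

print(n_fields)
print(ragged)
```

Here the header and the first two data lines have 4 fields and the last three have 5, so the ragged lines are file lines 4 to 6. That at least explains why vroom complains about expecting 4 columns.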

There are three very important issues here.

The first is that a de facto monopoly in the R ecosystem has once again made a very user-hostile breaking change without any announcement, warning, or even documentation.

The second is the documentation. Not only is this behavior not documented, the documentation that does exist explicitly leads users to believe the opposite will happen:

col_names Either TRUE, FALSE or a character vector of column names.

If TRUE, the first row of the input will be used as the column names, and will not be included in the data frame. If FALSE, column names will be generated automatically: X1, X2, X3 etc.

If col_names is a character vector, the values will be used as the names of the columns, and the first row of the input will be read into the first row of the output data frame.

Missing (NA) column names will generate a warning, and be filled in with dummy names ...1, ...2 etc. Duplicate column names will generate a warning and be made unique, see name_repair to control how this is done.

And the third is the behavior itself. It's a severe antipattern to have an argument like col_names and then silently ignore the user's input, leaving them wondering why they've provided 5 column names and the function is giving errors about expecting 4 columns.

The ideal solution is obviously that user input should be authoritative. If a user supplies 5 column names, vroom should return 5 columns, with NAs where appropriate. But at an absolute minimum, the documentation should be changed to explicitly state that col_names is only a suggestion and will be ignored based on what vroom decides under the hood.
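Until that changes, one possible workaround is to fall back to base R, where supplying more column names than the first lines contain and keeping fill = TRUE pads short rows with NA. This is only a sketch of the behavior I would expect from col_names, written with utils::read.csv rather than vroom or readr; the names csv_text and df are made up for illustration.

```r
# Same example data as above, as a single string.
csv_text <- paste(
  "testrow,name,region,test",
  "1,jim,footown,06",
  "2,bob,footown,41",
  "3,tom,footown, bobstreet,99",
  "4,steve,footown, bobstreet,47",
  "5,george,footown, bobstreet,62",
  sep = "\n"
)

df <- read.csv(
  text = csv_text,
  header = FALSE,            # the header row is skipped, not parsed
  skip = 1,
  col.names = c("testrow", "name", "region", "region2", "test"),
  fill = TRUE,               # pad rows that have fewer fields with NA
  strip.white = TRUE,        # drop the space before "bobstreet"
  stringsAsFactors = FALSE   # keep strings as character on older R versions
)

print(df)
```

As with the first-edition readr result shown earlier, the short rows are padded on the right, so 06 and 41 end up in region2 and test is NA for those rows. The data still needs cleaning afterwards, but all 5 columns are there to clean.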