tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org

Add collector-level na support #541

Open khusmann opened 4 months ago

khusmann commented 4 months ago

This PR adds support for collector-level na args (#532). This way, different lists of missing values can be specified for each column, overriding the global na arg in the call to vroom().

Example:

vroom(
  I("a,b,c\na,foo,REFUSED\nb,REFUSED,MISSING\nOMITTED,bar,OMITTED\n"),
  col_types = cols(
    a = col_character(na = "OMITTED"),
    b = col_character(na = "REFUSED"),
    c = col_character()
  ),
  na = "MISSING"
)
#> # A tibble: 3 × 3
#>   a     b     c      
#>   <chr> <chr> <chr>  
#> 1 a     foo   REFUSED
#> 2 b     NA    NA     
#> 3 NA    bar   OMITTED

Without this PR, it is very difficult to efficiently read columns with different lists of missing values. Instead, they have to be loaded as character vectors, then parsed with readr::parse_*() or readr::type_convert(). There are two problems with this:
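For reference, the workaround described above might look roughly like the sketch below. It reuses the data from the example and assumes that passing na = character() to vroom() disables missing-value detection during the read, so the sentinel strings survive until readr::parse_character() is applied to each column. The per-column na lists are chosen here only to reproduce the output shown earlier.

```r
# Step 1: read every column as character and treat nothing as missing,
# so the raw sentinel strings are preserved.
raw <- vroom::vroom(
  I("a,b,c\na,foo,REFUSED\nb,REFUSED,MISSING\nOMITTED,bar,OMITTED\n"),
  delim = ",",
  col_types = vroom::cols(.default = vroom::col_character()),
  na = character()
)

# Step 2: re-parse each column with its own list of missing values.
raw$a <- readr::parse_character(raw$a, na = "OMITTED")
raw$b <- readr::parse_character(raw$b, na = "REFUSED")
raw$c <- readr::parse_character(raw$c, na = "MISSING")

raw
```

Every column has to be materialized as a character vector first and then parsed a second time, which is the extra pass that collector-level na avoids.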

I'm hoping you'll consider this PR for inclusion in vroom: it requires only a few changes, is 100% backwards compatible, and adds a feature that cannot otherwise be implemented in a separate package (without duplicating all of vroom's internals). Please let me know if there is anything more I can do to advocate for it. Thank you for your consideration!

khusmann commented 4 months ago

Note that this is failing the check for windows-latest (3.6) because the runner is grabbing the latest version of evaluate, which now requires R >= 4.0.0.