tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org
Other
621 stars 60 forks source link

Include `id` column in `col_select` by default #457

Closed sbearrows closed 2 years ago

sbearrows commented 2 years ago

Fixes #416, Fixes #455

The id column now does not have to be included in col_select (but still can be).

out1 <- glue::glue("a,b,c,d
               1,2,3,4")
out2 <- glue::glue("a,b,c,d
               5,6,7,8")

tf1 <- withr::local_tempfile(fileext = ".csv", lines = out1)
tf2 <- withr::local_tempfile(fileext = ".csv", lines = out2)

# don't need to provide the id column and
vroom(
  c(tf1, tf2),
  id = "source",
  col_select = c("a", "c"),
  show_col_types = FALSE
)
#> # A tibble: 2 × 3
#>   source                  a     c
#>   <chr>               <dbl> <dbl>
#> 1 /var/folders/4g/9j…     1     3
#> 2 /var/folders/4g/9j…     5     7

# but you still have the option to include it
vroom(
  c(tf1, tf2),
  id = "source",
  col_select = c("a", "c", "source"),
  show_col_types = FALSE
)
#> # A tibble: 2 × 3
#>       a     c source             
#>   <dbl> <dbl> <chr>              
#> 1     1     3 /var/folders/4g/9j…
#> 2     5     7 /var/folders/4g/9j…

This PR also fixes a bug when providing numeric column positions to col_select when the id column was provided. Previously, when indexing columns, indexing started at id rather than the first column position making it impossible to select some column positions numerically.

library(vroom)
out1 <- glue::glue("a,b,c,d
               1,2,3,4")
out2 <- glue::glue("a,b,c,d
               5,6,7,8")

tf1 <- withr::local_tempfile(fileext = ".csv", lines = out1)
tf2 <- withr::local_tempfile(fileext = ".csv", lines = out2)

vroom(
  c(tf1, tf2),
  id = "source",
  col_select = c(1:3),
  show_col_types = FALSE
)
#> # A tibble: 2 × 3
#>   source                                                                 a     b
#>   <chr>                                                              <dbl> <dbl>
#> 1 /var/folders/4g/9jcx0hbd6r92152m0qrq64yw0000gn/T//Rtmp8bp0zR/file…     1     2
#> 2 /var/folders/4g/9jcx0hbd6r92152m0qrq64yw0000gn/T//Rtmp8bp0zR/file…     5     6

Numeric positions now relate only to the original data and the id column is included by default.

out1 <- glue::glue("a,b,c,d
               1,2,3,4")
out2 <- glue::glue("a,b,c,d
               5,6,7,8")

tf1 <- withr::local_tempfile(fileext = ".csv", lines = out1)
tf2 <- withr::local_tempfile(fileext = ".csv", lines = out2)

vroom(
  c(tf1, tf2),
  id = "source",
  col_select = c(1:3),
  show_col_types = FALSE
)
#> # A tibble: 2 × 4
#>   source            a     b     c
#>   <chr>         <dbl> <dbl> <dbl>
#> 1 /var/folders…     1     2     3
#> 2 /var/folders…     5     6     7
jennybc commented 2 years ago

Also, this in the PR description:

Fixes #416,#455

won't actually close the second issue.

https://docs.github.com/en/issues/tracking-your-work-with-issues/linking-a-pull-request-to-an-issue#linking-a-pull-request-to-an-issue-using-a-keyword

See the bit about "Multiple issues"

sbearrows commented 2 years ago

Okay, I think this is ready for another review @jennybc

sbearrows commented 2 years ago

I've opened an issue concerning col_select = list() in readr here https://github.com/tidyverse/readr/issues/1421

There was one list() I missed and then it looks like the docs for vroom_fwf(col_select=) comes from readr::read_fwf() so I think that should be fixed in the readr PR. But it also feels like readr should inherit from vroom and not the other way around?

jennybc commented 2 years ago

the docs for vroom_fwf(col_select=) comes from readr::read_fwf() so I think that should be fixed in the readr PR. But it also feels like readr should inherit from vroom and not the other way around?

Yes, I think the definitive docs for col_select should live in vroom because readr 1e doesn't actually implement col_select. Anyone successfully calling readr with col_select is actually calling vroom.

sbearrows commented 2 years ago

Okay, I think we should do that in a separate PR and push what we have now. I've opened an issue https://github.com/tidyverse/vroom/issues/461 to update the docs for vroom so that they aren't inherited from readr