tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org

The `id` argument needs to be more discoverable from documentation #505

Closed MarekGierlinski closed 11 months ago

MarekGierlinski commented 1 year ago

I often deal with data which is split into multiple files identified only by the file name. For example, experimental results from multiple samples, where the file name identifies the sample, but there is no information about the sample inside the file. It would be useful to have an option, while reading multiple files, to add a column with either the input file name or a name specified in a vector provided alongside input file names.

Here is an example. Consider 3 TSV files:

File sample_a.txt:

gene    count
gene1   19
gene2   22
gene3   14

File sample_b.txt:

gene    count
gene1   26
gene2   24
gene3   18

File sample_c.txt:

gene    count
gene1   22
gene2   17
gene3   24

A command:

files <- fs::dir_ls()
df <- vroom(files, col_file_name = "sample_file")

would create the following tibble:

# A tibble: 9 × 3
  gene  count sample_file 
  <chr> <int> <chr>       
1 gene1    19 sample_a.txt
2 gene2    22 sample_a.txt
3 gene3    14 sample_a.txt
4 gene1    26 sample_b.txt
5 gene2    24 sample_b.txt
6 gene3    18 sample_b.txt
7 gene1    22 sample_c.txt
8 gene2    17 sample_c.txt
9 gene3    24 sample_c.txt

Alternatively, a vector of names could be provided to populate the column. For example, file_names = c("a", "b", "c") would place a, b and c in the column instead of the file names. You can probably come up with better names for these additional arguments.

I hope I'm not the only one who would find this useful.

jennybc commented 1 year ago

You can use the id argument of vroom() for this.

`id`: Either a string or `NULL`. If a string, the output will contain a variable with that name with the filename(s) as the value. If `NULL`, the default, no variable will be created.

But this is not advertised well in vroom's documentation, I will admit. It is more discoverable in readr, which is where most vroom usage actually originates. Here's an example borrowed from readr:

library(vroom)

continents <- c("africa", "americas", "asia", "europe", "oceania")
filepaths <- vapply(
  paste0("mini-gapminder-", continents, ".csv"),
  FUN = readr::readr_example,
  FUN.VALUE = character(1)
)
vroom(filepaths, id = "file")
#> Rows: 26 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): country
#> dbl (4): year, lifeExp, pop, gdpPercap
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 26 × 6
#>    file                                   country  year lifeExp    pop gdpPercap
#>    <chr>                                  <chr>   <dbl>   <dbl>  <dbl>     <dbl>
#>  1 /Users/jenny/Library/R/arm64/4.3/libr… Algeria  1952    43.1 9.28e6     2449.
#>  2 /Users/jenny/Library/R/arm64/4.3/libr… Angola   1952    30.0 4.23e6     3521.
#>  3 /Users/jenny/Library/R/arm64/4.3/libr… Benin    1952    38.2 1.74e6     1063.
#>  4 /Users/jenny/Library/R/arm64/4.3/libr… Botswa…  1952    47.6 4.42e5      851.
#>  5 /Users/jenny/Library/R/arm64/4.3/libr… Burkin…  1952    32.0 4.47e6      543.
#>  6 /Users/jenny/Library/R/arm64/4.3/libr… Burundi  1952    39.0 2.45e6      339.
#>  7 /Users/jenny/Library/R/arm64/4.3/libr… Argent…  1952    62.5 1.79e7     5911.
#>  8 /Users/jenny/Library/R/arm64/4.3/libr… Bolivia  1952    40.4 2.88e6     2677.
#>  9 /Users/jenny/Library/R/arm64/4.3/libr… Brazil   1952    50.9 5.66e7     2109.
#> 10 /Users/jenny/Library/R/arm64/4.3/libr… Canada   1952    68.8 1.48e7    11367.
#> # ℹ 16 more rows

Created on 2023-08-07 with reprex v2.0.2.9000
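
Applied to the three sample files from the original report, the same approach might look like the sketch below. It is only an illustration: the temporary directory and the `sample` column are hypothetical names, and shortening the path to a label is ordinary post-processing in base R, not something `id` does. It assumes the files are tab-delimited and named `sample_<label>.txt`.

```r
library(vroom)

# Recreate the three sample files from the report in a temporary directory
# (hypothetical location, used only to make the example self-contained).
sample_dir <- tempfile("samples_")
dir.create(sample_dir)
counts <- list(a = c(19, 22, 14), b = c(26, 24, 18), c = c(22, 17, 24))
for (s in names(counts)) {
  d <- data.frame(gene = paste0("gene", 1:3), count = counts[[s]])
  vroom_write(d, file.path(sample_dir, paste0("sample_", s, ".txt")), delim = "\t")
}

files <- sort(list.files(sample_dir, pattern = "^sample_.*\\.txt$", full.names = TRUE))

# `id` adds a column holding the path each row was read from.
df <- vroom(files, id = "sample_file", delim = "\t", show_col_types = FALSE)

# Post-processing, not a vroom feature: reduce each path to a short sample label
# by dropping the directory, the extension, and the "sample_" prefix.
df$sample <- sub("^sample_", "", tools::file_path_sans_ext(basename(df$sample_file)))
```

The `id` column stores whatever path was passed to `vroom()`, so deriving labels afterwards covers the "vector of names" case without an extra argument.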

I'm going to change the title of this issue to reflect the need for documentation.

jennybc commented 1 year ago

"Reading multiple files" is featured prominently in the README, so that would be an obvious place to use or at least mention id. Probably in addition to adding an example for vroom().

MarekGierlinski commented 1 year ago

Oh, indeed, it is there. I was actually learning vroom from the tidyverse blog, which also contains a section on reading multiple files. It would be nice to update this one too, if the author is available to do it.

Thanks a lot for your help and for being so nice, as this is essentially an RTFM issue.