MarekGierlinski closed this 1 year ago
You can use the `id` argument of `vroom()` for this.
id Either a string or 'NULL'. If a string, the output will contain a variable with that name with the filename(s) as the value. If 'NULL', the default, no variable will be created.
But this is not advertised well in vroom's documentation, I will admit. It is more discoverable in readr, which is where most vroom usage actually originates. Here's an example borrowed from readr:
library(vroom)
continents <- c("africa", "americas", "asia", "europe", "oceania")
filepaths <- vapply(
  paste0("mini-gapminder-", continents, ".csv"),
  FUN = readr::readr_example,
  FUN.VALUE = character(1)
)
vroom(filepaths, id = "file")
#> Rows: 26 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (1): country
#> dbl (4): year, lifeExp, pop, gdpPercap
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 26 × 6
#> file country year lifeExp pop gdpPercap
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 /Users/jenny/Library/R/arm64/4.3/libr… Algeria 1952 43.1 9.28e6 2449.
#> 2 /Users/jenny/Library/R/arm64/4.3/libr… Angola 1952 30.0 4.23e6 3521.
#> 3 /Users/jenny/Library/R/arm64/4.3/libr… Benin 1952 38.2 1.74e6 1063.
#> 4 /Users/jenny/Library/R/arm64/4.3/libr… Botswa… 1952 47.6 4.42e5 851.
#> 5 /Users/jenny/Library/R/arm64/4.3/libr… Burkin… 1952 32.0 4.47e6 543.
#> 6 /Users/jenny/Library/R/arm64/4.3/libr… Burundi 1952 39.0 2.45e6 339.
#> 7 /Users/jenny/Library/R/arm64/4.3/libr… Argent… 1952 62.5 1.79e7 5911.
#> 8 /Users/jenny/Library/R/arm64/4.3/libr… Bolivia 1952 40.4 2.88e6 2677.
#> 9 /Users/jenny/Library/R/arm64/4.3/libr… Brazil 1952 50.9 5.66e7 2109.
#> 10 /Users/jenny/Library/R/arm64/4.3/libr… Canada 1952 68.8 1.48e7 11367.
#> # ℹ 16 more rows
Created on 2023-08-07 with reprex v2.0.2.9000
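As the output above shows, the `id` column holds the file path as vroom records it, which is often longer than you want. A minimal sketch of shortening it to a sample label with base R follows. The temporary files, their contents, and the `sample` column name are made up for illustration and do not come from the issue.

```r
library(vroom)

# Hypothetical sample files, written to a temporary directory for illustration
dir <- tempfile("samples_")
dir.create(dir)
for (s in c("a", "b", "c")) {
  vroom_write(
    data.frame(gene = c("g1", "g2"), value = c(1, 2)),
    file.path(dir, paste0("sample_", s, ".txt")),
    delim = "\t"
  )
}

# list.files() returns the paths in alphabetical order
filepaths <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
dat <- vroom(filepaths, id = "file", delim = "\t", show_col_types = FALSE)

# Drop the directory and extension, then strip the "sample_" prefix
dat$sample <- sub("^sample_", "", tools::file_path_sans_ext(basename(dat$file)))
```

The same `basename()` step should work on the readr example files; only the prefix pattern passed to `sub()` would need to change.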
I'm going to change the title of this issue to reflect the need for documentation.
"Reading multiple files" is featured prominently in the README, so that would be an obvious place to use or at least mention `id`. Probably in addition to adding an example for `vroom()`.
Oh, indeed, it is there. I was actually learning vroom from the tidyverse blog, which also contains a section on reading multiple files. It would be nice to update this one too, if the author is available to do it.
Thanks a lot for your help and being so nice, as it is essentially an RTFM issue.
I often deal with data which is split into multiple files identified only by the file name. For example, experimental results from multiple samples, where the file name identifies the sample, but there is no information about the sample inside the file. It would be useful to have an option, while reading multiple files, to add a column with either the input file name or a name specified in a vector provided alongside input file names.
Here is an example. Consider 3 TSV files:
File sample_a.txt:
File sample_b.txt:
File sample_c.txt:
A command:
would create the following tibble:
Alternatively, a vector of names could be provided to populate the column. For example, `file_names = c("a", "b", "c")` would place `a`, `b` and `c` in the column instead of the file names. You can probably come up with better names for these additional arguments. I hope I'm not the only one who would find this useful.
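For the second variant (custom labels supplied alongside the file names), one possible workaround with the existing `id` argument is to look the labels up after reading. This is only a sketch: the temporary files, the label values, and the `sample` column name are hypothetical. It matches on `basename()` in case the path stored in the `id` column differs in form from the path that was passed in, which assumes the base file names are unique.

```r
library(vroom)

# Hypothetical input files, created here only so the example is runnable
dir <- tempfile("samples_")
dir.create(dir)
filepaths <- file.path(dir, c("sample_a.txt", "sample_b.txt", "sample_c.txt"))
for (f in filepaths) {
  vroom_write(data.frame(gene = c("g1", "g2"), value = c(1, 2)), f, delim = "\t")
}

# Custom labels, given in the same order as filepaths
file_names <- c("a", "b", "c")

dat <- vroom(filepaths, id = "file", delim = "\t", show_col_types = FALSE)

# Map each row's source file to its label by position in filepaths
dat$sample <- file_names[match(basename(dat$file), basename(filepaths))]
```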