I don't usually dive this deep into R internals and happily use what the tidyverse offers, but this time something there seems to be inefficient.
I have a bunch of rather small data files that need to be read in and combined into one continuous dataset. Some information is also encoded in the filenames and needs to be extracted and added to the dataset. Up to now my setup using vroom-1.6.5 under R version 4.3.2 worked fine, but I only had a handful of files. Moving to bigger datasets, I ran into performance issues that I tried to track down and solve, but the cause seems to lie deeper than my understanding of R goes.
So let's put together an example with 1000 files containing 800 integers each (only about 4 MB of data in total):
I'm using data.table::fwrite() because readr::write_csv() (commented out) is really slow on this job (8 s for fwrite() vs. 132 s for write_csv()), but that's another story.
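The generation code itself is not shown here, so the following is only a hypothetical reconstruction of a setup matching the description (1000 files, 800 integers each). The output folder, the file naming scheme, the column name `value`, the seed, and the helper `write_one()` are all assumptions made for illustration. It uses data.table::fwrite() when available and falls back to base R otherwise.

```r
# Hypothetical reconstruction: the original file-generation code is not shown.
set.seed(42)                                         # assumed seed
out_dir <- file.path(tempdir(), "many_small_files")  # assumed output folder
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

n_files <- 1000L  # number of files, as described in the text
n_rows  <- 800L   # integers per file, as described in the text

# Assumed naming scheme; the real filenames encode information not shown here.
filenames <- file.path(out_dir, sprintf("sample_%04d.csv", seq_len(n_files)))

# Prefer data.table::fwrite(); fall back to base R if data.table is missing.
write_one <- if (requireNamespace("data.table", quietly = TRUE)) {
  function(df, path) data.table::fwrite(df, path)
} else {
  function(df, path) utils::write.csv(df, path, row.names = FALSE, quote = FALSE)
}

# One single-column file of random integers per filename.
for (path in filenames) {
  write_one(data.frame(value = sample.int(100000L, n_rows)), path)
}
```

The resulting `filenames` vector is what the reading code below expects.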
Reading back data with vroom() (altrep off!) or alternatively with data.table::fread():
```r
# Read data with vroom
timerstart <- Sys.time()
data_vroom <- vroom::vroom(
  file = filenames,
  delim = ",",
  col_types = "i",
  id = "src_file",
  altrep = FALSE,
  progress = FALSE,
  num_threads = 2
)
print(paste0("vroom read: ", difftime(Sys.time(), timerstart, units = "secs")))
#> [1] "vroom read: 9.20988512039185"

# Read data with data.table
timerstart <- Sys.time()
data_fread_lists <- lapply(
  filenames,
  data.table::fread,
  sep = ",",
  colClasses = "integer"
)
# Assign names to list
names(data_fread_lists) <- filenames
# Use names to label rows
data_fread <- data.table::rbindlist(
  data_fread_lists,
  idcol = "src_file"
)
rm(data_fread_lists)
print(paste0("fread read: ", difftime(Sys.time(), timerstart, units = "secs")))
#> [1] "fread read: 3.06666684150696"
```
There is also a speed difference, but that's not my point here. It gets interesting when one wants to use the src_file column. But first, let's make sure both datasets start from the same point by converting them to tibbles and dropping attributes:
Here is what puzzled me for a while, and what I could only get to behave "as expected" by taking a rather unusual route: the two datasets are identical, yet a simple mutate() takes unexpectedly long on the data read with vroom() (about as long as the initial read). It also takes much longer than on the data read with fread(). Again: the datasets look the same and compare as equal (see above).
Therefore I tried (with my limited knowledge of R internals) to inspect both objects before the mutation and got:
Apparently "altrep" is involved for the "src_file" column when the data has been read with vroom(), whereas it is not mentioned for the equivalent object read with fread().
The mutated objects do not show this difference and look very much alike:
There may be more hints in the mass of numbers and abbreviations this function prints about the internals, but I'm certainly not the one to interpret those 😱.
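For anyone who wants to repeat this kind of check: the inspection call used above is not shown, so the sketch below assumes base R's `.Internal(inspect())`, which is one way to print an object's internal header. The helper name `inspect_lines()` is hypothetical. The example uses a base R integer sequence (which R stores as a compact ALTREP object) next to an ordinary integer vector, purely to show what to compare; it does not reproduce the vroom output.

```r
# Hypothetical helper: capture what .Internal(inspect()) prints for an object.
inspect_lines <- function(x) {
  utils::capture.output(.Internal(inspect(x)))
}

compact_seq <- 1:10           # base R keeps this as a compact ALTREP sequence
plain_vec   <- c(3L, 1L, 2L)  # ordinary integer vector for comparison

# Print both headers; the ALTREP object carries an extra annotation.
cat(inspect_lines(compact_seq), sep = "\n")
cat(inspect_lines(plain_vec), sep = "\n")
```

Applied to a single column, for example `inspect_lines(data_vroom_tibble$src_file)`, this keeps the output short enough to compare the two datasets side by side.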
Final test: mutating something again on those already changed objects, which no longer mention any "altrep":
Bingo: the "altrep stuff" is gone, and it seems to have been the cause of the unusual delay in mutate().
Side note: repeating the mutate() on the original objects (data_vroom_tibble / data_fread_tibble) does not change the timings at all, as I had already experienced in the past when reading files with vroom "lazily".
So the points I would like to raise:
- Even with altrep switched off in vroom(), it still seems to be used somehow for the id column, which seems inconsistent to me.
- I would not mind if it still relied on altrep in the background for good reasons, but it evidently does major harm to performance in my case.
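One possible stopgap, offered only as an untested idea: rebuild the id column as a freshly allocated character vector before calling mutate(). The helper `as_plain_character()` below is hypothetical, and whether it actually removes the delay for vroom's id column is an assumption that has not been verified here. The sketch demonstrates the copy on a base R vector only.

```r
# Hypothetical helper: rebuild a character vector element by element, so the
# result is a newly allocated vector with the same values and no names.
as_plain_character <- function(x) {
  vapply(x, identity, character(1), USE.NAMES = FALSE)
}

# Stand-in input: base R may represent this conversion lazily via ALTREP.
deferred <- as.character(1:5)
plain <- as_plain_character(deferred)

# The values are unchanged; only the underlying storage is rebuilt.
stopifnot(identical(plain, deferred))
```

In the setting above this would be applied as `data_vroom_tibble$src_file <- as_plain_character(data_vroom_tibble$src_file)`, at the cost of one extra pass over the column.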
(Rather long) example created on 2023-12-21 with reprex v2.0.2