tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org
Other
622 stars 60 forks source link

Altrep always used for "id" column, slowing down processing of column massively #528

Open Henning-Lenz opened 8 months ago

Henning-Lenz commented 8 months ago

Usually I'm not diving this deep into some R stuff and happily use what tidyverse offers. This time something seems to be inefficient there.

I have a bunch of rather small files with data that need to be read in and fused to a continuous dataset. Also there is some information encoded in filenames which needs to be extracted and put into the dataset, too. Up to now my setup using vroom-1.6.5 under R version 4.3.2 worked fine, but I just had a bunch of files. Moving to bigger datasets I encountered some performance issues which I tried to track down and solve, but it seems to be way deeper than my understanding of R goes.

So lets put together some example with 1000 files containing 800 integers each (in total only about 4 MB of data):

library(magrittr)

# Create test files in tempdir
tempdirname <- tempdir()
filenames <- paste0(tempdirname, "\\testfile", seq(1, 1000), ".csv")
fakedata <- tibble::tibble("Value" = sample.int(800))

# Write using 
timerstart <- Sys.time()
for (filename in filenames) {
  # readr::write_csv(fakedata, file = filename, progress = FALSE)
  data.table::fwrite(fakedata, file = filename, showProgress = FALSE)
}
print(paste0("file write: ", difftime(Sys.time(), timerstart, units = "secs")))
#> [1] "file write: 7.75113010406494"

I'm using data.table::fwrite() because readr::write_csv() (commented) is really slow on this job (8 s vs. 132 s) - but that's another story.

Reading back data with vroom() (altrep off!) or alternatively with data.table::fread():

# Read data with vroom
timerstart <- Sys.time()
data_vroom <- vroom::vroom(
  file = filenames,
  delim = c(","),
  col_types = "i",
  id = "src_file",
  altrep = FALSE,
  progress = FALSE,
  num_threads = 2
)
print(paste0("vroom read: ", difftime(Sys.time(), timerstart, units = "secs")))
#> [1] "vroom read: 9.20988512039185"

# Read data with data.table
timerstart <- Sys.time()
data_fread_lists <- lapply(
  filenames,
  data.table::fread,
  sep = ",",
  colClasses = c("integer")
)
# Assign names to list
names(data_fread_lists) <- filenames
# Use names to label rows
data_fread <- data.table::rbindlist(
  data_fread_lists,
  idcol = "src_file"
)
rm(data_fread_lists)
print(paste0("fread read: ", difftime(Sys.time(), timerstart, units = "secs")))
#> [1] "fread read: 3.06666684150696"

There is also a speed difference, but that's not what I'm aiming for, it gets interesting when one wants to use the src_file column. But first we make sure that we have a similar starting point with both datasets by converting to tibble and dropping attributes:

# Try to make objects equal
data_vroom_tibble <- tibble::as_tibble(data_vroom)
data_fread_tibble <- tibble::as_tibble(data_fread)
attr(data_fread_tibble, ".internal.selfref") <- NULL
print(all(data_vroom_tibble == data_fread_tibble))
#> [1] TRUE

# Mutate something
timerstart <- Sys.time()
data_vroom_mutated <- data_vroom_tibble %>% dplyr::mutate(src_file_new = paste0(src_file, "_1")) %>% dplyr::select(-src_file)
print(paste0("vroom mutate: ", difftime(Sys.time(), timerstart, units = "secs")))
#> [1] "vroom mutate: 10.3021099567413"
timerstart <- Sys.time()
data_fread_mutated <- data_fread_tibble %>% dplyr::mutate(src_file_new = paste0(src_file, "_1")) %>% dplyr::select(-src_file)
print(paste0("fread mutate: ", difftime(Sys.time(), timerstart, units = "secs")))
#> [1] "fread mutate: 0.218961000442505"

Here is the thing I did not get for a while and only found that it can be "as expected" when going some really alternative route: I have two datasets being identical, but a simple mutate() takes unexpectedly long when data has been read with vroom() (about the same time as initial reading). Additionally it takes way longer than compared to the data read with fread() - again: Datasets look the same and are equal (see above)

Therefore I tried to (with my limited knowledge about R internals) inspect both objects before mutation and got:

# Inspect objects
print(lobstr::sxp(data_vroom_tibble))
#> [1:0x19dd5ae9918] <VECSXP[2]> (object named:10)
#>   src_file [2:0x19dd45a73d8] <STRSXP[8e+05]> (altrep named:65535)
#>   Value [3:0x19ddc84b040] <INTSXP[8e+05]> (named:38)
#>   _attrib [4:0x19dd488f578] <LISTSXP> (named:1)
#>     class [5:0x19dd6757088] <STRSXP[3]> (named:65535)
#>     row.names [6:0x19dd52ea168] <INTSXP[2]> (named:65535)
#>     names [7:0x19dd688d568] <STRSXP[2]> (named:65535)
print(lobstr::sxp(data_fread_tibble))
#> [1:0x19dd6002a98] <VECSXP[2/1026]> (object named:9)
#>   src_file [2:0x19dcdf84040] <STRSXP[8e+05]> (named:24)
#>   Value [3:0x19dce5a3040] <INTSXP[8e+05]> (named:25)
#>   _attrib [4:0x19dd4958b60] <LISTSXP> (named:1)
#>     class [5:0x19dd686cde8] <STRSXP[3]> (named:65535)
#>     row.names [6:0x19dd5379b08] <INTSXP[2]> (named:65535)
#>     names [7:0x19dcf8a0350] <STRSXP[2/1026]> (named:65535)

Obviously there is something with "altrep" for the "src_file" column when data has been read with vroom(), which is not mentioned for the similar object read with fread().

The mutated objects do not contain this difference and very much look alike:

# Inspect mutated objects
print(lobstr::sxp(data_vroom_mutated))
#> [1:0x16a0b4975a8] <VECSXP[2]> (object named:6)
#>   Value [2:0x16a1001c040] <INTSXP[8e+05]> (named:38)
#>   src_file_new [3:0x16a03104040] <STRSXP[8e+05]> (named:19)
#>   _attrib [4:0x16a0a55ab68] <LISTSXP> (named:1)
#>     names [5:0x16a0b497fe8] <STRSXP[2]> (named:65535)
#>     row.names [6:0x16a0aaa71b0] <INTSXP[2]> (named:3)
#>     class [7:0x16a0aff4258] <STRSXP[3]> (named:65535)
print(lobstr::sxp(data_fread_mutated))
#> [1:0x16a0b4bfbe8] <VECSXP[2]> (object named:6)
#>   Value [2:0x16a02ded040] <INTSXP[8e+05]> (named:25)
#>   src_file_new [3:0x16a0c4cf040] <STRSXP[8e+05]> (named:19)
#>   _attrib [4:0x16a0a7a3250] <LISTSXP> (named:1)
#>     names [5:0x16a0b4bffa8] <STRSXP[2]> (named:65535)
#>     row.names [6:0x16a0ac31530] <INTSXP[2]> (named:3)
#>     class [7:0x16a0b003928] <STRSXP[3]> (named:65535)

There may be more hints in the mass of numbers and abbreviations spit out by this function for the internals, but that's for sure not me interpreting those 😱.

Final test with mutating something again on those already changed objects without any "altrep" mentioned:

# Mutate something again
timerstart <- Sys.time()
data_vroom_mutated2 <- data_vroom_mutated %>% dplyr::mutate(src_file_new2 = paste0(src_file_new, "_2")) %>% dplyr::select(-src_file_new)
print(paste0("vroom mutate2: ", difftime(Sys.time(), timerstart, units = "secs")))
#> [1] "vroom mutate2: 0.220932960510254"
timerstart <- Sys.time()
data_fread_mutated2 <- data_fread_mutated %>% dplyr::mutate(src_file_new2 = paste0(src_file_new, "_2")) %>% dplyr::select(-src_file_new)
print(paste0("fread mutate2: ", difftime(Sys.time(), timerstart, units = "secs")))
#> [1] "fread mutate2: 0.311928987503052"

Bingo: The "altrep stuff" is gone and seems to be the cause for the unusual delay of mutate().

Side note: If one repeats the mutate action on the original objects (data_vroom_tibble/data_fread_tibble) timings do not change at all as I already experienced in the past when reading files with vroom "lazily".

So the points I would like to raise:

  1. Even with altrep switched off in vroom() it still seems to be used somehow for the id column which seems not consistent to me
  2. I would not mind if in the background it is still relying on this for good reasons, but obviously it does some major performance harm in my case.

(Rather long) example created on 2023-12-21 with reprex v2.0.2