tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

`read_delim_chunked` takes much more memory than expected? #1410

Open timothy-barry opened 2 years ago

timothy-barry commented 2 years ago

I am using the read_delim_chunked function to process large text files chunk by chunk. My expectation is that the memory used by each chunk is released once that chunk has been processed. However, this does not seem to be the case: reading the file in chunks appears to require as much memory as reading it all at once. I assume that this is a bug, but maybe my understanding of read_delim_chunked is incorrect. The purpose of reading by chunk is to conserve memory, right? Thanks!
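For readers following along, here is a minimal sketch of the usage pattern described above. It is not the original poster's code: the temporary file, the `sum_chunk` callback, and the chunk size are all assumptions made for illustration. The callback keeps only a running total, so no chunk needs to outlive its own callback call.

```r
library(readr)

# Hypothetical input: a small space-delimited file written to a temporary path.
path <- tempfile(fileext = ".txt")
write_delim(data.frame(a = 1:1000, b = 1:1000, c = 1:1000), path, delim = " ")

# Keep only a running total; each chunk can be discarded after the callback returns.
total <- 0
sum_chunk <- function(chunk, pos) {
  total <<- total + sum(chunk$a)
}

# Read 100 rows at a time and hand each chunk to the callback.
read_delim_chunked(
  file = path,
  callback = SideEffectChunkCallback$new(sum_chunk),
  delim = " ",
  chunk_size = 100,
  col_types = "iii"
)

total
```

With a callback like this, one would expect peak memory to scale with `chunk_size` rather than with the size of the file, which is the expectation the issue is questioning.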

timothy-barry commented 2 years ago

After a bit of searching through the issues on this repo, I noticed that at least one other person seems to be encountering this issue as well: https://github.com/tidyverse/readr/issues/1120#issuecomment-1055255383.

timothy-barry commented 2 years ago

Additional note: this seems to be a more pervasive issue than I had realized. I tried loading a sequence of files via readr::read_delim. R ran out of memory despite the fact that (i) each file fits into memory on its own and (ii) I loaded the files one at a time, removing each before loading the next.

# readr: runs out of memory
# fs is a character vector of input file paths (definition not shown)
for (f in fs) {
  print(paste0("Loading ", f))
  x <- readr::read_delim(file = f,
                         delim = " ",
                         skip = 2,
                         col_types = "iii")
  rm(x); gc()
}

I repeated this experiment with data.table's fread function; everything works as expected.

# data.table: everything works
for (f in fs) {
  print(paste0("Loading ", f))
  x <- data.table::fread(file = f,
                         sep = " ",
                         colClasses = c("integer", "integer", "integer"),
                         skip = 2)
  rm(x); gc()
}

As far as I can tell, the current version of readr seems to suffer from a more general memory leak that is not limited to the chunked readers, unfortunately.

ben18785 commented 1 year ago

I am having the same issue. The memory use increases almost monotonically even though the individual chunks are small.

timothy-barry commented 1 year ago

Any updates or workarounds? Can I use edition 1 (via with_edition(1, ...) or local_edition(1)) to resolve this issue, at least for the time being?
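For reference, a minimal sketch of how the two edition switches mentioned above are called. The temporary file and the `read_first_edition` helper are hypothetical, and nothing in this thread establishes whether switching editions changes the memory behaviour; the sketch only shows the calling pattern.

```r
library(readr)

# Hypothetical input: a small space-delimited file written to a temporary path.
path <- tempfile(fileext = ".txt")
write_delim(data.frame(a = 1:10, b = 1:10, c = 1:10), path, delim = " ")

# Option 1: evaluate a single call under the first-edition parser.
x1 <- with_edition(1, read_delim(path, delim = " ", col_types = "iii"))

# Option 2: switch editions for the remainder of the enclosing function.
read_first_edition <- function(file) {
  local_edition(1)  # reverts automatically when the function exits
  read_delim(file, delim = " ", col_types = "iii")
}
x2 <- read_first_edition(path)
```

`with_edition()` suits a one-off call, while `local_edition()` is more convenient when several reads inside one function should all use the same edition.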

arthurgailes commented 1 year ago

Having the same problem here.

hadley commented 1 year ago

To investigate this issue we'll need a reprex, and some indication of how you're measuring R's memory consumption.
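One possible shape for such a reprex is sketched below. This is an assumption about what would be useful, not something the maintainers specified: the generated file stands in for the real data, `r_heap_mb` is a hypothetical helper, and `gc()` reports only memory managed by R's own heap, so memory held outside it would have to be measured separately at the process level.

```r
library(readr)

# Hypothetical stand-in for the large input files discussed in this thread.
path <- tempfile(fileext = ".txt")
write_delim(data.frame(a = 1:100000, b = 1:100000, c = 1:100000),
            path, delim = " ")

# Memory used by R objects, in Mb: sum of the "used (Mb)" column of gc().
r_heap_mb <- function() sum(gc()[, 2])

# Read the same file repeatedly, dropping the result each time,
# and record the heap size after every iteration.
usage <- numeric(5)
for (i in seq_along(usage)) {
  x <- read_delim(path, delim = " ", col_types = "iii")
  rm(x)
  usage[i] <- r_heap_mb()
}

usage
```

If the recorded values stay flat while the operating system reports the R process growing, that would suggest the memory is being held outside R's heap. Including the output of `sessionInfo()` alongside the numbers records the readr version in use.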