tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org

Is vroom too memory greedy and disk intensive? #507

Closed CharlesNickmilder closed 1 year ago

CharlesNickmilder commented 1 year ago

Hello,

I discovered vroom recently while searching for a way to read only specific rows of a CSV file. Specifically, the "index" column of the output must contain all the indices I need and no other values.

After some research online, I found the following snippet, which looked promising:

test <- vroom::vroom(i) |> dplyr::filter(idpixel %in% IndicesNeeded)

where i is the filename.

Looking at the task manager, I noticed heavy read and write activity on my internal drive while the file was being mapped. This raised a first red flag for me: I need to process around 3000 files, each between 1 and 20 GB. That implies a lot of stress on my internal drive, and I don't want to wear it out. Is there a way to specify another location where these intensive operations are performed?

The second point is memory usage: once the filter has been applied, there is no need to keep the memory allocated for the whole file, especially since I don't know of a way to retrieve the unfiltered data from the middle of a pipe. To free the memory allocated for the whole file, I have to convert the extracted object (test) to another class, e.g. a data.table, and then call gc(), with a command like:

test=as.data.table(test)

As far as I understand, this means that the pointers behind the memory allocation are passed along through the pipe rather than recomputed for the filtered data. Since the rest of my workflow relies on data.table, this workaround does not hamper my work. However, I could not find any information about this behaviour anywhere.

Regards,

Charles

PS:

vroom version 1.6.1

dplyr version 1.1.2

jennybc commented 1 year ago

I was just re-watching the video below to answer a different question, but I think it's also relevant to your use case. My main advice is that perhaps you should be pre-filtering the input on the way in to R, as opposed to after reading the entire file into R. Based on what I see above, you should be able to express your filter in some concise way inside a pipe() call, which you can use with vroom(). vroom comes up in the video around the 9 minute mark.

https://youtu.be/RYhwZW6ofbI?si=HEGTk4o2P6-4zG6m
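
A minimal sketch of the pipe() pre-filtering idea mentioned above, not taken from the video. It assumes that an awk executable is on the PATH (on Windows this typically means Rtools or a similar toolchain), that the file is comma-delimited with idpixel as its first column, and that no field contains a quoted comma. The sample file and the values in IndicesNeeded are hypothetical stand-ins.

```r
library(vroom)

# Hypothetical sample file standing in for one of the large CSVs.
path <- tempfile(fileext = ".csv")
writeLines(c("idpixel,value", "1,0.5", "2,0.7", "3,0.1", "4,0.9"), path)

# Hypothetical set of indices to keep.
IndicesNeeded <- c(2, 4)

# Build an awk program that keeps the header (NR == 1) and every row
# whose first field equals one of the wanted indices.
cond <- paste(sprintf("$1 == %s", IndicesNeeded), collapse = " || ")
cmd <- sprintf("awk -F, 'NR == 1 || %s' %s", cond, shQuote(path))

# vroom reads from the pipe connection, so only the lines that awk
# emits are parsed into the resulting tibble.
test <- vroom(pipe(cmd), delim = ",", show_col_types = FALSE)
print(test)
```

With this approach the rows that do not match are discarded before they reach R, so no dplyr::filter() step is needed afterwards. For a long list of indices, an inline condition like the one above becomes unwieldy; writing the indices to a file and having the external tool look them up there would probably scale better.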