Is vroom too memory greedy and disk intensive?

Hello,

I discovered vroom recently while searching for a way to read only specific rows of a CSV file. The exact condition was that the "index" column of the output had to contain all the indices I needed and no other values.

After some searching online, I found the following snippet, which looked promising:

test <- vroom::vroom(i) |> dplyr::filter(idpixel %in% IndicesNeeded)

where i is the filename.

Looking at the task manager, I noticed heavy read and write activity on my internal drive during the mapping. This raised a first red flag for me: I need to process around 3000 files, each between 1 and 20 GB. That means a lot of stress on my internal drive, and I don't want to wear it out. Is there a way to define another location where these intensive operations are done?

The second thing I noticed is the memory usage: once the filter has been applied, there is no need to keep the memory allocated for the full file, especially since I don't know of a way to get at the intermediate data in the middle of a pipe. To free the memory allocated for the whole dataset, I have to convert the filtered result, test, to another class, e.g. a data.table, and then call gc(). The conversion is done with a command like:

test <- data.table::as.data.table(test)

As far as I understand, this means that the pointers behind the original memory allocation are passed along through the pipe rather than being recomputed for the filtered data only. Since the rest of my workflow relies on data.table, this workaround does not get in my way. However, I could not find any information about this behaviour anywhere.
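To make the workaround concrete, here is a minimal sketch of what I run for one file. The small temporary CSV and the values in IndicesNeeded are made-up stand-ins for my real inputs, and I am only assuming that the conversion followed by gc() is what releases the memory, since that is what I observe rather than something I found documented.

```r
library(data.table)

# Hypothetical stand-in for one of my real files (the real ones are 1 to 20 GB)
i <- tempfile(fileext = ".csv")
write.csv(data.frame(idpixel = 1:10, value = (1:10) * 2), i, row.names = FALSE)

# Hypothetical set of indices to keep
IndicesNeeded <- c(2, 5, 7)

# Read the file with vroom and keep only the rows whose idpixel is wanted
test <- vroom::vroom(i, show_col_types = FALSE) |>
  dplyr::filter(idpixel %in% IndicesNeeded)

# Convert the filtered result to a data.table, then trigger garbage collection
# (assumption: this is the step after which I see the memory usage drop)
test <- as.data.table(test)
invisible(gc())
```

In my real workflow this would sit inside a loop over the roughly 3000 file names.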

Regards,

Charles

PS:

vroom version 1.6.1

dplyr version 1.1.2
