Currently, when passing string data to ‘vroom’ (i.e. via vroom::vroom(I(mydata), ...)), it writes that memory out to disk into a temporary file before reading the data back in. Unfortunately this round-trip adds a substantial overhead: in a quick test, reading a memory-based table with 200k rows (~ 50 MiB) was almost 2.5 times slower than a disk-based file (1.47 s ± 0.41 vs. 0.59 s ± 0.06). It is even slower than using read.csv() (including manual data conversion). This is a single data point, though reproducible for the same data.
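For comparison, base R can parse an in-memory string directly through a text connection, with no temporary file involved. A minimal sketch, using a small synthetic table as a stand-in for the actual data:

```r
# Build a small CSV table in memory (synthetic stand-in for `mydata`).
mydata <- paste(
  "x,y",
  "1,a",
  "2,b",
  "3,c",
  sep = "\n"
)

# read.csv(text = ...) wraps the string in a textConnection(),
# so parsing happens entirely in memory -- no temp-file round-trip.
df <- read.csv(text = mydata)
str(df)
```

This is the behaviour one might naively expect from vroom::vroom(I(mydata), ...) as well; the timings quoted above suggest the temp-file detour is what dominates.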
Interestingly, using shared memory/tmpfs doesn’t help very much: the overhead probably comes mostly from the actual copying of the data and the kernel context switching at the file IO boundary.
This might well be out of scope for ‘vroom’: if I understand correctly, the whole point of this package is the memory-based indexing/lazy loading of disk storage. However, ‘vroom’ is now the backing implementation for ‘readr’, and it would be great if ‘readr’ efficiently supported memory-based data as a first-class citizen.
(My use case is tabular data received as CSV from a web API, so the data never needs to hit the disk. The web API in question specifically returns CSV because JSON would be too large/slow. Of course, receiving data over the network will usually be slower than disk access anyway, so the actual overhead is a lot less than the 2.5x quoted above.)