h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Data I/O timings #77

Closed by MichaelChirico 5 years ago

MichaelChirico commented 5 years ago

I've recently seen all sorts of praise for vroom as being "faster" than fread.

It seems nobody is reading the small print: vroom is just a lazy reader, which means that while the initial read will feel fast, subsequent analysis will be (much?) slower. I also presume it rates a 0 on robustness to non-cookie-cutter CSVs.

It would be helpful to evaluate my suspicion quantitatively by building vroom into one of the pipelines here. It would also be useful to include the start-up I/O costs of the other languages.
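To make the suspicion concrete, here is a minimal Python sketch of timing the read step and the first analysis step separately. It is not vroom or fread: `eager_read`, `LazyTable`, `make_csv`, and `timed` are hypothetical helpers invented for illustration, and the "lazy" reader here simply defers field parsing until a row is accessed. Actual timings will vary by machine, so none are quoted.

```python
import csv
import io
import time


def make_csv(n_rows):
    """Build an in-memory CSV with an id, a group key, and a value column."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "grp", "val"])
    for i in range(n_rows):
        writer.writerow([i, i % 10, i * 0.5])
    return buf.getvalue()


def eager_read(text):
    """Eager reader (hypothetical): parse and convert every field up front."""
    rows = list(csv.reader(io.StringIO(text)))
    return [(int(r[0]), int(r[1]), float(r[2])) for r in rows[1:]]


class LazyTable:
    """Lazy reader (hypothetical): only split lines now, parse fields on access."""

    def __init__(self, text):
        self._lines = text.splitlines()[1:]  # skip the header line
        self.parsed_rows = 0

    def __len__(self):
        return len(self._lines)

    def row(self, i):
        self.parsed_rows += 1
        a, b, c = self._lines[i].split(",")
        return int(a), int(b), float(c)


def timed(fn):
    """Return (result, elapsed seconds) for a zero-argument callable."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


text = make_csv(50_000)

# Eager: the parsing cost is paid in the read step.
eager_rows, eager_read_s = timed(lambda: eager_read(text))
eager_sum, eager_query_s = timed(lambda: sum(r[2] for r in eager_rows))

# Lazy: the read step is cheap, the parsing cost moves into the first query.
lazy, lazy_read_s = timed(lambda: LazyTable(text))
lazy_sum, lazy_query_s = timed(
    lambda: sum(lazy.row(i)[2] for i in range(len(lazy)))
)

print(f"eager: read {eager_read_s:.4f}s, query {eager_query_s:.4f}s")
print(f"lazy:  read {lazy_read_s:.4f}s, query {lazy_query_s:.4f}s")
```

Reporting both columns, rather than the read time alone, is what would show whether a lazy reader is actually faster end to end or has only moved the cost.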

jangorecki commented 5 years ago

In the benchmarks we force evaluation, so the benefits of laziness are lost entirely. Therefore I don't think we should focus on it, unless it is also fast when not lazy. It could be useful if, for example, we did a groupby on a subset: the subset could then use laziness and read only the rows required for the groupby. We don't have such a test at the moment.
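The subset-then-groupby case can be sketched as follows. This is an illustration under stated assumptions, not how vroom or the benchmark works: `LazyCsv`, `group_sum`, and the `cells_parsed` counter are hypothetical, the subset is a plain row range, and the counter stands in for parsing work.

```python
from collections import defaultdict


class LazyCsv:
    """Hypothetical lazy table: keeps raw lines, parses a cell only on access."""

    def __init__(self, text):
        lines = text.splitlines()
        self.columns = lines[0].split(",")
        self._lines = lines[1:]
        self.cells_parsed = 0  # stand-in for parsing cost

    def __len__(self):
        return len(self._lines)

    def cell(self, row, column):
        self.cells_parsed += 1
        return self._lines[row].split(",")[self.columns.index(column)]

    def materialize(self):
        """Force evaluation of every cell, as a benchmark harness would."""
        return [
            {c: self.cell(r, c) for c in self.columns}
            for r in range(len(self))
        ]


def group_sum(table, rows, key, value):
    """Sum `value` per `key`, touching only the requested rows and columns."""
    totals = defaultdict(float)
    for r in rows:
        totals[table.cell(r, key)] += float(table.cell(r, value))
    return dict(totals)


# 1000 rows with columns id, grp (alternating "a"/"b"), val.
text = "id,grp,val\n" + "\n".join(
    f"{i},{'ab'[i % 2]},{i}" for i in range(1000)
)

# Lazy path: subset to the first 10 rows, then group; only 20 cells are parsed.
lazy = LazyCsv(text)
lazy_result = group_sum(lazy, range(10), "grp", "val")

# Forced path: materialize everything first, then subset and group.
forced = LazyCsv(text)
full = forced.materialize()
forced_totals = defaultdict(float)
for record in full[:10]:
    forced_totals[record["grp"]] += float(record["val"])
forced_result = dict(forced_totals)

print(lazy_result, lazy.cells_parsed)
print(forced_result, forced.cells_parsed)
```

Both paths give the same totals, but the forced path parses all 3000 cells while the lazy path parses 20. A benchmark that forces evaluation measures only the first kind of cost.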

MichaelChirico commented 5 years ago

I think your second point gets closer to what I had in mind, but I can't seem to articulate exactly what that is, so closing for now.