h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Whole script performance #177

Open ritchie46 opened 3 years ago

ritchie46 commented 3 years ago

Should the whole query be measured for a tool (loading data, casting types, answering the question)?

I ask because I am running the db-benchmark and the pandas solution spends most of its time converting strings to categorical dtype. That doesn't seem totally fair, as the runtime cost of this conversion is enormous.

It's probably related to #20: it seems that if we convert to categorical, we optimize for a specific operation but pay a cost somewhere else.
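
For illustration, here is a minimal timing sketch, not the benchmark's own code; the file name and column names are placeholders modeled on the benchmark's groupby data, so treat them as assumptions:

```python
# Sketch only: time the CSV parse, the string -> categorical cast, and a single
# groupby separately to see which phase dominates.
import time

import pandas as pd

t0 = time.time()
df = pd.read_csv("G1_1e7_1e2_0_0.csv")               # parse; key columns stay as object/string
t_read = time.time() - t0

t0 = time.time()
for col in ("id1", "id2", "id3"):                    # the expensive cast in question
    df[col] = df[col].astype("category")
t_cast = time.time() - t0

t0 = time.time()
ans = df.groupby("id1", observed=True)["v1"].sum()   # one of the benchmarked queries
t_query = time.time() - t0

print(f"read={t_read:.1f}s cast={t_cast:.1f}s query={t_query:.1f}s")
```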

jangorecki commented 3 years ago

Thank you for the suggestion. I recall a discussion about having a task that measures whole-script execution. The motivation was to have a test that would balance the benefits of R's global string cache against the cost of reading strings into an R session (which is single-threaded because of that global cache). Now that we use categorical/factor types, this specific case is no longer relevant, but I agree that having this kind of test would be useful. The most challenging part is actually designing it well.

> Should the whole query be measured for a tool (loading data, casting types, answering the question)?

Let's call it a "benchmark script" rather than a "query". The term "query" is used for atomic queries against the data (side note: we run 2 queries per question).

> the pandas solution spends most of its time converting strings to categorical dtype.

This can be (and is going to be) outsourced to python datatable, the same way we already outsource pandas read_csv to datatable's fread.
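
For reference, a minimal sketch of what such outsourcing could look like; the file name and key columns are placeholders, and the actual solution scripts may differ:

```python
# Sketch only: parse the CSV with python datatable's multi-threaded fread,
# hand the frame over to pandas, then cast the (placeholder) key columns.
import datatable as dt

DT = dt.fread("G1_1e7_1e2_0_0.csv")      # multi-threaded CSV parsing
df = DT.to_pandas()                       # convert the datatable Frame to a pandas DataFrame
for col in ("id1", "id2", "id3"):         # hypothetical string key columns
    df[col] = df[col].astype("category")
```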

> That doesn't seem totally fair, as the runtime cost of this conversion is enormous.

Strictly speaking, it is fair for the "groupby" (or "join") task. The cost of importing (or casting) data into an environment seems to fit best into a "read" task: https://github.com/h2oai/db-benchmark/issues/131. With the processes well separated, we can present them coherently in the report. I don't see any good way to fold such extra timings into the current benchmark plots.
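
As a sketch of one way to keep the processes separated in a harness, each task could get its own timing entry, with a whole-script figure derived as their sum; the helper below and the task callables it expects are hypothetical, not the benchmark's actual API:

```python
# Sketch only: record each named task's wall-clock time separately, then add a
# derived whole-script total. Task names and the dict layout are assumptions.
import time
from typing import Callable, Dict


def run_timed(tasks: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Run each named task once and record its elapsed time separately."""
    timings: Dict[str, float] = {}
    for name, task in tasks.items():
        t0 = time.time()
        task()
        timings[name] = time.time() - t0
    timings["whole script"] = sum(timings.values())
    return timings
```

Used as, say, `run_timed({"read": load_and_cast, "groupby": answer_questions})` (both callables hypothetical), the categorical cast would be filed under "read" while a whole-script total is still reported.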

ritchie46 commented 3 years ago

Right, I agree that casting could be seen as reading or preparation and is not part of the groupby/join operation.

Anyway, great work! I hope that a whole-script performance task becomes part of the benchmark.