Since you've identified this, would it be worth modifying the parquet taxi benchmark to reflect this new-ish query?
We can, but I assume it will change the expected time, which is why I didn't. We could:
a. version it (though we can't pass that all the way through to conbench yet), or
b. call it something else (do we keep the old one then? I could go either way)
If we go with a., we should flip all the `collect()`s in that one to last.
WRT this specific point, I'd lean towards a., simply because I'm not sure how relevant a post-collect summarise is to arrow.
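For context, the difference looks roughly like this (a minimal sketch with illustrative column names, not the benchmark's actual query): with `collect()` last, the `summarise()` is pushed down to Arrow and only the aggregated result comes back to R; with `collect()` first, the filtered rows are materialized and `summarise()` runs in dplyr on an R data frame.

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/taxi")  # hypothetical dataset path

# collect() last: filter/group_by/summarise are evaluated by Arrow,
# so only the small aggregated table is pulled into R
ds %>%
  filter(total_amount > 0) %>%
  group_by(passenger_count) %>%
  summarise(mean_fare = mean(fare_amount)) %>%
  collect()

# collect() first: all filtered rows come into R first,
# and summarise() then runs in dplyr on the materialized data frame
ds %>%
  filter(total_amount > 0) %>%
  collect() %>%
  group_by(passenger_count) %>%
  summarise(mean_fare = mean(fare_amount))
```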
Closes #96. Largely copies `bm-dataset-taxi-parquet.R`, but adjusts queries for the available fields and data structure. Also pushes `collect()`s to after `summarize()`, as we can now (presumably we couldn't when the old one was written).

The new benchmark is pretty slow (~6m) because the dataset is CSVs; on parquet the queries run much faster. It might be interesting to extend `ensure_format()` so it works with datasets so we could compare the two, provided we don't mind the disk space usage.

The tests are wimpy because downloading the data takes a long time (~1h).
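If we did extend `ensure_format()` for datasets, the conversion itself is straightforward with arrow's dataset API; something like the following sketch (paths are placeholders, and this is not the existing `ensure_format()` implementation) would write a parquet copy we could query alongside the CSVs:

```r
library(arrow)

csv_dir <- "path/to/taxi_csv"          # assumed location of the CSV dataset
parquet_dir <- "path/to/taxi_parquet"  # assumed output location

# Scan the CSV dataset lazily and rewrite it as parquet on disk
ds <- open_dataset(csv_dir, format = "csv")
write_dataset(ds, parquet_dir, format = "parquet")

# The same queries can then be run against open_dataset(csv_dir, format = "csv")
# and open_dataset(parquet_dir) to compare the two formats.
```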