Since you've identified this, would it be worth modifying the parquet taxi benchmark to reflect this new-ish query?
We can, but I assume it will change the expected time, which is why I didn't. We could:
a. version it (though we can't pass that all the way through to conbench yet), or
b. call it something else (do we keep the old one then? I could go either way)
If we go with a., we should flip all the `collect()`s in that one to last.
WRT this specific point, I'd lean towards a., simply because I'm not sure how relevant a post-collect summarise is to arrow.
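For context, the difference looks roughly like this (a minimal sketch with illustrative column names, not the benchmark's actual query): with `collect()` last, the `summarise()` is pushed down to Arrow and only the aggregated result comes back to R; with `collect()` first, the filtered rows are materialized and `summarise()` runs in dplyr on an R data frame.

```r
library(arrow)
library(dplyr)

ds <- open_dataset("path/to/taxi")  # hypothetical dataset path

# collect() last: filter/group_by/summarise are evaluated by Arrow,
# so only the small aggregated table is pulled into R
ds %>%
  filter(total_amount > 0) %>%
  group_by(passenger_count) %>%
  summarise(mean_fare = mean(fare_amount)) %>%
  collect()

# collect() first: all filtered rows come into R first,
# and summarise() then runs in dplyr on the materialized data frame
ds %>%
  filter(total_amount > 0) %>%
  collect() %>%
  group_by(passenger_count) %>%
  summarise(mean_fare = mean(fare_amount))
```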
Closes #96. Largely copies `bm-dataset-taxi-parquet.R`, but adjusts queries for the available fields and data structure. Also pushes `collect()`s to after `summarize()`, as we can now (presumably we couldn't when the old one was written).

The new benchmark is pretty slow (~6m) because the dataset is CSVs; on parquet the queries run much faster. It might be interesting to extend `ensure_format()` so it works with datasets so we could compare the two, provided we don't mind the disk space usage.

The tests are wimpy because downloading the data takes a long time (~1h).
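If we did extend `ensure_format()` for datasets, the conversion itself is straightforward with arrow's dataset API; something like the following sketch (paths are placeholders, and this is not the existing `ensure_format()` implementation) would write a parquet copy we could query alongside the CSVs:

```r
library(arrow)

csv_dir <- "path/to/taxi_csv"          # assumed location of the CSV dataset
parquet_dir <- "path/to/taxi_parquet"  # assumed output location

# Scan the CSV dataset lazily and rewrite it as parquet on disk
ds <- open_dataset(csv_dir, format = "csv")
write_dataset(ds, parquet_dir, format = "parquet")

# The same queries can then be run against open_dataset(csv_dir, format = "csv")
# and open_dataset(parquet_dir) to compare the two formats.
```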