h2oai / db-benchmark

reproducible benchmark of database-like ops
https://h2oai.github.io/db-benchmark
Mozilla Public License 2.0

Make datasets more accessible #241

Open MrPowers opened 2 years ago

MrPowers commented 2 years ago

Thanks for the excellent work on this project.

I'd like to experiment with the datasets and would rather not have to generate them myself. I've never used R and don't really want to learn it at the moment. I'm more interested in questions such as whether using broadcast joins would materially affect the Spark benchmarks.

Can you provide downloadable data files? Or can you make the files accessible on S3? I'm making important data files accessible to the community in an S3 bucket, so I'd also be happy to upload them there if that'd help.

Thanks again for building / maintaining this project. Hope I'll be able to contribute!

ncclementi commented 2 years ago

Hi there, checking in here: is there any update on making the data files available in an S3 bucket? I'd really appreciate it, especially for the 1e9 case, which seems to be problematic to generate (see https://github.com/h2oai/db-benchmark/issues/110).

Thank you. cc: @jangorecki

MrPowers commented 2 years ago

We could also make the 50 GB dataset accessible in S3 as multiple gzipped files that users could download and reassemble on their local machines. That'd let users download the parts in parallel from S3 and avoid having to handle one massive file. Thoughts @jangorecki / @ncclementi?
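To make the split / parallel download / reassemble idea concrete, here is a rough Python sketch. It is only an illustration under stated assumptions: the function names (`split_to_gzip_parts`, `fetch_part`, `download_and_reassemble`) and the part naming scheme are hypothetical, and `fetch_part` just copies a local file as a stand-in for a real S3 or HTTP download.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def split_to_gzip_parts(src: Path, out_dir: Path, chunk_bytes: int) -> List[Path]:
    """Split src into numbered gzip parts of at most chunk_bytes uncompressed bytes."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    with src.open("rb") as f:
        index = 0
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            # Zero-padded index so lexicographic order matches part order.
            part = out_dir / f"{src.name}.part{index:05d}.gz"
            with gzip.open(part, "wb") as g:
                g.write(chunk)
            parts.append(part)
            index += 1
    return parts


def fetch_part(remote: Path, local_dir: Path) -> Path:
    """Stand-in for a download (assumption): copies a local file.

    Replace the body with a real S3/HTTP client call.
    """
    local = local_dir / remote.name
    shutil.copyfile(remote, local)
    return local


def download_and_reassemble(
    remote_parts: List[Path], work_dir: Path, dest: Path, workers: int = 4
) -> Path:
    """Fetch all parts concurrently, then decompress them in order into dest."""
    work_dir.mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        local_parts = list(pool.map(lambda p: fetch_part(p, work_dir), remote_parts))
    with dest.open("wb") as out:
        for part in sorted(local_parts):
            with gzip.open(part, "rb") as g:
                shutil.copyfileobj(g, out)
    return dest
```

The parts are cut on byte boundaries rather than row boundaries, so a single part is not necessarily a valid CSV on its own; only the reassembled file is. Splitting on row boundaries instead would let engines read the parts directly.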

jangorecki commented 2 years ago

Hi, you need to contact h2o support. I am no longer the maintainer of the project.

MrPowers commented 2 years ago

ok @jangorecki, will do. Thanks for your great contributions on this project.