Decide on a data format for hubverse cloud storage

bsweger commented 9 months ago

In addition to deciding how we want to organize the data for hubs in the cloud, we also want to determine a default data format.

I'm not sure if we'd want to provide this as an admin-configurable option in the future, but it would be great to have an opinion re: a sensible default to get started.

nickreich commented 9 months ago

Do we not just want to say "parquet", given the initial results? Is that what you mean by "format"? (Also, fwiw, I'm posting this comment from Slack.)

bsweger commented 9 months ago

Adding this issue to retroactively capture the decision we made in the 2024-02-14 Hubverse dev meeting, and to draw a clearer distinction between the data organization conversation (#11) and the data format.

Based on the benchmarking work that @annakrystalli did (https://partition-benchmarking.netlify.app/), parquet emerged as a good candidate for hubverse data storage.

If we adopt a process of retaining the hub submissions on S3 in their original form (in addition to providing client-facing parquet files), hubs that collect model outputs in .csv format will still be able to access the original .csv files if needed.
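To make the "originals plus client-facing parquet" idea concrete, here is a minimal sketch of how an original submission's S3 key could map to the key of its parquet copy. The `raw` prefix, the example file names, and the `client_key` helper are all assumptions for illustration; nothing in this thread fixes a bucket layout.

```python
from pathlib import PurePosixPath

# Hypothetical prefix under which original submissions are retained.
# The actual bucket layout is not decided in this thread.
RAW_PREFIX = "raw"


def client_key(raw_key: str) -> str:
    """Map the key of an original submission to the key of its parquet copy.

    The original object (e.g. a .csv file) stays where it is; this only
    computes where the client-facing parquet version would live.
    """
    parts = PurePosixPath(raw_key).parts
    if len(parts) < 2 or parts[0] != RAW_PREFIX:
        raise ValueError(f"expected a key under '{RAW_PREFIX}/': {raw_key!r}")
    # Drop the raw prefix and swap the file extension for .parquet.
    relative = PurePosixPath(*parts[1:])
    return str(relative.with_suffix(".parquet"))


if __name__ == "__main__":
    original = "raw/model-output/teamA-model/2024-02-14-teamA-model.csv"
    print(original, "->", client_key(original))
```

With a mapping like this, a hub that collects .csv submissions keeps them untouched under the raw prefix, while clients read only the parquet keys.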

bsweger commented 9 months ago

@nickreich ha, we must have been typing at the same time

Yes, that was the decision; I just wanted to log it here.

hubverse-org/hubverse-cloud, issue #18