Closed bsweger closed 9 months ago
Do we not just want to say "parquet", given the initial results? Is that what you mean by "format"? (Also, fwiw, I'm posting this comment from Slack.)
Adding this issue to retroactively capture the decision we made in the 2024-02-14 Hubverse dev meeting, and to make a clearer distinction between the data organization conversation (#11) and the data format conversation.
Based on the benchmarking work that @annakrystalli did (https://partition-benchmarking.netlify.app/), parquet emerged as a good candidate for hubverse data storage.
If we adopt a process of retaining the hub submissions on S3 in their original form (in addition to providing client-facing parquet files), hubs that collect model outputs in .csv format will still be able to access the original .csv files if needed.
@nickreich ha, we must have been typing at the same time
Yes, that was the decision; I just wanted to log it here.
In addition to deciding how we want to organize the data for hubs in the cloud, we also want to determine a default data format.
I'm not sure if we'd want to provide this as an admin-configurable option in the future, but it would be great to have an opinion re: a sensible default to get started.