hubverse-org / hubverse-cloud

Test hub for S3 data submission and storage

How will we automate the conversion of hub data to parquet after syncing to S3? #20

Closed bsweger closed 8 months ago

bsweger commented 9 months ago

Per #18, we will default to parquet format for making hub data available in the cloud.

So far, we have a GitHub action that syncs model-output data to S3 exactly as submitted by teams (we sync admin/config files as well, but I'm assuming those aren't relevant to this conversation).

I'd advocate for retaining the submitted data in its original form (perhaps under a "raw data" path) and then landing the parquet version in the client/user-facing location.
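
To make that layout concrete, here's a minimal sketch of the key mapping I have in mind; the "raw/" prefix and the parquet destination are illustrative placeholders, not a decision:

```python
# Hypothetical bucket layout: raw submissions under "raw/", client-facing
# parquet copies under the hub's normal model-output prefix.
from pathlib import PurePosixPath


def client_key_for(raw_key: str) -> str:
    """Map a raw submission key to its client-facing parquet key.

    e.g. "raw/model-output/team-model/2024-01-06-team-model.csv"
      -> "model-output/team-model/2024-01-06-team-model.parquet"
    """
    path = PurePosixPath(raw_key)
    relative = path.relative_to("raw")            # drop the raw-data prefix
    return str(relative.with_suffix(".parquet"))  # always land as parquet
```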

nickreich commented 9 months ago

One relevant consideration might be that a hub might choose to have its files submitted as parquet (this is allowable in a hub schema). In this case, would we still need to duplicate the data in the cloud?

bsweger commented 9 months ago

@nickreich Good note! I think there is value in making a distinction between "raw data" and "user/client-facing data" for all hubs, regardless of their submission format:

The data I've seen in hubs so far is very small by cloud standards, so I'm not worried about duplication. [edited to add: the "raw data" would be for our internal use--or maybe for use by teams who want access to their human-readable submission data--but we wouldn't want it accessible by clients such as hubUtils]

bsweger commented 9 months ago

At a high level (without getting into implementation details), I'd like to brainstorm how we might trigger follow-up data operations after a hub receives a new model-output submission:

  1. Lean into AWS: We could write a lambda function that converts data to parquet and use S3 triggers to run the function every time new data lands (a rough sketch of such a lambda follows this list).
  2. Lean into GitHub: Package a data conversion function and have it run as an additional step to the "S3 sync" GitHub action. Have the GitHub action send both versions of the data (raw and parquet) to S3.
  3. Use both: Write a lambda that converts the data and use the GitHub action to trigger it after the raw data lands in S3.
  4. ?? What am I missing ??
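
To make option 1 more concrete, here is a minimal sketch of what the conversion lambda might look like, assuming CSV submissions, a pandas/pyarrow runtime, and the "raw/" prefix layout mentioned above; every name and path here is a placeholder:

```python
# Minimal sketch of the option-1 lambda; the bucket layout, the "raw/" prefix,
# and the CSV-only assumption are all hypothetical.
import io
import urllib.parse

import boto3
import pandas as pd

s3 = boto3.client("s3")


def handler(event, context):
    """Convert a newly landed model-output file to parquet.

    Invoked by an S3 "object created" notification; each record carries the
    bucket and key of the file that was just written.
    """
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        if not key.endswith(".csv"):
            continue  # e.g. parquet submissions might simply be copied as-is

        # read the raw submission and write a parquet copy under the
        # client-facing prefix (placeholder layout)
        obj = s3.get_object(Bucket=bucket, Key=key)
        df = pd.read_csv(obj["Body"])

        buffer = io.BytesIO()
        df.to_parquet(buffer, index=False)  # requires pyarrow in the lambda runtime
        parquet_key = key.removeprefix("raw/").rsplit(".", 1)[0] + ".parquet"
        s3.put_object(Bucket=bucket, Key=parquet_key, Body=buffer.getvalue())
```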

Also noodling on some variables that would influence our choice (e.g., what should happen if the data conversion step fails, and whether we want our GitHub checks to fail and block the merge).

I don't have much experience with S3 triggers/lambdas and plan to spend some time learning how they work.

elray1 commented 9 months ago

r.e. "What should happen if the data conversion step fails? Do we want our GitHub checks to fail and block the merge?" -- I think in general we should minimize burden on participating teams. So if data conversion fails but a team's contribution was valid, I'd like to say the submission was valid and the team is done with their work, and hub administrators have to follow up.

bsweger commented 8 months ago

Revisiting the options from my earlier comment now that we've had some additional conversations and experimental learnings (e.g., provisioning AWS resources via code).

I believe we should use a cloud-based trigger to initiate conversions/transformations on model-output files submitted to a hub.

Because we're already using AWS, I propose exploring the use of S3 triggers, which can invoke various actions when data is written to or removed from an S3 bucket.
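
As a rough illustration of what that wiring looks like with boto3 (in practice this would live in our infrastructure-as-code), here's a sketch; the bucket name, lambda ARN, and prefix filter are all placeholders:

```python
# Sketch of an S3 "object created" trigger invoking a conversion lambda.
# All names are placeholders; the lambda also needs a resource-based policy
# (lambda add_permission) allowing S3 to invoke it.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="example-hub-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "convert-model-output-to-parquet",
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:example-convert",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "raw/model-output/"}
                        ]
                    }
                },
            }
        ]
    },
)
```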

Will close this and follow up with more specific issues re: experimentation with S3 triggers.