hubverse-org / hubverse-cloud

Test hub for S3 data submission and storage

How will we automate the conversion of hub data to parquet after syncing to S3? #20

Closed bsweger closed 8 months ago

bsweger commented 9 months ago

Per #18, we will default to parquet format for making hub data available in the cloud.

So far, we have a GitHub action that syncs model-output data to S3 exactly as submitted by teams (we sync admin/config files as well, but I'm assuming those aren't relevant to this conversation).

I'd advocate for retaining the submitted data in its original form (perhaps under a "raw data" path) and then landing the parquet version in the client/user-facing location.
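
To make that layout concrete, here's a minimal sketch of the key mapping I have in mind; the "raw/" prefix and the parquet destination are illustrative placeholders, not a decision:

```python
# Hypothetical bucket layout: raw submissions under "raw/", client-facing
# parquet copies under the hub's normal model-output prefix.
from pathlib import PurePosixPath


def client_key_for(raw_key: str) -> str:
    """Map a raw submission key to its client-facing parquet key.

    e.g. "raw/model-output/team-model/2024-01-06-team-model.csv"
      -> "model-output/team-model/2024-01-06-team-model.parquet"
    """
    path = PurePosixPath(raw_key)
    relative = path.relative_to("raw")            # drop the raw-data prefix
    return str(relative.with_suffix(".parquet"))  # always land as parquet
```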

nickreich commented 9 months ago

One relevant consideration might be that a hub might choose to have its files submitted as parquet (this is allowable in a hub schema). In this case, would we still need to duplicate the data in the cloud?

bsweger commented 9 months ago

@nickreich Good note! I think there is value in making a distinction between "raw data" and "user/client-facing data" for all hubs, regardless of their submission format:

The data I've seen in hubs so far is very small by cloud standards, so I'm not worried about duplication. [edited to add: the "raw data" would be for our internal use--or maybe for use by teams who want access to their human-readable submission data--but we wouldn't want it accessible by clients such as hubUtils]

bsweger commented 9 months ago

At a high level (without getting into implementation details), I'd like to brainstorm how we might trigger follow-up data operations after a hub receives a new model-output submission:

  1. Lean into AWS: We could write a lambda function that converts data to parquet and use S3 triggers to run the function every time new data lands (a rough sketch of such a lambda follows this list).
  2. Lean into GitHub: Package a data conversion function and have it run as an additional step to the "S3 sync" GitHub action. Have the GitHub action send both versions of the data (raw and parquet) to S3.
  3. Use both: Write a lambda that converts the data and use the GitHub action to trigger it after the raw data lands in S3.
  4. ?? What am I missing ??
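
To make option 1 more concrete, here is a minimal sketch of what the conversion lambda might look like, assuming CSV submissions, a pandas/pyarrow runtime, and the "raw/" prefix layout mentioned above; every name and path here is a placeholder:

```python
# Minimal sketch of the option-1 lambda; the bucket layout, the "raw/" prefix,
# and the CSV-only assumption are all hypothetical.
import io
import urllib.parse

import boto3
import pandas as pd

s3 = boto3.client("s3")


def handler(event, context):
    """Convert a newly landed model-output file to parquet.

    Invoked by an S3 "object created" notification; each record carries the
    bucket and key of the file that was just written.
    """
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        if not key.endswith(".csv"):
            continue  # e.g. parquet submissions might simply be copied as-is

        # read the raw submission and write a parquet copy under the
        # client-facing prefix (placeholder layout)
        obj = s3.get_object(Bucket=bucket, Key=key)
        df = pd.read_csv(obj["Body"])

        buffer = io.BytesIO()
        df.to_parquet(buffer, index=False)  # requires pyarrow in the lambda runtime
        parquet_key = key.removeprefix("raw/").rsplit(".", 1)[0] + ".parquet"
        s3.put_object(Bucket=bucket, Key=parquet_key, Body=buffer.getvalue())
```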

Also noodling on some variables that would influence our choice (e.g., what should happen if the data conversion step fails, and whether we want our GitHub checks to fail and block the merge).

I don't have much experience with S3 triggers/lambdas and plan to spend some time learning how they work.

elray1 commented 9 months ago

r.e. "What should happen if the data conversion step fails? Do we want our GitHub checks to fail and block the merge?" -- I think in general we should minimize burden on participating teams. So if data conversion fails but a team's contribution was valid, I'd like to say the submission was valid and the team is done with their work, and hub administrators have to follow up.

bsweger commented 8 months ago

Revisiting the options from my earlier comment now that we've had some additional conversations and experimental learnings (e.g., provisioning AWS resources via code).

I believe we should use a cloud-based trigger to initiate conversions/transformations on model-output files submitted to a hub.

Because we're already using AWS, I propose exploring the use of S3 triggers, which can invoke various actions when data is written to or removed from an S3 bucket.
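
As a rough illustration of what that wiring looks like with boto3 (in practice this would live in our infrastructure-as-code), here's a sketch; the bucket name, lambda ARN, and prefix filter are all placeholders:

```python
# Sketch of an S3 "object created" trigger invoking a conversion lambda.
# All names are placeholders; the lambda also needs a resource-based policy
# (lambda add_permission) allowing S3 to invoke it.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="example-hub-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "convert-model-output-to-parquet",
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:example-convert",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "raw/model-output/"}
                        ]
                    }
                },
            }
        ]
    },
)
```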

Will close this and follow up with more specific issues re: experimentation with S3 triggers.