hubverse-org / hubverse-cloud

Test hub for S3 data submission and storage

Test the hubverse-aws-upload workflow against a large volume of data #51

Closed: bsweger closed this issue 4 months ago

bsweger commented 6 months ago

Our GitHub Actions workflow that sends data to S3 has been working well in our tests with small data volumes. However, it would be useful to get some rough timing estimates for large volumes of data. (One reason: @lmullany is working to convert some archived hubs to hubverse format, and it would be great to make that data available on S3.)

bsweger commented 4 months ago

Recording some numbers from the recent test of getting a forked version of the CDC's FluSight to the cloud.

Forked repo: https://github.com/bsweger/FluSight-forecast-hub/tree/main
S3 bucket: bsweger-flusight-forecast

(Both of these will disappear once we're done testing.)

[Table: number of model-output files, comparing the GitHub repo against the converted copies on S3]

Before diving into a more detailed integrity check, I'll run down why we're missing a file in the converted model-output folder (and we need an issue to track getting alerts out to the team when the transform lambda fails).
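As a first pass, a rough file-count comparison along these lines can flag anything the transform lambda dropped. This is a sketch, not the exact check: it assumes the arrow and fs packages, a hypothetical local clone path, and that converted files keep their base names with a .parquet extension.

```r
library(arrow)
library(fs)

# Local clone path is hypothetical; adjust to wherever the forked hub lives.
local_files <- path_ext_remove(path_file(
  dir_ls("FluSight-forecast-hub/model-output", recurse = TRUE, type = "file")
))

# Transformed copies live under model-output/ in the test bucket.
bucket <- s3_bucket("bsweger-flusight-forecast")
cloud_files <- path_ext_remove(path_file(
  bucket$ls("model-output", recursive = TRUE)
))

# Base names present locally but absent from the converted S3 copies
# (extensions stripped, since csv inputs become parquet outputs)
setdiff(local_files, cloud_files)
```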

bsweger commented 4 months ago

The "missing" file is actually a README.md that wasn't converted to parquet. Granted, we should decide how we want to handle unsupported file types, but the end result, at least from a file-count perspective, is as expected.

Link to relevant CloudWatch log: https://us-east-1.console.aws.amazon.com/cloudwatch/home?region=us-east-1#logsV2:log-groups/log-group/$252Faws$252Flambda$252Fhubverse-transform-model-output/log-events//2024$252F05$252F15$252F$255B$2524LATEST$255Db02cd7a7365c4222a42644eb82e44f4a$3Fstart$3D1715735088996$26refEventId$3D38262171050108132906101381462495187801418062422382411780

bsweger commented 4 months ago

This exercise surfaced two hubData issues that we should resolve:

To run some integrity checks comparing a hub's GitHub-based model-output files with the transformed versions of those files, I worked around the above issues by:

  1. Updating the test hub's admin.json config to add .parquet as a valid file format
  2. Manually removing the parquet file with the invalid date format from the hub's S3 bucket (bsweger-flusight-forecast/model-output/FluSight-baseline_cat/2024-03-02-FluSight-baseline_cat.parquet)

Below is the R script to run some integrity checks: test_cloud_hub_data.txt
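In outline, the checks look something like the sketch below. This is simplified, not the script verbatim, and it assumes hubData::connect_hub() accepts both a local hub path and an arrow S3 bucket (the SubTreeFileSystem line in the console output further down comes from the S3 connection).

```r
library(dplyr)
library(hubData)

# Connect to the local clone and to the transformed data on S3.
local_hub <- connect_hub("/Users/rsweger/code/FluSight-forecast-hub")
cloud_hub <- connect_hub(arrow::s3_bucket("bsweger-flusight-forecast"))

local_data <- local_hub |> collect()
cloud_data <- cloud_hub |> collect()

print("Comparing local and cloud row counts")
print(nrow(local_data) == nrow(cloud_data))

print("Comparing local and cloud row counts by model_id")
count_by_model <- function(d) d |> count(model_id) |> arrange(model_id)
print(isTRUE(all.equal(count_by_model(local_data), count_by_model(cloud_data))))

print("Comparing local and cloud schemas")
# arrow Schema objects support equality comparison via ==
print(local_hub$schema == cloud_hub$schema)
```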

Console output from running the attached script:

```
Rscript test_cloud_hub_data.R
Warning message:
!  The following potentially invalid model output file not opened successfully.
/Users/rsweger/code/FluSight-forecast-hub/model-output/FluSight-baseline_cat/2024-03-02-FluSight-baseline_cat.csv
SubTreeFileSystem: s3://bsweger-flusight-forecast/
[1] "Comparing local and cloud row counts"
[1] TRUE
[1] "Comparing local and cloud row counts by model_id"
[1] TRUE
[1] "Comparing local and cloud schemas"
[1] TRUE
```

bsweger commented 4 months ago

AWS handled the "bursty" lambda function invocations successfully, though there was some throttling due to what appears to be a concurrency limit of 10. The images below show the default content of the "Monitoring" tab in the AWS Lambda console, covering lambda activity between 2024-05-15 01:02:00 and 2024-05-15 01:13:00 UTC, which is when the incoming test model-output files emitted the S3 events that trigger the lambda function.

I'm not an expert in these charts, but I'm adding some additional context after the images:

[Images: AWS Lambda console monitoring charts showing invocations and throttles during the test window]

The concurrency threshold of 10 for our lambda function may be because our AWS account is new: https://benellis.cloud/my-lambda-concurrency-applied-quota-is-only-10-but-why
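For anyone who wants to check the applied account-level limit programmatically rather than clicking through the console, something like this should work with the paws AWS SDK for R, assuming credentials are configured in the environment:

```r
library(paws)

# GetAccountSettings reports account-wide Lambda limits, including the
# total concurrent executions quota.
svc <- lambda()
settings <- svc$get_account_settings()

# For a new account this may report 10 rather than the documented
# default of 1,000.
settings$AccountLimit$ConcurrentExecutions
```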

bsweger commented 4 months ago

Gonna move this to done, now that we've onboarded the CDC's FluSight repo to the cloud. The archived FluSight data will have far more volume, but we can open new tickets if getting it onto the cloud surfaces additional issues.