ETL: Parallelize the loading of studies

kids-first / kf-portal-etl

:factory: Extract-Transform-Load Pipeline for producing data for the Kids First Data Resource Portal

Apache License 2.0

As data release coordinators, we want the ETL to be as fast as possible.

Acceptance criteria

Parallelize the loading of studies

Technical discussion

If studies were processed in parallel, parsing all of them could take about 10 minutes instead of 3 hours.
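The estimate above can be sanity-checked with a small back-of-the-envelope model. The sketch below is only an illustration: the per-study durations (18 studies of 10 minutes each) are made-up numbers chosen to add up to 3 hours, not measurements from the ETL, and the function names are hypothetical. Sequential time is the sum of the study durations, while parallel time is the load of the busiest worker.

```python
import heapq


def sequential_minutes(durations):
    """Total wall-clock time when studies are loaded one after another."""
    return sum(durations)


def parallel_minutes(durations, workers):
    """Approximate wall-clock time with `workers` parallel tasks.

    Uses a greedy longest-first assignment: each study goes to the
    currently least-loaded worker. The result is the busiest worker's load.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    loads = [0.0] * workers
    heapq.heapify(loads)
    for duration in sorted(durations, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + duration)
    return max(loads)


# Hypothetical workload: 18 studies taking 10 minutes each (assumption).
EXAMPLE_DURATIONS = [10] * 18

if __name__ == "__main__":
    print("sequential:", sequential_minutes(EXAMPLE_DURATIONS), "min")
    print("one task per study:", parallel_minutes(EXAMPLE_DURATIONS, 18), "min")
    print("six tasks:", parallel_minutes(EXAMPLE_DURATIONS, 6), "min")
```

Under these assumed numbers, the wall-clock time with one task per study is bounded by the slowest single study, so in practice the gain depends on how evenly the study sizes are distributed.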

To run them in parallel, we could use AWS Fargate (or later AWS Lambda) instead of a series of Docker containers that run on one large EC2 instance. That instance doesn't scale in or out, so its yearly cost is higher.

Changing to AWS Lambda would require more effort than changing to AWS Fargate, because it would mean removing Spark.
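As a rough sketch of the Fargate option, one task could be launched per study, with the study ID passed to the existing container as an environment variable. The code below only builds the request parameters; it assumes they would later be passed to boto3's ECS client as `ecs_client.run_task(**params)`. The cluster name, task definition, container name, subnet ID, `STUDY_ID` variable, and study IDs are all hypothetical placeholders, not values from this repository.

```python
def build_run_task_params(study_id, cluster, task_definition,
                          container_name, subnets):
    """Build the keyword arguments for one ECS RunTask call on Fargate.

    Every value passed in here is a placeholder for illustration; the real
    names depend on how the ETL image and the AWS account are set up.
    """
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "assignPublicIp": "DISABLED",
            }
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    # Tell the container which single study to load.
                    "environment": [{"name": "STUDY_ID", "value": study_id}],
                }
            ]
        },
    }


def build_all_params(study_ids, **common):
    """One independent task request per study, so studies can run in parallel."""
    return [build_run_task_params(study_id, **common) for study_id in study_ids]


if __name__ == "__main__":
    requests = build_all_params(
        ["STUDY_A", "STUDY_B"],           # hypothetical study IDs
        cluster="etl-cluster",            # hypothetical
        task_definition="kf-portal-etl",  # hypothetical
        container_name="etl",             # hypothetical
        subnets=["subnet-placeholder"],   # hypothetical
    )
    for params in requests:
        print(params["overrides"]["containerOverrides"][0]["environment"])
```

Because each study runs in its own task, the Spark job inside the container can stay as it is, which matches the point above that Fargate is the smaller change compared with Lambda.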

ETL: Parallelize the loading of studies #77
