kids-first / kf-portal-etl

:factory: Extract-Transform-Load Pipeline for producing data for the Kids First Data Resource Portal
Apache License 2.0

ETL: Parallelize the loading of studies #77

Open aalex opened 5 years ago

aalex commented 5 years ago

As data release coordinators, we want the ETL to be as fast as possible.

Acceptance criteria

Technical discussion

Parsing all studies could take 10 minutes instead of 3 hours.
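As a rough illustration of where that gain would come from: run sequentially, the total time is the sum of the per-study times; run fully in parallel, it is bounded by the slowest study. The study names and durations below are hypothetical, chosen only to match the 3 h and 10 min figures. They are not measurements.

```python
# Hypothetical per-study durations in minutes (illustrative, not measured).
study_minutes = {f"study_{i}": 10 for i in range(1, 19)}


def sequential_minutes(durations):
    """Total wall-clock time when studies run one after another."""
    return sum(durations.values())


def parallel_minutes(durations):
    """Wall-clock time when every study runs at the same time."""
    return max(durations.values())


print(sequential_minutes(study_minutes))  # 180
print(parallel_minutes(study_minutes))    # 10
```

In practice the parallel figure would also include task start-up time, which this sketch ignores.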

To run them in parallel, we could use AWS Fargate (or, later, AWS Lambda) instead of a series of Docker containers that run on one large EC2 instance. That instance doesn't scale in or out, so its yearly cost is higher.

Changing to AWS Lambda would require more effort than changing to AWS Fargate, because it would mean removing Spark.

aalex commented 5 years ago

Currently, when the task service receives an HTTP POST, it launches a Docker container. Instead, it would launch Fargate tasks, which lets us start many at the same time. The task service is the only one that is not a Fargate service. (Fargate runs Docker containers serverlessly.)
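A minimal sketch of launching one Fargate task per study at the same time, assuming the ETL image is already registered as an ECS task definition. The `STUDY_ID` environment variable and the helper names (`build_run_task_request`, `launch_studies`, `ecs_run_task`) are hypothetical and not taken from this repository. Cluster, task definition, container name and subnets are supplied by the caller. The request shape follows the ECS `RunTask` API as exposed by boto3's `run_task`.

```python
from concurrent.futures import ThreadPoolExecutor


def build_run_task_request(study_id, cluster, task_definition, container_name, subnets):
    """Build the keyword arguments of one ECS RunTask call for one study."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "assignPublicIp": "DISABLED",
            }
        },
        "overrides": {
            "containerOverrides": [
                {
                    "name": container_name,
                    # STUDY_ID is an assumed way of telling the ETL which study to load.
                    "environment": [{"name": "STUDY_ID", "value": study_id}],
                }
            ]
        },
    }


def launch_studies(study_ids, run_task, max_workers=8, **task_settings):
    """Submit one task per study concurrently and return {study_id: response}."""
    requests = {s: build_run_task_request(s, **task_settings) for s in study_ids}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {s: pool.submit(run_task, **req) for s, req in requests.items()}
        return {s: f.result() for s, f in futures.items()}


def ecs_run_task():
    """Return boto3's ECS run_task; imported lazily so the sketch loads without boto3."""
    import boto3

    return boto3.client("ecs").run_task
```

The task service could then call something like `launch_studies(study_ids, ecs_run_task(), cluster=..., task_definition=..., container_name=..., subnets=[...])` from its HTTP POST handler. Passing `run_task` in as a parameter keeps the fan-out logic testable without AWS credentials.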