nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and Genbank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License

Explore parallelizing `transform-gisaid` #448

Open joverlee521 opened 3 months ago

joverlee521 commented 3 months ago

Originally proposed by @corneliusroemer in https://github.com/nextstrain/ncov-ingest/issues/446#issuecomment-2164897079

How hard is it to parallelize `transform-gisaid`, which takes ~4h?

Since it operates on NDJSON line by line, it might be parallelizable, at least in part.

The obvious way to do so would be to add a split rule that divides the input file into N chunks, run the transform on each chunk, and merge the results back.
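
To make the split / transform / merge idea concrete, here is a minimal Python sketch. It is not the actual `transform-gisaid` code: `transform_record`, `batched`, `parallel_transform`, and `run` are hypothetical names, and the sketch assumes every NDJSON record can be transformed independently of the others. Any step that needs to see the whole dataset at once (for example, cross-record deduplication, if the real transform does that) would have to stay outside the parallel part.

```python
import json
from concurrent.futures import Executor, ProcessPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List


def transform_record(record: dict) -> dict:
    """Hypothetical stand-in for the per-record work; here it only trims the strain name."""
    return {**record, "strain": record.get("strain", "").strip()}


def transform_lines(lines: List[str]) -> List[str]:
    """Transform one chunk of NDJSON lines, skipping blank lines."""
    return [
        json.dumps(transform_record(json.loads(line)))
        for line in lines
        if line.strip()
    ]


def batched(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Split step: yield consecutive chunks of at most `size` lines."""
    it = iter(lines)
    while batch := list(islice(it, size)):
        yield batch


def parallel_transform(
    lines: Iterable[str], executor: Executor, batch_size: int = 1000
) -> Iterator[str]:
    """Transform chunks on the executor and merge results in input order."""
    # Executor.map returns results in the order the chunks were submitted,
    # so the merged output keeps the original record order.
    for out in executor.map(transform_lines, batched(lines, batch_size)):
        yield from out


def run(in_path: str, out_path: str, workers: int = 4) -> None:
    """Read NDJSON from in_path, transform in worker processes, write to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst, \
            ProcessPoolExecutor(max_workers=workers) as pool:
        for line in parallel_transform(src, pool):
            dst.write(line + "\n")
```

One caveat with this in-process variant: in the Python versions I am familiar with, `Executor.map` consumes its whole input iterable when called, so every chunk is held in memory at once. For a very large input, writing the chunks to separate files and letting the workflow manager schedule one transform job per chunk, as the split-rule proposal suggests, may be the better fit.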