nextstrain / ncov-ingest

A pipeline that ingests SARS-CoV-2 (i.e. nCoV) data from GISAID and GenBank, transforms it, stores it on S3, and triggers Nextstrain nCoV rebuilds.
MIT License

Remove unused variables and refactor `GENE_LIST` #437

Closed joverlee521 closed 9 months ago

joverlee521 commented 9 months ago

Previous PRs noted that the `GENES` and `GENES_SPACE_DELIMITED` variables are not needed¹ or used in the workflow,² so this PR removes them and refactors `GENE_LIST` into a hardcoded list of genes.

If we want to ensure that we do not miss any genes from the Nextclade dataset, we could parse the gene names out of the dataset's genome_annotation.gff file. However, I think that would over-complicate the Snakemake workflow, so I'm leaving the list hardcoded.
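For reference, the parsing alternative mentioned above could look roughly like the sketch below. It is not part of this PR. It assumes the annotation is a standard tab-separated GFF3 file, that gene rows have the feature type `gene`, and that the gene name sits in one of the attributes `gene_name`, `gene`, or `Name`. Those attribute keys are guesses and should be checked against the actual Nextclade dataset file. The function names `parse_gene_names` and `missing_genes` are hypothetical.

```python
from urllib.parse import unquote


def parse_gene_names(gff_text, name_keys=("gene_name", "gene", "Name")):
    """Return the sorted gene names found in GFF3 text.

    Only rows whose feature type (column 3) is "gene" are considered.
    The attribute keys in `name_keys` are assumptions, tried in order.
    """
    genes = set()
    for line in gff_text.splitlines():
        line = line.strip()
        # Skip blank lines, directives, and comments.
        if not line or line.startswith("#"):
            continue
        columns = line.split("\t")
        # A GFF3 feature row has exactly nine tab-separated columns.
        if len(columns) != 9 or columns[2] != "gene":
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attributes = {}
        for pair in columns[8].split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                attributes[key.strip()] = unquote(value.strip())
        for key in name_keys:
            if key in attributes:
                genes.add(attributes[key])
                break
    return sorted(genes)


def missing_genes(hardcoded, gff_text):
    """Return genes present in the annotation but absent from the hardcoded list."""
    return sorted(set(parse_gene_names(gff_text)) - set(hardcoded))
```

A check like `missing_genes(GENE_LIST, gff_text)` could run as a one-off sanity test outside the workflow, which would keep the Snakemake rules themselves simple while still catching a gene that is added to the dataset later.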

¹ https://github.com/nextstrain/ncov-ingest/pull/372#discussion_r1046020969
² https://github.com/nextstrain/ncov-ingest/pull/435#discussion_r1496332575

joverlee521 commented 9 months ago

Tested locally by running the debug config:

nextstrain build \
    --envdir ~/Repos/env.d/aws/ \
    --image nextstrain/ncov-ingest \
    . \
        --configfile config/debug_sample_genbank.yaml \
        --config s3_dst=s3://nextstrain-data/files/ncov/open/branch/update-gene-list

All translation_*.fasta.zst files have been uploaded to s3://nextstrain-data/files/ncov/open/branch/update-gene-list