nextstrain / dengue

Nextstrain build for dengue virus
https://nextstrain.org/dengue

Add gene coverage columns during ingest workflow #36

Status: Closed (j23414 closed 3 months ago)

j23414 commented 4 months ago

Description of proposed changes

pathogen-repo-guide (6)

Several approaches were explored for adding {gene}_coverage columns during the ingest workflow (as opposed to the phylogenetic workflow). The approaches were summarized by @joverlee521 and @jameshadfield and are copied here as context for this PR, with comments from @j23414 added in square brackets:

  1. Generalize RSV's extend-metadata to take gene coordinates as input for calculating gene coverage. This requires the gene coordinates to be maintained in the config YAML. It follows the current pattern of outputting gene coverage columns that can be used for filtering. [Opened an issue: https://github.com/nextstrain/rsv/issues/57 ~ @j23414]
  2. Use Nextclade's failedCdses column to determine whether the E gene has coverage. This outputs whether the E gene is included as a True/False value that can be used for filtering. [I went ahead and appended the failedCdses column from Nextclade, so we can still use this method for other genes ~ @j23414]
  3. We briefly discussed whether Nextclade could output {CDS}_coverage columns in addition to the full-genome coverage column. This would allow the workflow to filter on the Nextclade columns without having to maintain the gene coordinates or parse the dataset GFF file to get them.
  4. Use the output (translated) CDS alignments from Nextclade to add columns to the metadata with amino acid length or similar. This could then be used via augur filter --query .... This approach would be made obsolete by (3), but it's pretty easy to do right now. I [@jameshadfield] think it's preferable to (1), both for compound CDSs and for cases where a genome alignment extends past both sides of the CDS but has very little coverage over the CDS itself. [This PR follows approach 4 ~ @j23414]
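For readers unfamiliar with approach 4, here is a minimal Python sketch of how a per-gene coverage value could be derived from a translated CDS alignment. It is not the code in this PR. The helper names (`read_fasta`, `cds_coverage`, `coverage_column`), the choice of `-` and `X` as "missing" symbols, and the definition of coverage as the fraction of resolved residues over the aligned length are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Assumption: "-" marks an alignment gap and "X" an unresolved amino acid.
MISSING = set("-X")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def cds_coverage(aligned: str) -> float:
    """Fraction of aligned positions holding a resolved amino acid."""
    if not aligned:
        return 0.0
    resolved = sum(1 for aa in aligned.upper() if aa not in MISSING)
    return round(resolved / len(aligned), 3)


def coverage_column(lines: Iterable[str]) -> Dict[str, float]:
    """Map each sequence name to its coverage over one translated CDS."""
    return {name: cds_coverage(seq) for name, seq in read_fasta(lines)}


# Hypothetical two-sequence alignment of a single CDS.
example = [">seqA", "MRCIG", "ISNRD", ">seqB", "--XXG", "ISNRD"]
print(coverage_column(example))
```

Because the value is computed from the CDS alignment itself, a genome that spans the CDS coordinates but is mostly unresolved inside them scores low, which is the case raised in (4) above.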

New Metadata

To view the new {gene}_coverage columns (for example, E_coverage), download the updated metadata:

wget https://data.nextstrain.org/files/workflows/dengue/metadata_all.tsv.zst
zstd -d metadata_all.tsv.zst 

The new {gene}_coverage columns are the rightmost columns.
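As a rough illustration of how the new columns might be used once the file is decompressed, the sketch below keeps rows whose E_coverage meets a minimum value. The 0.9 threshold, the `filter_by_coverage` helper, and the `accession` column in the sample data are hypothetical; only the E_coverage column name comes from this PR. Rows with an empty or non-numeric value are skipped, which is also an assumption about how missing coverage should be treated.

```python
import csv
from typing import Dict, Iterable, List


def filter_by_coverage(
    tsv_lines: Iterable[str],
    column: str = "E_coverage",
    minimum: float = 0.9,  # hypothetical threshold, not taken from the PR
) -> List[Dict[str, str]]:
    """Return metadata rows whose coverage column is at least `minimum`."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    kept = []
    for row in reader:
        value = (row.get(column) or "").strip()
        try:
            coverage = float(value)
        except ValueError:
            # Empty or non-numeric coverage: treat as not covered.
            continue
        if coverage >= minimum:
            kept.append(row)
    return kept


# Tiny in-memory stand-in for metadata_all.tsv (column names are illustrative).
sample = "accession\tE_coverage\nA\t0.95\nB\t0.2\nC\t\n".splitlines()
print([row["accession"] for row in filter_by_coverage(sample)])

# With the real file, something like this should work:
# with open("metadata_all.tsv", newline="") as handle:
#     rows = filter_by_coverage(handle)
```

Inside a workflow, the same selection would more likely be expressed through augur filter --query, as mentioned in approach 4.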

j23414 commented 3 months ago

Thanks @joverlee521! This PR is ready for the next round of review.