nextstrain / nextclade

Viral genome alignment, mutation calling, clade assignment, quality checks and phylogenetic placement
https://clades.nextstrain.org
MIT License

Frame shift `ORF7a:empty range` is confusing #1042

Open corneliusroemer opened 1 year ago

corneliusroemer commented 1 year ago

I remember we talked about this on Slack: we sometimes write `ORF7a:empty range` to the TSV output, which is not ideal because it breaks assumptions about what that field contains.

Maybe we can at least document what it means, and ideally fix it. I think this occurs when the frame shift falls within the stop codon?

Here are a few sample GenBank accessions:

OV623283.1
OV731935.1
OW351812.1
OU254502.1

empty_range.fasta.txt

ivan-aksamentov commented 1 year ago

Potentially related Slack threads:

Results of Nextclade run are attached

nextclade.json.txt nextclade.tsv.txt

ivan-aksamentov commented 1 year ago

```bash
jq '.results[0].frameShifts' nextclade.json | prettyjson
```

```yml
- geneName: ORF7a
  nucRel:
    begin: 365
    end: 366
  nucAbs:
    begin: 27758
    end: 27759
  codon:
    begin: 122
    end: 122
  gapsLeading:
    codon:
      begin: 101
      end: 122
  gapsTrailing:
    codon:
      begin: 122
      end: 122
  codonMask:
    begin: 101
    end: 122
```

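The `codon` range in this entry has `begin` equal to `end`. Assuming these are 0-based half-open ranges (the one-nucleotide `nucRel` span written as 365 to 366 suggests so), such a range contains no codons, which is presumably where the `empty range` label comes from. A minimal Python sketch of that reading; `range_len` and `format_range` are hypothetical helpers for illustration, not Nextclade's actual formatter:

```python
def range_len(begin: int, end: int) -> int:
    """Number of items in a 0-based half-open range [begin, end)."""
    return max(0, end - begin)


def format_range(begin: int, end: int) -> str:
    """Hypothetical formatter: 1-based closed notation, or a label when empty."""
    n = range_len(begin, end)
    if n == 0:
        return "empty range"
    if n == 1:
        return str(begin + 1)
    return f"{begin + 1}-{end}"


# Values copied from the frameShifts entry in this thread.
print(format_range(365, 366))  # nucRel, one nucleotide: 366
print(format_range(122, 122))  # codon, begin == end: empty range
print(format_range(101, 122))  # codonMask: 102-122
```

Under this assumption the nucleotide range is non-empty while the codon range is empty, so the frame shift covers less than one whole codon.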
```bash
jq '[ .results[0].aaDeletions[] | select( .gene == "ORF7a" ) ]' nextclade.json | prettyjson
```

```yml
- gene: ORF7a
  refAA: F
  codon: 100
  codonNucRange:
    begin: 27693
    end: 27696
  refContext: ATTTTTCTT
  queryContext: ATTT-----
  contextNucRange:
    begin: 27690
    end: 27699
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: L
  codon: 101
  codonNucRange:
    begin: 27696
    end: 27699
  refContext: TTTCTTATT
  queryContext: T--------
  contextNucRange:
    begin: 27693
    end: 27702
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: I
  codon: 102
  codonNucRange:
    begin: 27699
    end: 27702
  refContext: CTTATTGTT
  queryContext: ---------
  contextNucRange:
    begin: 27696
    end: 27705
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: V
  codon: 103
  codonNucRange:
    begin: 27702
    end: 27705
  refContext: ATTGTTGCG
  queryContext: ---------
  contextNucRange:
    begin: 27699
    end: 27708
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: A
  codon: 104
  codonNucRange:
    begin: 27705
    end: 27708
  refContext: GTTGCGGCA
  queryContext: ---------
  contextNucRange:
    begin: 27702
    end: 27711
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: A
  codon: 105
  codonNucRange:
    begin: 27708
    end: 27711
  refContext: GCGGCAATA
  queryContext: ---------
  contextNucRange:
    begin: 27705
    end: 27714
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: I
  codon: 106
  codonNucRange:
    begin: 27711
    end: 27714
  refContext: GCAATAGTG
  queryContext: ---------
  contextNucRange:
    begin: 27708
    end: 27717
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: V
  codon: 107
  codonNucRange:
    begin: 27714
    end: 27717
  refContext: ATAGTGTTT
  queryContext: ---------
  contextNucRange:
    begin: 27711
    end: 27720
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: F
  codon: 108
  codonNucRange:
    begin: 27717
    end: 27720
  refContext: GTGTTTATA
  queryContext: ---------
  contextNucRange:
    begin: 27714
    end: 27723
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: I
  codon: 109
  codonNucRange:
    begin: 27720
    end: 27723
  refContext: TTTATAACA
  queryContext: ---------
  contextNucRange:
    begin: 27717
    end: 27726
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: T
  codon: 110
  codonNucRange:
    begin: 27723
    end: 27726
  refContext: ATAACACTT
  queryContext: ---------
  contextNucRange:
    begin: 27720
    end: 27729
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: L
  codon: 111
  codonNucRange:
    begin: 27726
    end: 27729
  refContext: ACACTTTGC
  queryContext: ---------
  contextNucRange:
    begin: 27723
    end: 27732
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: C
  codon: 112
  codonNucRange:
    begin: 27729
    end: 27732
  refContext: CTTTGCTTC
  queryContext: ---------
  contextNucRange:
    begin: 27726
    end: 27735
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: F
  codon: 113
  codonNucRange:
    begin: 27732
    end: 27735
  refContext: TGCTTCACA
  queryContext: ---------
  contextNucRange:
    begin: 27729
    end: 27738
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: T
  codon: 114
  codonNucRange:
    begin: 27735
    end: 27738
  refContext: TTCACACTC
  queryContext: ---------
  contextNucRange:
    begin: 27732
    end: 27741
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: L
  codon: 115
  codonNucRange:
    begin: 27738
    end: 27741
  refContext: ACACTCAAA
  queryContext: ---------
  contextNucRange:
    begin: 27735
    end: 27744
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: K
  codon: 116
  codonNucRange:
    begin: 27741
    end: 27744
  refContext: CTCAAAAGA
  queryContext: ---------
  contextNucRange:
    begin: 27738
    end: 27747
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: R
  codon: 117
  codonNucRange:
    begin: 27744
    end: 27747
  refContext: AAAAGAAAG
  queryContext: ---------
  contextNucRange:
    begin: 27741
    end: 27750
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: K
  codon: 118
  codonNucRange:
    begin: 27747
    end: 27750
  refContext: AGAAAGACA
  queryContext: ---------
  contextNucRange:
    begin: 27744
    end: 27753
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: T
  codon: 119
  codonNucRange:
    begin: 27750
    end: 27753
  refContext: AAGACAGAA
  queryContext: ---------
  contextNucRange:
    begin: 27747
    end: 27756
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: E
  codon: 120
  codonNucRange:
    begin: 27753
    end: 27756
  refContext: ACAGAATGA
  queryContext: --------A
  contextNucRange:
    begin: 27750
    end: 27759
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
- gene: ORF7a
  refAA: *
  codon: 121
  codonNucRange:
    begin: 27756
    end: 27759
  refContext: GAATGATTG
  queryContext: -----ATTG
  contextNucRange:
    begin: 27753
    end: 27762
  nucSubstitutions: (empty array)
  nucDeletions:
    - start: 27694
      length: 64
```

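A quick arithmetic check on the numbers in this output, assuming 0-based half-open coordinates: the 64-nucleotide deletion is not a multiple of 3, and it ends exactly where the reported frame shift's `nucAbs` range begins, one nucleotide before the end of the stop codon. The snippet only restates values from the JSON; reading this as "the frame-shifted region is a single nucleotide, shorter than one codon" is an interpretation, not a confirmed diagnosis.

```python
# Values copied from the jq output in this thread (assumed 0-based, half-open).
deletion_start, deletion_length = 27694, 64
stop_codon = (27756, 27759)           # codonNucRange of codon 121 (refAA '*')
frame_shift_nuc_abs = (27758, 27759)  # nucAbs of the reported frame shift

deletion_end = deletion_start + deletion_length  # exclusive end of the deletion

print("deletion end:", deletion_end)
print("deletion length % 3:", deletion_length % 3)
print("deletion ends inside stop codon:", stop_codon[0] < deletion_end < stop_codon[1])
print("frame shift starts at deletion end:", frame_shift_nuc_abs[0] == deletion_end)
print("nucleotides in frame shift:", frame_shift_nuc_abs[1] - frame_shift_nuc_abs[0])
```

This is consistent with the `queryContext` of codons 120 and 121, where only the final `A` of the stop codon survives the deletion.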
```bash
jq '[ .results[0].qc.stopCodons ]' nextclade.json | prettyjson
```

```yml
- score: 0
  status: good
  stopCodons: (empty array)
  totalStopCodons: 0
  stopCodonsIgnored: (empty array)
  totalStopCodonsIgnored: 0
```

ivan-aksamentov commented 1 year ago

I'll leave the scientific analysis and documenting it to you, but let's see what can be improved in terms of the algorithm or user interface.
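Until the behavior is documented or changed, one way to find affected sequences is to scan the JSON output for frame shifts whose codon range is empty. The sketch below is only an illustration: it relies on the field names visible in the jq output in this thread (`results`, `frameShifts`, `geneName`, `codon.begin`, `codon.end`), assumes half-open ranges, and `find_empty_frame_shifts` is a hypothetical helper, not part of Nextclade.

```python
import json


def find_empty_frame_shifts(nextclade_output: dict) -> list[tuple[int, str]]:
    """Return (result index, gene name) for frame shifts with an empty codon range."""
    hits = []
    for i, result in enumerate(nextclade_output.get("results", [])):
        for fs in result.get("frameShifts", []):
            codon = fs["codon"]
            # Half-open range [begin, end): empty when begin >= end.
            if codon["begin"] >= codon["end"]:
                hits.append((i, fs["geneName"]))
    return hits


# Trimmed-down stand-in for nextclade.json, using values from this thread.
sample = json.loads("""
{
  "results": [
    {
      "frameShifts": [
        {
          "geneName": "ORF7a",
          "nucAbs": {"begin": 27758, "end": 27759},
          "codon": {"begin": 122, "end": 122}
        }
      ]
    }
  ]
}
""")

print(find_empty_frame_shifts(sample))  # [(0, 'ORF7a')]
```

To run it on a real file, replace `sample` with `json.load(open("nextclade.json"))`.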