veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org

Error Message - Data Format Issues #1567

Closed: clairelavergne closed this issue 1 year ago

clairelavergne commented 1 year ago

Hi,

HyPhy (both the web server on Datamonkey and the GUI version) keeps rejecting my text files, saying that there's a stop codon, but it accepts the FASTA file of the same alignments. I can't figure out why copy-pasting my alignments from FASTA into the text file would cause this issue. I'm attaching my file (drd4combo.txt) and the error log I got (log.txt). I'm following the guidelines posted here to format my text file. Any insight would be greatly appreciated!

spond commented 1 year ago

Dear @clairelavergne,

Interesting example! One of your sequences has an indel whose length is not a multiple of 3 (most likely an alignment artifact), as shown below:

[image: alignment excerpt in which one TGT codon is split into TG- and --T]

The TGT gets split into TG- and --T with intervening gaps. TG- is interpreted by HyPhy as TGN, i.e. TG{A,C,G,T}. One of these, TGA, is a stop codon, hence the stop codon message you are seeing.
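To make the ambiguity concrete, here is a minimal Python sketch that expands a partial codon into every codon it could resolve to and reports which of those are stops. It assumes the universal genetic code and treats a gap like N, as described above; the function names are made up for illustration, and this is not HyPhy's actual implementation.

```python
from itertools import product

# Assumption: universal genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}
NUCLEOTIDES = "ACGT"


def resolutions(codon):
    """Expand each gap or N in a codon to all four nucleotides."""
    options = [NUCLEOTIDES if ch in "-N" else ch for ch in codon.upper()]
    return ["".join(p) for p in product(*options)]


def possible_stops(codon):
    """Return the stop codons that a (possibly partial) codon can resolve to."""
    return sorted(set(resolutions(codon)) & STOP_CODONS)


print(resolutions("TG-"))     # ['TGA', 'TGC', 'TGG', 'TGT']
print(possible_stops("TG-"))  # ['TGA']
print(possible_stops("--T"))  # []
```

Only the TG- half of the split codon can resolve to a stop; --T cannot, because no stop codon in the universal code ends in T.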

Generally, these "partial" codons are abiological and arise entirely from alignment issues (e.g. aligning with nucleotide-level tools that are not codon-aware). Our (very simple) codon-aware MSA workflow, for example, generates "fused" codons (as you would expect).

[image: the same region in the codon-aware alignment, with the codon fused]

I attach the MSA for your reference.

I am on the fence about whether the current HyPhy behavior is a feature or a bug. On the one hand, TG- is not a stop codon (it can be resolved to one, but also to non-stop codons). On the other hand, TG- probably should not occur in a "proper" codon-aware in-frame alignment, so it serves as a diagnostic of a potential data quality problem. I think I'll update the language and turn it into a WARNING, rather than an error (in the next release).
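In the meantime, the same diagnostic can be run before uploading. The sketch below scans an in-frame FASTA alignment and lists codons that contain one or two gaps, i.e. the "partial" codons discussed above. It assumes the reading frame starts at the first column; the helper names and the two toy sequences are hypothetical and are not taken from the attached files.

```python
def parse_fasta(text):
    """Parse FASTA text into a {name: sequence} dictionary."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def partial_codons(seq):
    """Return (codon_number, codon) for codons with one or two gaps.

    Assumes the frame starts at column 1; codon numbers are 1-based.
    """
    hits = []
    for start in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[start:start + 3]
        if 0 < codon.count("-") < 3:
            hits.append((start // 3 + 1, codon))
    return hits


# Toy alignment (hypothetical data): seq2 has TGT split into TG- and --T.
example = """>seq1
ATGTGTGCCAAA
>seq2
ATGTG---TAAA
"""

for name, seq in parse_fasta(example).items():
    print(name, partial_codons(seq))
# seq1 []
# seq2 [(2, 'TG-'), (3, '--T')]
```

Any hit is worth inspecting by eye: it usually means the alignment was built at the nucleotide level and needs to be redone with a codon-aware tool.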

Best, Sergei

drd4combo.msa.txt

clairelavergne commented 1 year ago

Hi @spond,

Thank you so much for your reply! I hadn't realized how HyPhy interpreted partial codons. I definitely agree that users should be made aware if they have this kind of issue in their data - I thought I'd taken care of all of the alignment artifacts, but clearly not!

Kind regards, Claire