veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
201 stars 68 forks source link

BUSTED incorrectly finding internal STOP codons #1615

Closed liamfriar closed 1 year ago

liamfriar commented 1 year ago

I ran the following command:

hyphy busted --alignment $msa_file --tree ${tree_dir}/${OG}.pruned_tree.labeled.txt --branches test --output $outfile > $stdoutfile

and received an error with the std output attached OG0000370.busted_out.txt . Using the nucleotide string in that attached doc:

*** PROBLEM WITH SEQUENCE ' AnabaenaCylindricaPCC7122_ANACY_RS00085' (1204 nt long, stop codons shown in capital letters)

atgcgtgtaattttaatgacaggtaaaggtggcgtgggtaaaacctcagtcgctgccgcaactgggcttcgttgtgcagaactcggctaccggacactagttttaagtacagatcctgctcactctctggcagacagctttgatatagaattggggcatgatgctcgacaagtacgtcccaatttgtggggtgcagaactcgacgcgttgcaagaattagaaggtaattggggtgctgtaaaacgttacattacccaagttttgcaagcacggggtttagacgggatacaggcggaagaattggcga-ttttaccaggcatggacgagattttcggcttggtaaggatgaaacgtcactaTGATGAaggggaattTGAcgttttgataatTGAttctgcccctaccggtactgcactacgtctgttaagtttaccagaagttggcggctggtatatgcgacgtttttacaaaccttttcaaaacatctctgtcgcacttcgacctctggtTGAacctattttTAGaccaattgctggtttttctttaccagaTAGagaagtgatggatgcgccttaTGAattttaTGAacaaatagaggcactggaaaaagtattaacTGATAAtactcaaacctcagtccgtctggtcaccaacccagaaaaaatggtgatTAAagaatctctccgcgcccatgcttatctcagtttataTAAtgtagcgacagatttaattgtagcTAAtcgcattattccccaagaagtTAAagatccatttttccaacgttggaaagagaatcaagaacaataccgccaagaaattcaTGATAActttcaccctttacctgtgaaagaagtacctctttattcTGAagaaatgtgcggtttggcagcattagagagactcaaggaaactctttac---ccagaTGAagatccaactcagatttattacaaagaaacaactatTAGagttgtacaagagaaTAAt---caatacagcttagagctttatttacctggcattccTAAaaatcaagttcaactaagTAAaagtggcgaTGAattaaatattactattggTAAtcatcgccgTAAtttggttttaccacaggctttagcagcgctgcaaccatcaggtgcaaagatggaagaTGAttatctaaaaatccgcttttcTGAaattgtaaaggtg------------
Error:
The input alignment must have the number of sites that is divisible by 3 and must not contain stop codons in call to assert(busted.codon_filter.sites*3==busted.codon_data.sites, error_msg);

It looks like the first identified internal stop codon starts after 360 characters which is divisible by 3, so the stop codon would be in-frame. However, one of those previous 360 characters is a -, so it would really be 359 nucleotides preceding, and thus the stop codon is out of frame and not a stop codon. Maybe I am missing something, but it seems like busted is incorrectly counting the - as a nucleotide when determining where codons begin? Furthermore, it counts 1,204 nucleotides which is the length including - symbols, whereas the length without -'s is 1,185.

HYPHY 2.5.48(MP) for Linux on x86_64 installed using conda

spond commented 1 year ago

Dear @liamfriar,

- is a gap character which HyPhy treats as a valid nucleotide alignment charatcer (so will every other program). HyPhy assumes that your sequences are codon-aligned (for codon-aware) alignments, so, the sequence that is labeled as having premature stop codons has "out-of-frame" indels (not multiples of 3).

How did you prepare your alignment?

Best, Sergei

liamfriar commented 1 year ago

Hi @spond , I prepared these alignments using MAFFT, implemented in OrthoFinder using the (https://github.com/davidemms/OrthoFinder). orthofinder -y -d -z -M msa -o $outdir -f $indir

Okay, that makes total sense. I was confused because the stdout output that I copied in my previous message does have some capitalized TGAs, which made me think that was the problem. Looking at that full alignment now, I think I see the problem on my end.

If it will help anyone else in the future: This alignment was generated using some likely pseudogenes and then I pruned the alignment and tree of those pseudogenes before entering them into various hyphy analyses. And that worked fine for most of the alignments, but for some of them, pseudogenes contained frameshift insertions. So, in the case from above, the alignment has a single - inserted into all of the predicted intact genes where the insertion is in the pseudogenes. So, maybe I need to re-run my alignments using only predicted intact genes.

liamfriar commented 1 year ago

It appears that even with the predicted pseudogenes removed, I still end up with some nucleotide MSAs that have lengths not divisible by 3, even though the number of nucleotides is divisible by 3. So, I think I need to look into codon-aware alignment, maybe using PAL2NAL.

liamfriar commented 1 year ago

Using MACSE codon-aware alignment fixed this problem. (https://github.com/ranwez/MACSE_V2_PIPELINES)