Closed liamfriar closed 1 year ago
Dear @liamfriar,
-
is a gap character which HyPhy treats as a valid nucleotide alignment charatcer (so will every other program). HyPhy assumes that your sequences are codon-aligned (for codon-aware) alignments, so, the sequence that is labeled as having premature stop codons has "out-of-frame" indels (not multiples of 3).
How did you prepare your alignment?
Best, Sergei
Hi @spond , I prepared these alignments using MAFFT
, implemented in OrthoFinder
using the (https://github.com/davidemms/OrthoFinder).
orthofinder -y -d -z -M msa -o $outdir -f $indir
Okay, that makes total sense. I was confused because the stdout
output that I copied in my previous message does have some capitalized TGA
s, which made me think that was the problem. Looking at that full alignment now, I think I see the problem on my end.
If it will help anyone else in the future: This alignment was generated using some likely pseudogenes and then I pruned the alignment and tree of those pseudogenes before entering them into various hyphy
analyses. And that worked fine for most of the alignments, but for some of them, pseudogenes contained frameshift insertions. So, in the case from above, the alignment has a single -
inserted into all of the predicted intact genes where the insertion is in the pseudogenes. So, maybe I need to re-run my alignments using only predicted intact genes.
It appears that even with the predicted pseudogenes removed, I still end up with some nucleotide MSAs that have lengths not divisible by 3, even though the number of nucleotides is divisible by 3. So, I think I need to look into codon-aware alignment, maybe using PAL2NAL
.
Using MACSE codon-aware alignment fixed this problem. (https://github.com/ranwez/MACSE_V2_PIPELINES)
I ran the following command:
hyphy busted --alignment $msa_file --tree ${tree_dir}/${OG}.pruned_tree.labeled.txt --branches test --output $outfile > $stdoutfile
and received an error with the std output attached OG0000370.busted_out.txt . Using the nucleotide string in that attached doc:
It looks like the first identified internal stop codon starts after 360 characters which is divisible by 3, so the stop codon would be in-frame. However, one of those previous 360 characters is a
-
, so it would really be 359 nucleotides preceding, and thus the stop codon is out of frame and not a stop codon. Maybe I am missing something, but it seems like busted is incorrectly counting the-
as a nucleotide when determining where codons begin? Furthermore, it counts 1,204 nucleotides which is the length including-
symbols, whereas the length without-
's is 1,185.HYPHY 2.5.48(MP) for Linux on x86_64
installed using conda