Protein sequence wrongly identified as DNA

gavinmdouglas commented 1 year ago

Hi there,

Thanks for making this extremely useful tool!

I ran into this error when running this command (following the workflow described here):

hyphy hyphy-analyses/codon-msa/post-msa.bf --protein-msa group_2532.fna_protein.msa \
                                                                                 --nucleotide-sequences group_2532.fna_nuc.fas \
                                                                                 --output test \
                                                                                 &> full_log.txt

The input alignment must contain protein data in call to assert(alignments.AlphabetType(grnJDpsA.alphabet)==utility.getGlobalValue('terms.amino_acid'), error_msg);

I identified that this is because there is one line of my input protein MSA that is all gap characters. When this line is removed the command finishes correctly. I am also able to comment out that check in alignments.ReadProteinDataSet (temporarily) to avoid this issue.

I am using HYPHY 2.5.36(MP) for Linux on x86_64

I have attached the two input files and the log output.

full_log.txt group_2532.fna_nuc.fas.gz group_2532.fna_protein.msa.gz

Thanks!

Gavin

spond commented 1 year ago

Dear @gavinmdouglas,

For file formats that do not allow metadata (e.g. FASTA), HyPhy uses a heuristic to guess which type of data it is. Basically it's a frequency-based heuristic, and your dataset seems to have a large number of ACGT amino-acids, so the heuristic breaks down.

A = 0.1047103051746964
C = 0.009619637328615606
D = 0.03409258440218512
E = 0.01831785345717487
F = 0.02355152587350925
G = 0.1734114698510993
H = 0.0119047619047622
I = 0.05900781365177676
K = 0.01337903582485723
L = 0.06999115435647334
M = 0.01337903582485723
N = 0.04072681704260974
P = 0.07795223352498308
Q = 0.02333038478549588
R = 0.01581158779301263
S = 0.05576441102757022
T = 0.1551304732419257
V = 0.06247235736399598
W = 0.009619637328615606
Y = 0.02782692024178385

Not sure what is going on, but it looks like there may be a lot of motif repeats there, e.g. PTGIT

In any case, you can use an obscure HyPhy tag to force it to read the FASTA file as protein sequences. Just add the following line to the top of your FASTA file: $BASESET :BASE20

hyphy /Users/sergei/Development/hyphy-analyses/codon-msa/post-msa.bf --protein-msa /Users/sergei/Downloads/group_2532.fna_protein.msa  --nucleotide-sequences /Users/sergei/Downloads/group_2532.fna_nuc.fas  --output  /Users/sergei/Downloads/group_2532.fna_codon.msa  
compress: Yes
code: Universal

Analysis Description
--------------------
 Map a protein MSA back onto nucleotide sequences 

- __Requirements__: A protein MSA and the corresponding nucleotide alignment

- __Citation__: TBD

- __Written by__: Sergei L Kosakovsky Pond

- __Contact Information__: spond@temple.edu

- __Analysis Version__: 0.01

Load the protein MSA
Load the unaligned in-frame sequences
[UNIQUE SEQUENCES] Retained 10 unique  sequences

Best, Sergei

group_2532.fna_codon.msa.zip

gavinmdouglas commented 1 year ago

Thanks for clarifying, @spond. Yes it seems like this is an unusual amino acid sequence, and I can appreciate that edge cases like this that cause problems are very rare (~0.03% of bacterial gene alignments I have been processing of around 1.8 million).

However, given that 'Q' for instance is not a valid IUPAC nucleotide symbol, but is an amino acid symbol, perhaps information like that could be used to improve the heuristic?

Thanks,

Gavin

spond commented 1 year ago

Dear @gavinmdouglas,

That's a great suggestion! If you have more alignments that fail the heuristic, could you send them along? I'll see if adjusting it to use the information like you suggest (disjoint characters like Q and I) will improve auto-detection.

Best, Sergei

gavinmdouglas commented 1 year ago

Hey Sergei,

Absolutely, you can see all of the alignments that failed due to this problem attached!

All the best,

Gavin

failed_alignments.tar.gz

veg / hyphy

Protein sequence wrongly identified as DNA #1574