Closed: gavinmdouglas closed this issue 1 year ago
Dear @gavinmdouglas,
For file formats that do not allow metadata (e.g. FASTA), HyPhy uses a frequency-based heuristic to guess which type of data it is reading. Your dataset has a large proportion of A, C, G, and T amino acids, which are also the four nucleotide letters, so the heuristic breaks down.
A = 0.1047103051746964
C = 0.009619637328615606
D = 0.03409258440218512
E = 0.01831785345717487
F = 0.02355152587350925
G = 0.1734114698510993
H = 0.0119047619047622
I = 0.05900781365177676
K = 0.01337903582485723
L = 0.06999115435647334
M = 0.01337903582485723
N = 0.04072681704260974
P = 0.07795223352498308
Q = 0.02333038478549588
R = 0.01581158779301263
S = 0.05576441102757022
T = 0.1551304732419257
V = 0.06247235736399598
W = 0.009619637328615606
Y = 0.02782692024178385
I am not sure what is going on, but it looks like there may be a lot of motif repeats there, e.g. PTGIT.
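To make the point about composition concrete, here is a minimal sketch that tabulates residue frequencies and the combined share of A, C, G, and T. It is not HyPhy's actual detection code, which is not shown in this thread; the function names are hypothetical. Summing the four values listed above gives about 0.44, i.e. nearly half of all residues are letters that also occur in nucleotide data.

```python
from collections import Counter

def residue_frequencies(sequences):
    """Return {letter: relative frequency}, ignoring gaps and case."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch.isalpha())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in sorted(counts.items())}

def acgt_fraction(sequences):
    """Combined frequency of the letters shared with the nucleotide alphabet."""
    freqs = residue_frequencies(sequences)
    return sum(freqs.get(ch, 0.0) for ch in "ACGT")

# The four relevant frequencies reported in this thread.
thread_freqs = {
    "A": 0.1047103051746964,
    "C": 0.009619637328615606,
    "G": 0.1734114698510993,
    "T": 0.1551304732419257,
}
print(sum(thread_freqs.values()))
```

Any cutoff applied to such a fraction would be a guess on my part, so none is included here.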
In any case, you can use an obscure HyPhy tag to force it to read the FASTA file as protein sequences. Just add the following line to the top of your FASTA file: $BASESET :BASE20
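If many files need the tag, the edit can be scripted. The sketch below only prepends a line to a copy of a file; `prepend_line` and the file names in the example call are hypothetical, and the tag string is copied as it appears in this thread.

```python
from pathlib import Path

def prepend_line(src, dst, line):
    """Write `line`, then the unchanged contents of `src`, to `dst`."""
    text = Path(src).read_text()
    Path(dst).write_text(line.rstrip("\n") + "\n" + text)

# Example call (file names are placeholders):
# prepend_line("protein.msa", "protein_tagged.msa", "$BASESET :BASE20")
```

Writing to a separate output file keeps the original alignment untouched in case the tagged copy needs to be regenerated.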
hyphy /Users/sergei/Development/hyphy-analyses/codon-msa/post-msa.bf --protein-msa /Users/sergei/Downloads/group_2532.fna_protein.msa --nucleotide-sequences /Users/sergei/Downloads/group_2532.fna_nuc.fas --output /Users/sergei/Downloads/group_2532.fna_codon.msa
compress: Yes
code: Universal
Analysis Description
--------------------
Map a protein MSA back onto nucleotide sequences
- __Requirements__: A protein MSA and the corresponding nucleotide alignment
- __Citation__: TBD
- __Written by__: Sergei L Kosakovsky Pond
- __Contact Information__: spond@temple.edu
- __Analysis Version__: 0.01
Load the protein MSA
Load the unaligned in-frame sequences
[UNIQUE SEQUENCES] Retained 10 unique sequences
Best, Sergei
Thanks for clarifying, @spond. Yes, it seems like this is an unusual amino acid sequence, and I can appreciate that problematic edge cases like this are very rare (~0.03% of the roughly 1.8 million bacterial gene alignments I have been processing).
However, given that 'Q' for instance is not a valid IUPAC nucleotide symbol, but is an amino acid symbol, perhaps information like that could be used to improve the heuristic?
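A rough sketch of what I mean, assuming the 20 standard amino-acid codes and the standard IUPAC nucleotide codes. This is only an illustration, not HyPhy code, and `looks_like_protein` is a hypothetical name. The letters that are amino-acid codes but not nucleotide codes work out to E, F, I, L, P, and Q.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")

# Letters that can only come from protein data: E, F, I, L, P, Q.
PROTEIN_ONLY = AMINO_ACIDS - IUPAC_NUCLEOTIDES

def looks_like_protein(sequences):
    """True if any sequence contains a letter that is an amino-acid code
    but not an IUPAC nucleotide code."""
    return any(ch in PROTEIN_ONLY for seq in sequences for ch in seq.upper())

print(sorted(PROTEIN_ONLY))
print(looks_like_protein(["PTGITACGT"]))
```

A check like this can only confirm protein data. An alignment with none of these six letters stays ambiguous and would still need the frequency-based fallback.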
Thanks,
Gavin
Dear @gavinmdouglas,
That's a great suggestion! If you have more alignments that fail the heuristic, could you send them along? I'll see if adjusting it to use the information like you suggest (disjoint characters like Q and I) will improve auto-detection.
Best, Sergei
Hey Sergei,
Absolutely, you can see all of the alignments that failed due to this problem attached!
All the best,
Gavin
Hi there,
Thanks for making this extremely useful tool!
I ran into this error when running this command (following the workflow described here):
The input alignment must contain protein data in call to assert(alignments.AlphabetType(grnJDpsA.alphabet)==utility.getGlobalValue('terms.amino_acid'), error_msg);
I identified that this is because one line of my input protein MSA consists entirely of gap characters. When this line is removed, the command finishes correctly. I am also able to (temporarily) comment out that check in alignments.ReadProteinDataSet to avoid this issue. I am using HYPHY 2.5.36(MP) for Linux on x86_64.
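For anyone hitting the same thing, this is roughly how the workaround can be scripted. It is a minimal sketch that assumes the all-gap line is a whole FASTA record and that gaps are written as "-" or "."; the function names are hypothetical, and it does not touch the matching entry in the nucleotide file.

```python
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)

def drop_all_gap_records(src, dst, gap_chars="-."):
    """Copy src to dst, skipping records whose sequence is empty or consists
    only of gap characters. Returns the headers that were dropped."""
    dropped = []
    with open(dst, "w") as out:
        for header, seq in read_fasta(src):
            if seq.strip(gap_chars) == "":
                dropped.append(header)
                continue
            # Sequences are written unwrapped, one line per record.
            out.write(f">{header}\n{seq}\n")
    return dropped
```

The returned list of dropped headers makes it easy to log which sequences were removed from each alignment.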
I have attached the two input files and the log output.
full_log.txt group_2532.fna_nuc.fas.gz group_2532.fna_protein.msa.gz
Thanks!
Gavin