Open JonStargaryen opened 2 years ago
It's a new week and we serve new (RNA) data. Now it's working again for our production data.
However, the example above is still broken. Is there some auto-detect step that determines the database type from the provided sequences? The only thing that changed is the sequence file.
I think this sequence broke the nucl-prot detection heuristic:
>5DO4_3
GGGAAXAAAGXUGAAGUXXUUAXXX
MMseqs2 assumes that the unknown residue for nucleotides is N and not X. The heuristic for distinguishing between the two is as follows:
ATGCUN
if all of these are nucleotide sequences, declare the database to be a nucleotide database.
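To make the failure concrete, here is a minimal Python sketch of the rule as stated above. It is a simplified illustration, not MMseqs2's actual implementation: the function names are made up for this example, and how many sequences MMseqs2 really inspects is not specified in this thread.

```python
# Simplified illustration of the rule described in this thread.
# This is NOT MMseqs2's code; all names here are hypothetical.
NUCLEOTIDE_ALPHABET = set("ATGCUN")


def looks_like_nucleotide(sequence: str) -> bool:
    """Return True if every residue is one of A, T, G, C, U or N."""
    residues = sequence.strip().upper()
    return bool(residues) and set(residues) <= NUCLEOTIDE_ALPHABET


def guess_database_type(sequences) -> str:
    """Declare a nucleotide database only if all inspected sequences pass."""
    sequences = list(sequences)
    if sequences and all(looks_like_nucleotide(s) for s in sequences):
        return "nucleotide"
    return "amino acid"


if __name__ == "__main__":
    # The sequence from this thread uses X for unknown residues.
    print(guess_database_type(["GGGAAXAAAGXUGAAGUXXUUAXXX"]))  # amino acid
    # The same sequence with N instead of X (shown only for comparison).
    print(guess_database_type(["GGGAANAAAGNUGAAGUNNUUANNN"]))  # nucleotide
```

Under this reading, a single X in one of the inspected sequences is enough to tip the whole database to amino acid.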
You can disable the heuristic and fix the database to be a nucleotide db by passing --dbtype 2 to easy-search (in this case the search entry in the .params file).
Thanks for the pointer to use --dbtype 2. I've tried it (and a combination of other options & software versions) but no luck. The behavior doesn't change, the database is still indexed as amino acid sequences and search is still failing.
For now, I can try to sort the sequence file in a fashion that prevents problematic sequences from appearing at the top.
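A rough sketch of that sorting workaround in Python follows. It assumes the detection step only looks at the leading records of the file, which this thread suggests but does not confirm. The helper names and the `example_1` record are hypothetical and only serve the illustration.

```python
# Workaround sketch: move records containing residues outside ATGCUN to the
# end of the FASTA file. Assumes only the leading records are inspected.
NUCLEOTIDE_ALPHABET = set("ATGCUN")


def parse_fasta(text):
    """Split FASTA text into (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_unambiguous(sequence):
    """True if the sequence only contains A, T, G, C, U or N."""
    return bool(sequence) and set(sequence.upper()) <= NUCLEOTIDE_ALPHABET


def reorder_records(records):
    """Stable sort: passing records first, original order otherwise kept."""
    return sorted(records, key=lambda rec: not is_unambiguous(rec[1]))


def format_fasta(records):
    """Write records back as FASTA text, one sequence line per record."""
    return "".join(f"{header}\n{seq}\n" for header, seq in records)


if __name__ == "__main__":
    # example_1 is a made-up record; 5DO4_3 is the sequence from this thread.
    text = ">5DO4_3\nGGGAAXAAAGXUGAAGUXXUUAXXX\n>example_1\nGGGAUCC\n"
    print(format_fasta(reorder_records(parse_fasta(text))), end="")
```

Because the sort is stable, records keep their relative order within each group, so weekly additions do not reshuffle the rest of the file.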
I think the issue here is that the detection of NA/prot goes wrong at indexing time. So what is also needed is the equivalent to --dbtype 2 but for indexing.
Expected Behavior
Current Behavior
Steps to Reproduce (for bugs)
/opt/mmseqs/MMseqs2-App/docker-compose/databases/pdb_rna_sequence.fasta with some test RNA sequences:
/opt/mmseqs/MMseqs2-App/docker-compose/databases/pdb_rna_sequence.params:
docker-compose up --no-color
MMseqs Output (for bugs)
For comparison, logs looked like this when the RNA indexing was working:
Please note the difference wrt the first argument, -k 15, and --max-seq-len 10000.
When I run any RNA query, e.g.:
The result is this:
This is the healthy state:
Context
On a weekly basis, some new sequences are added to the FASTA file. This week, RNA searching stopped working and seems to report errors for all queries. We didn't make any version changes and have been using mmseqs via MMseqs2-App v5 for a long time. Last week's data now shows the same behavior (but was fine a week ago).
Indexing for protein and DNA sequences still works as expected. DNA k-mers have size 15. I find it peculiar that the top k-mers for RNA sequences don't contain any U (or T), as was the case previously.
Your Environment
Include as many relevant details about the environment you experienced the bug in.
Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): e1a1c1226ef22ac3d0da8e8f71adb8fd2388a249
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): docker image distributed by MMseqs2-App
For self-compiled and Homebrew: Compiler and Cmake versions used and their invocation: N/A
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): 2 CPUs, 16 GB memory
Operating system and version: Ubuntu 20.04
Thank you in advance!