Open josemduarte opened 5 years ago
Thank you for reporting this. I see what's happening here. The prefilter sets the k-mer threshold to 130, and none of the k-mers in the sequence reaches that threshold, so no k-mer is extracted. I will think about a fix.
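For anyone trying to picture this failure mode, here is a toy Python sketch. It is not MMseqs2 code: the per-residue scores in `TOY_SELF_SCORE`, the fallback `DEFAULT_SCORE`, and the function names are all assumptions made up for illustration, whereas MMseqs2 scores k-mers with a real substitution matrix. The only point is that a high enough threshold can filter out every k-mer of a sequence.

```python
# Toy illustration only: hypothetical scores, not the values MMseqs2 uses.
TOY_SELF_SCORE = {"A": 4, "G": 6, "C": 9, "W": 11}
DEFAULT_SCORE = 5  # assumed score for any residue not listed above


def kmers(seq, k):
    """Return all overlapping k-mers of seq (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def self_score(kmer):
    """Sum the toy per-residue scores of a k-mer."""
    return sum(TOY_SELF_SCORE.get(residue, DEFAULT_SCORE) for residue in kmer)


def extract_kmers(seq, k, threshold):
    """Keep only the k-mers whose toy score reaches the threshold."""
    return [km for km in kmers(seq, k) if self_score(km) >= threshold]


if __name__ == "__main__":
    seq = "AAAAAAAAAA"
    print(len(extract_kmers(seq, 6, 20)))   # low threshold: k-mers survive
    print(len(extract_kmers(seq, 6, 130)))  # high threshold: nothing survives
```

With these toy scores the best possible 6-mer scores 66, so a threshold of 130 can never be reached and the result is always empty, which mirrors the "No k-mer could be extracted" situation.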
Any updates on this? I've also run into the same issue when creating a profile database from an MSA and using that profile database to search my target database.
I've tried this with version 13-45111 and it works fine now. I think the issue can be closed @martin-steinegger
Hi @josemduarte and @martin-steinegger , I am still getting the same issue with MMseqs2 Version: 13-45111
No k-mer could be extracted for the database tmp/15694179607629846192/input_step_redundancy. Maybe the sequences length is less than 14 residues.
The commands I am using are:
mmseqs createdb *.fas DB
mmseqs cluster DB DB_clu tmp
Most of my sequences are very small and the maximum length is 20 nucleotide residues.
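Since the error message mentions sequences shorter than 14 residues, it can help to check how many records fall below that length before running MMseqs2. Below is a minimal Python sketch for that check. The function names are hypothetical, and the default of 14 is taken only from the error text above, not from any documented MMseqs2 limit.

```python
def read_fasta_lengths(lines):
    """Yield (header, sequence_length) for each record in FASTA-formatted lines."""
    header, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one first.
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:
        yield header, length


def short_records(lines, min_len=14):
    """Return (header, length) pairs for records shorter than min_len."""
    return [(h, n) for h, n in read_fasta_lengths(lines) if n < min_len]


if __name__ == "__main__":
    example = [">a", "ACGT", "ACGT", ">b", "ACGTACGTACGTACGTACGT"]
    print(short_records(example))
```

To run it on a real file, pass an open file handle instead of the example list, e.g. `short_records(open("my_sequences.fas"))` (the file name here is a placeholder).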
Hi @martin-steinegger, I'm getting a similar (though not identical) error while trying to run indexdb on a nucleotide database that I would like to search repeatedly.
mmseqs createdb target_sequences.fa target_sequencesDB
#The 'target_sequences.fa' contains 67,880 nucleotide fasta records, with lengths ranging from 987 bp to 12,136 bp.
mmseqs createindex target_sequencesDB tmp --spaced-kmer-mode 0 -k 0 -s 7.5 --search-type 3
#I also tried to run createindex with the parameters --max-seq-len 15000 and --mask 0 and received errors similar to the ones shown below.
Createdb works fine, but indexdb crashes. These are the last few lines of the output:
splitsequence target_sequencesDB tmp/12611708828474015781/nucl_split_seq --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --create-lookup 0 --threads 64 --compressed 0 -v 3
[=================================================================] 67.68K 0s 28ms
Time for merging to nucl_split_seq_h: 0h 0m 0s 59ms
Time for merging to nucl_split_seq: 0h 0m 0s 49ms
Time for processing: 0h 0m 0s 257ms
indexdb tmp/12611708828474015781/nucl_split_seq target_sequencesDB --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --alph-size nucl:5,aa:21 --comp-bias-corr 1 --max-seq-len 10000 --max-seqs 300 --mask 1 --mask-lower-case 0 --spaced-kmer-mode 0 -s 7.5 --k-score 2147483647 --check-compatible 0 --search-type 3 --split 0 --split-memory-limit 0 -v 3 --threads 64
Estimated memory consumption: 1G
Write VERSION (0)
Write META (1)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
Write SCOREMATRIXNAME (2)
Write SPACEDPATTERN (23)
Write DBR1INDEX (5)
Write DBR1DATA (6)
Write DBR2INDEX (7)
Write DBR2DATA (8)
Write HDR1INDEX (18)
Write HDR1DATA (19)
Write HDR2INDEX (20)
Write HDR2DATA (21)
Write GENERATOR (22)
Index table: counting k-mers
[=================================================================] 67.72K 1s 204ms
Index table: Masked residues: 41849
No k-mer could be extracted for the database tmp/12611708828474015781/nucl_split_seq. Maybe the sequences length is less than 14 residues.
Error: indexdb died
MMseqs version: 3513001d33301f7eaaf58e60a1376488ff017354
Operating system and version: CentOS Linux 7 (Core)
Hello,
Just to report that this issue keeps happening with short sequences in version 14-7e284.
Here is the log file section:
Query database size: 1 type: Aminoacid
Estimated memory consumption: 977M
Target database size: 1 type: Aminoacid
Index table k-mer threshold: 154 at k-mer size 6
Index table: counting k-mers
[=================================================================] 1 0s 5ms
Index table: Masked residues: 52
No k-mer could be extracted for the database OG29842_tmp/9235789383789574915/clu_tmp/8036944701986152555/input_step_redundancy.
I do not know if it has been addressed previously, but I have the feeling it's due to short sequences, mine are 55-60 AAs long.
We never implemented a fix for this. The issue is that with very small sets and very low sensitivity settings, no similar k-mers can be generated given the (very high) similarity threshold.
The error no longer occurs once sensitivity or database size increases, so fixing it wasn't really a priority.
You can get around the issue with a somewhat ugly hack of using the old single step clustering approach instead of the cascaded clustering:
mmseqs easy-cluster --single-step-clustering -s 6 (or higher) [other clustering params] ...
This will become very slow for larger sets but shouldn't matter for small sets.
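If you script your runs, the workaround can be wrapped in a small helper. The Python sketch below only assembles the command line. The helper name and its arguments are hypothetical, and the flag spelling is copied from the command above, so check `mmseqs easy-cluster -h` for your version in case the flag expects an explicit value.

```python
import shlex


def single_step_cluster_cmd(fasta, out_prefix, tmp_dir, sensitivity=6.0, extra=()):
    """Build an easy-cluster command that uses single-step clustering.

    The flag form follows the workaround suggested in this thread; it is an
    assumption that it applies unchanged to every MMseqs2 version.
    """
    if sensitivity < 6:
        raise ValueError("the suggested workaround uses -s 6 or higher")
    cmd = [
        "mmseqs", "easy-cluster",
        str(fasta), str(out_prefix), str(tmp_dir),
        "--single-step-clustering",
        "-s", str(sensitivity),
    ]
    cmd.extend(extra)  # any other clustering parameters, passed through as-is
    return cmd


if __name__ == "__main__":
    cmd = single_step_cluster_cmd("example.fasta", "clusterRes", "tmp")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when paths contain spaces.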
Hello @milot-mirdita,
Thank you for your quick answer.
Yes, I may not be using MMseqs2 for what it was designed for, and that may be why I ran into this issue. Since I have not seen this described anywhere else, I will describe my situation here in case it helps anyone who lands on this issue:
I have a set of pre-clustered sequences from which I want to find the best representative sequence (I just want one representative, the "centroid" of them). Since I want them all in a single cluster, I decided to use the lowest sensitivity settings, and that led to the above-mentioned issue:
mmseqs easy-cluster example.fasta clusterRes tmp --min-seq-id 0 -c 0 --cov-mode 1
Even then, in a small percentage of cases (7.14% approx), I got more than 1 cluster (up to 4 in some cases). Thus, I may not be using MMseqs2 correctly, but I have not found any better alternative.
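For very small groups like these, one possible alternative is to pick the representative directly as a medoid, i.e. the sequence with the highest total similarity to all the others. The Python sketch below does that with `difflib.SequenceMatcher` from the standard library as a crude stand-in for a real alignment score. That choice, and the function names, are assumptions for illustration only; this is not how MMseqs2 selects representatives, and it scales quadratically with the number of sequences.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude similarity in [0, 1]; a stand-in for a proper alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def pick_medoid(seqs):
    """Return the name of the sequence most similar, in total, to all others.

    seqs maps sequence names to sequence strings. Ties go to the first name.
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    names = list(seqs)
    best_name, best_total = None, -1.0
    for name in names:
        total = sum(
            similarity(seqs[name], seqs[other])
            for other in names
            if other != name
        )
        if total > best_total:
            best_name, best_total = name, total
    return best_name


if __name__ == "__main__":
    group = {"x": "AAAAAAAAAA", "y": "AAAAAAAAAT", "z": "AAAAAAAATT"}
    print(pick_medoid(group))
```

Because every sequence is compared against every other one, this always yields exactly one representative per group, at the cost of being unsuitable for large sets.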
Thanks again for your time!
mmseqs easy-cluster crashes with a very simple file as an input:
Steps to Reproduce (for bugs)
Command to reproduce (with above as input fasta file file.fa):
MMseqs Output (for bugs)
These are the last few lines of the output:
Your Environment