soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
GNU General Public License v3.0

Crash in easy-cluster at prefilter step "no k-mer could be extracted" #149

Open josemduarte opened 5 years ago

josemduarte commented 5 years ago

mmseqs easy-cluster crashes with a very simple file as an input:

>UPI0005E38868_574-614
EKVDQNTADITTNTNSINQNTTDIATNTTSINNLSDSITTL
>A0A2T6WYU8_88-128
ENVSQNTADITTNTNSINQNTTDIATNTTNINNLSDSITTL
>UPI0005DBD517_138-178
EKVDQNTADITTNTNSINQNTTDIATNTTSINNLSNSVTTL
>Q9F2D8_129-169
EKVDQNTADITTNTNSINQNTTDIATNTTSINNLSNSVTTL
>A0A315FWJ3_364-404
EKVDQNTADITTNTDSINQNTTDIATNTTNINSLSNSVTTL
>UPI0005DCDA99_510-550
ENVSQNTADITTNTNSINQNTTDIATNTTSINNLSDSITTL
>UPI000DA3EDE3_467-507
ETVDQNTADIAANTTSINQNTTDIAANTTNINNLSDSVTTL
>G4C802_559-599
ENVSQNTTDITANTDSINQNTTDIATNTTSINNLSNSVTTL
>UPI0004F83B94_392-432
DSINQNTTDIAANTTSINQNTTDIATNTTNINNLSDSITTL
>A0A379V5F6_560-600
ENVSQNTTDITANTDSINQNTTDIATNTTSINNLSNSVTTL
>UPI000B8EBDA4_511-549
INQNTTDIAANTTSINQNTTDIATNTTNINNLSDSVTTL
>UPI0009A94CF4_362-416
EKVDQNTADITANTDSINQNTTDIAANTTSINQNTADIAANTTNINNLSDSVTTL
>UPI000459DB58_363-417
EKVDQNTADITTNTDSINQNTTDIAANTASINQNTTDIATNTTNINSLSNSVTTL
>UPI0009AE3E57_364-418
EKVDQNTADITTNTDSINQNTTDIAANTASINQNTTDIATNTTNINSLSNSVTTL
>A0A2T9DBX0_63-117
EKVDQNTADITTNTDSINQNTTDIAANTASINQNTTDIATNTTNINSLSNSVTTL
>UPI0009B01E32_286-324
VTQNTTDIAANTDSINQNTTDIATNTTNINSLSDSVTTL
>UPI000BA995C1_364-416
EKVDQNTADITANTDSINQNTTDIAANTTSINQNTTEIATNTTNINSLSDSVT
>A0A2X5DK67_115-155
DSINQNTTDIAANTTSISQNTTDIAANTTNINSLSDSVTTL
>V7IUW3_392-428
DSINQNTTDIAANTTSINQNTTDIAANTTNINSLSDS

Steps to Reproduce (for bugs)

Command to reproduce (with the sequences above saved as the input FASTA file file.fa):

mmseqs easy-cluster file.fa /tmp/seqClustering /tmp/tmp-seqClustering --min-seq-id 0.90 -c 0.99 -s 8 --max-seqs 1000 --cluster-mode 1

MMseqs Output (for bugs)

These are the last few lines of the output:

Query database: /tmp/tmp-seqClustering/9466533042670559081/clu_tmp/5063784659926941655/input_step_redundancy(size=14)
Process prefiltering step 1 of 1

Index table k-mer threshold: 130
Index table: counting k-mers...

Index table: Masked residues: 251
No k-mer could be extracted for the database /tmp/tmp-seqClustering/9466533042670559081/clu_tmp/5063784659926941655/input_step_redundancy.
Maybe the sequences length is less than 14 residues.
Error: Prefilter step 0 died
Error: Search died

martin-steinegger commented 5 years ago

Thank you for reporting this. I see what's happening here. The prefilter sets the k-mer threshold to 130, and none of the k-mers in the sequences reaches that threshold, so no k-mer is extracted. I will think about a fix.
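For readers following the explanation above, here is a toy Python sketch of how a score threshold can leave no k-mers at all. It is not MMseqs2's implementation: the function names are hypothetical, the scores are the unscaled diagonal of BLOSUM62 rather than whatever matrix and scaling MMseqs2 uses internally, k = 6 is borrowed from a later log in this thread, and the threshold of 40 is an arbitrary value on this toy scale (it is not comparable to the 130 in the log).

```python
# Toy model only: MMseqs2's real matrix, score scaling and threshold logic differ.
# Diagonal (self-match) scores of the BLOSUM62 substitution matrix.
BLOSUM62_DIAGONAL = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6, "H": 8, "I": 4,
    "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4, "T": 5, "W": 11, "Y": 7, "V": 4,
}


def kmer_self_score(kmer):
    """Score of a k-mer aligned against itself (sum of diagonal entries)."""
    return sum(BLOSUM62_DIAGONAL[residue] for residue in kmer)


def kmers_reaching_threshold(sequence, k, threshold):
    """Return every k-mer of `sequence` whose self-score is >= threshold."""
    kmers = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [kmer for kmer in kmers if kmer_self_score(kmer) >= threshold]


# First sequence from the report (UPI0005E38868_574-614).
seq = "EKVDQNTADITTNTNSINQNTTDIATNTTSINNLSDSITTL"

print(len(kmers_reaching_threshold(seq, k=6, threshold=0)))   # every 6-mer passes
print(len(kmers_reaching_threshold(seq, k=6, threshold=40)))  # none passes
```

The sequences in the report consist almost entirely of residues with low self-scores (T, N, S, I, A, L, D), so on this toy scale no 6-mer can score above 36. Once the threshold exceeds what the best k-mer can reach, the extracted set is empty, which is the situation the error message describes.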

etowahadams commented 3 years ago

Any updates on this? I've run into the same issue when creating a profile database from an MSA and then using that profile database to search my target database.

josemduarte commented 3 years ago

I've tried this with version 13-45111 and it works fine now. I think the issue can be closed, @martin-steinegger.

anganara commented 2 years ago

Hi @josemduarte and @martin-steinegger, I am still getting the same issue with MMseqs2 version 13-45111:

No k-mer could be extracted for the database tmp/15694179607629846192/input_step_redundancy.
Maybe the sequences length is less than 14 residues.

The commands I am using are:

mmseqs createdb *.fas DB
mmseqs cluster DB DB_clu tmp

Most of my sequences are very short; the maximum length is 20 nucleotides.
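Since the error message hints at sequences shorter than 14 residues, a quick length check on the input can help rule that cause in or out. The following is a minimal Python sketch with hypothetical helper names; the cutoff of 14 is taken from the error text only, and whether MMseqs2 applies exactly that limit is an assumption. The two example records are made up for illustration.

```python
def read_fasta_lengths(lines):
    """Return {header: sequence length} for an iterable of FASTA lines."""
    lengths = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            lengths[header] = 0  # a repeated header would overwrite the earlier record
        elif header is not None:
            lengths[header] += len(line)
    return lengths


def shorter_than(lengths, min_len=14):
    """Headers of all records shorter than `min_len` residues."""
    return sorted(h for h, n in lengths.items() if n < min_len)


# Made-up records; for a real file use: read_fasta_lengths(open("input.fas"))
example = """>seq_a
ACGTACGTACGTACGTACGT
>seq_b
ACGTACGT
""".splitlines()

lengths = read_fasta_lengths(example)
print(lengths)
print(shorter_than(lengths))
```

Note that the original report in this thread fails with sequences of 37 to 55 residues, so a clean result from this check does not exclude the threshold-related cause discussed above.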

tparket commented 2 years ago

Hi @martin-steinegger, I'm getting a similar (though not identical) error while trying to run indexdb on a nucleotide database that I would like to search repeatedly.

Commands to reproduce:

mmseqs createdb target_sequences.fa target_sequencesDB

#The 'target_sequences.fa' contains 67,880 nucleotide fasta records, with lengths ranging from 987 bp to 12,136 bp.

mmseqs createindex target_sequencesDB tmp --spaced-kmer-mode 0 -k 0 -s 7.5 --search-type 3

#I also tried running createindex with the parameters --max-seq-len 15000 and --mask 0 and received errors similar to the ones shown below.

MMseqs Output (for bugs)

Createdb works fine, but indexdb crashes. These are the last few lines of the output:

splitsequence target_sequencesDB tmp/12611708828474015781/nucl_split_seq --max-seq-len 10000 --sequence-overlap 0 --sequence-split-mode 1 --create-lookup 0 --threads 64 --compressed 0 -v 3

[=================================================================] 67.68K 0s 28ms
Time for merging to nucl_split_seq_h: 0h 0m 0s 59ms
Time for merging to nucl_split_seq: 0h 0m 0s 49ms
Time for processing: 0h 0m 0s 257ms
indexdb tmp/12611708828474015781/nucl_split_seq target_sequencesDB --seed-sub-mat nucl:nucleotide.out,aa:VTML80.out -k 0 --alph-size nucl:5,aa:21 --comp-bias-corr 1 --max-seq-len 10000 --max-seqs 300 --mask 1 --mask-lower-case 0 --spaced-kmer-mode 0 -s 7.5 --k-score 2147483647 --check-compatible 0 --search-type 3 --split 0 --split-memory-limit 0 -v 3 --threads 64

Estimated memory consumption: 1G
Write VERSION (0)
Write META (1)
Write SCOREMATRIX3MER (4)
Write SCOREMATRIX2MER (3)
Write SCOREMATRIXNAME (2)
Write SPACEDPATTERN (23)
Write DBR1INDEX (5)
Write DBR1DATA (6)
Write DBR2INDEX (7)
Write DBR2DATA (8)
Write HDR1INDEX (18)
Write HDR1DATA (19)
Write HDR2INDEX (20)
Write HDR2DATA (21)
Write GENERATOR (22)
Index table: counting k-mers
[=================================================================] 67.72K 1s 204ms
Index table: Masked residues: 41849
No k-mer could be extracted for the database tmp/12611708828474015781/nucl_split_seq.
Maybe the sequences length is less than 14 residues.
Error: indexdb died

Your Environment

MMseqs version: 3513001d33301f7eaaf58e60a1376488ff017354
Operating system and version: CentOS Linux 7 (Core)

sgarciah12 commented 10 months ago

Hello,

Just to report that this issue keeps happening with short sequences in version 14-7e284.

Here is the log file section:

Query database size: 1 type: Aminoacid
Estimated memory consumption: 977M
Target database size: 1 type: Aminoacid
Index table k-mer threshold: 154 at k-mer size 6 
Index table: counting k-mers
[=================================================================] 1 0s 5ms
Index table: Masked residues: 52
No k-mer could be extracted for the database OG29842_tmp/9235789383789574915/clu_tmp/8036944701986152555/input_step_redundancy.

I do not know whether this has been addressed previously, but I have the feeling it is due to short sequences; mine are 55-60 AAs long.

milot-mirdita commented 10 months ago

We never implemented a fix for this. The problem is that with very small sets and very low sensitivity settings, no similar k-mers can be generated given the (very high) similarity threshold.

The error no longer occurs once the sensitivity or database size increases, so fixing it wasn't really a priority.

You can get around the issue with a somewhat ugly hack of using the old single step clustering approach instead of the cascaded clustering:

mmseqs easy-cluster --single-step-clustering -s 6 (or higher) [other clustering params] ...

This will become very slow for larger sets but shouldn't matter for small sets.
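For anyone scripting the workaround above, here is a small Python sketch that assembles the argument list. The helper name is hypothetical; the flags `--single-step-clustering` and `-s` are the ones quoted in the comment, and the file paths and extra clustering parameters are reused from the original report. Placing the positional arguments before the options follows that report's command and is an assumption, so check `mmseqs easy-cluster -h` for your version before relying on it.

```python
import shlex


def build_single_step_cmd(fasta, out_prefix, tmp_dir, sensitivity=6.0, extra_args=()):
    """Assemble an `mmseqs easy-cluster` call using the single-step workaround."""
    if sensitivity < 6:
        # The suggestion in the thread is "-s 6 (or higher)".
        raise ValueError("use a sensitivity of 6 or higher for this workaround")
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--single-step-clustering", "-s", f"{sensitivity:g}",
        *extra_args,
    ]


cmd = build_single_step_cmd(
    "file.fa", "/tmp/seqClustering", "/tmp/tmp-seqClustering",
    extra_args=["--min-seq-id", "0.90", "-c", "0.99"],
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list instead of a single string avoids shell quoting problems when paths contain spaces.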

sgarciah12 commented 10 months ago

Hello @milot-mirdita,

Thank you for your quick answer.

Yes, I may not be using MMseqs2 for what it was designed for, which is probably why I ran into this issue. Since I have not seen this described anywhere else, I will describe my situation here in case it helps anyone else who lands on this issue:

I have a set of pre-clustered sequences from which I want to find the best representative sequence (just one representative, the "centroid"). Since I want them all in a single cluster, I set the parameters to the lowest sensitivity, and that led to the issue mentioned above:

mmseqs easy-cluster example.fasta clusterRes tmp --min-seq-id 0 -c 0 --cov-mode 1

Even then, in a small percentage of cases (approx. 7.14%), I got more than one cluster (up to 4 in some cases). So I may not be using MMseqs2 correctly, but I have not found a better alternative.

Thanks again for your time!
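For readers with the same "one centroid per set" use case, checking how many clusters a run produced can be scripted. The sketch below is a hedged Python example with a hypothetical helper name. It assumes the cluster table written by `easy-cluster` (here it would be `clusterRes_cluster.tsv`) has two tab-separated columns, representative first and member second; the sequence names in the example are made up.

```python
from collections import defaultdict


def clusters_from_tsv(lines):
    """Group members by representative from 'representative<TAB>member' lines."""
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        representative, member = line.split("\t")[:2]
        clusters[representative].append(member)
    return dict(clusters)


# Made-up table; for a real run use: clusters_from_tsv(open("clusterRes_cluster.tsv"))
example = ["s1\ts1\n", "s1\ts2\n", "s3\ts3\n"]

clusters = clusters_from_tsv(example)
print(len(clusters))
for representative, members in clusters.items():
    print(representative, len(members))
```

A set that ends up with more than one key in `clusters` is one of the cases described above where the input did not collapse into a single cluster.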