soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Segmentation fault in linclust #297

Open manock opened 4 years ago

manock commented 4 years ago

Hello,

Expected Behavior

Output clustering results.

Current Behavior

Segmentation fault in linclust.sh

Steps to Reproduce (for bugs)

mmseqs createdb seq.fa db/dbclust
mmseqs linclust db/dbclust clust_result tmp --max-seq-len 30000000

MMseqs Output (for bugs)

kmermatcher db/dbclust tmp/16437734971973434362/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 13 --min-seq-id 0.9 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 30000000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 36 --compressed 0 -v 3

Database size: 140204 type: Nucleotide

Generate k-mers list for 1 split
[=================================================================] 140.20K 1m 19s 398ms

Adjusted k-mer length 17
Sort kmer 0h 0m 0s 95ms
Sort by rep. sequence 0h 0m 0s 17ms
Time for fill: 0h 0m 0s 29ms
Time for merging to pref: 0h 0m 0s 21ms
Time for processing: 0h 1m 20s 543ms
rescorediagonal db/dbclust db/dbclust tmp/16437734971973434362/pref tmp/16437734971973434362/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 36 --compressed 0 -v 3

[=================================================================] 140.20K 2m 37s 793ms
Time for merging to pref_rescore1: 0h 0m 0s 35ms
Time for processing: 0h 2m 48s 60ms
clust db/dbclust tmp/16437734971973434362/pref_rescore1 tmp/16437734971973434362/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 36 --compressed 0 -v 3

Clustering mode: Set Cover
[=================================================================] 140.20K 0s 7ms
Sort entries
Find missing connections
Found 44 new connections.
Reconstruct initial order
[=================================================================] 140.20K 0s 7ms
Add missing connections
[=================================================================] 140.20K 0s 3ms

Time for read in: 0h 0m 0s 42ms
Total time: 0h 0m 0s 64ms

Size of the sequence database: 140204
Size of the alignment database: 140204
Number of clusters: 140160

Writing results 0h 0m 0s 28ms
Time for merging to pre_clust: 0h 0m 0s 22ms
Time for processing: 0h 0m 0s 144ms
createsubdb tmp/16437734971973434362/order_redundancy db/dbclust tmp/16437734971973434362/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 16ms
Time for processing: 0h 0m 0s 46ms
createsubdb tmp/16437734971973434362/order_redundancy tmp/16437734971973434362/pref tmp/16437734971973434362/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 12ms
Time for processing: 0h 0m 0s 36ms
filterdb tmp/16437734971973434362/pref_filter1 tmp/16437734971973434362/pref_filter2 --filter-file tmp/16437734971973434362/order_redundancy

Filtering using file(s)
[=================================================================] 140.16K 0s 15ms
Time for merging to pref_filter2: 0h 0m 0s 35ms
Time for processing: 0h 0m 0s 92ms
align tmp/16437734971973434362/input_step_redundancy tmp/16437734971973434362/input_step_redundancy tmp/16437734971973434362/pref_filter2 tmp/16437734971973434362/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 2 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 30000000 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --zdrop 40 --threads 36 --compressed 0 -v 3

Compute score and coverage
Query database size: 140160 type: Nucleotide
Target database size: 140160 type: Nucleotide
Calculation of alignments
[============
tmp/16437734971973434362/linclust.sh: line 75: 22654 Segmentation fault      $RUNNER "$MMSEQS" "${ALIGN_MODULE}" "$INPUT" "$INPUT" "$RESULTDB" "${TMP_PATH}/aln" ${ALIGNMENT_PAR}
Error: Alignment step died

Context

I have a FASTA file with about 140,000 sequences, ranging from a few thousand nucleotides to about 20 million. Memory consumption is fine throughout the mmseqs steps, but at some point during the align phase a segmentation fault occurs. It doesn't seem to be a memory problem. I also tried the easy-clust workflow and the cluster module; both fail at the same point.
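
To put concrete numbers on the length range described above, the input could be summarized before clustering. The following is a minimal sketch, assuming a POSIX shell and awk; `fasta_length_summary` is a hypothetical helper name, not part of MMseqs2.

```shell
# Hypothetical helper: print the record count and the shortest and longest
# sequence length (in residues) of a FASTA file given as $1.
fasta_length_summary() {
    awk '
        function record() {
            n++
            if (n == 1 || len < min) min = len
            if (len > max) max = len
        }
        # A header line closes the previous record and starts a new one.
        /^>/ { if (seen) record(); seen = 1; len = 0; next }
        # Sequence lines: strip a trailing carriage return, then add the length.
        { sub(/\r$/, ""); len += length($0) }
        END {
            if (seen) record()
            printf "count=%d min=%d max=%d\n", n, min, max
        }
    ' "$1"
}
```

Calling `fasta_length_summary seq.fa` on the file from the report would show whether the longest record stays below the `--max-seq-len 30000000` limit passed to linclust.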

martin-steinegger commented 4 years ago

I have never tried to cluster such long sequences. Can you isolate the issue?

manock commented 4 years ago

The error occurs in the call to ksw_extz2_sse in BandedNucleotideAligner::align.

I ran a few tests with an increasing number of sequences in the database. I tested up to 50,000 sequences and everything ran fine.

I also ran a test with the longest sequence plus about 5,000 other sequences, and that ran fine as well.
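
The subset tests described above could be scripted along these lines. This is a minimal sketch, assuming a shell with `local` support (bash or dash) and awk. `fasta_head` and `bisect_linclust` are hypothetical helper names, and the `subset_N` directory layout is made up for illustration. The mmseqs commands and the `--max-seq-len` value are the ones from the original report.

```shell
# Hypothetical helper: write the first N records ($2) of a FASTA file ($1) to stdout.
fasta_head() {
    awk -v n="$2" '/^>/ { if (++count > n) exit } { print }' "$1"
}

# Hypothetical driver: cluster growing prefixes of the input and report the
# first subset size at which linclust fails.
# Usage: bisect_linclust FASTA SIZE [SIZE ...]
bisect_linclust() {
    local fasta=$1
    shift
    local n work
    for n in "$@"; do
        work="subset_$n"
        mkdir -p "$work/db" "$work/tmp"
        fasta_head "$fasta" "$n" > "$work/subset.fa"
        mmseqs createdb "$work/subset.fa" "$work/db/dbclust" || return 1
        if ! mmseqs linclust "$work/db/dbclust" "$work/clust_result" "$work/tmp" \
                --max-seq-len 30000000; then
            echo "first failing subset size: $n"
            return 0
        fi
    done
    echo "no failure up to the largest subset tested"
}
```

For example, `bisect_linclust seq.fa 50000 75000 100000 140204` would narrow down where the failure first appears. Prefix subsets only narrow the range: if the crash depends on a particular pair of sequences being aligned against each other, the offending records still have to be isolated within the failing subset.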

jianye00 commented 1 year ago

I also encountered a segmentation fault when clustering long nucleotide sequences (up to 99 million bases). Has anyone had luck with long sequences?

==========
Invalid database read for id=4294967295, database index=dump/9317603370475534640/input_step_redundancy.index
getSeqLen: local id (4294967295) >= db size (8247802)
=====================
Error: Offset step died
[===
dump/16153251853230858118/linclust/13629425479186879042/linclust.sh: line 76: 195145 Segmentation fault (core dumped) $RUNNER "$MMSEQS" "${ALIGN_MODULE}" "$INPUT" "$INPUT" "$RESULTDB" "${TMP_PATH}/aln" ${ALIGNMENT_PAR}
Error: Alignment step died
Error: linclust died
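
One detail in this log is worth noting: the id 4294967295 equals 2^32 - 1, the largest value an unsigned 32-bit integer can hold. This is also what -1 looks like when stored in an unsigned 32-bit field, a pattern often used as a "not found" sentinel. Whether MMseqs2 uses the value that way here is an assumption; the log itself does not confirm it. The arithmetic can be checked in a shell with 64-bit integer arithmetic:

```shell
# Largest unsigned 32-bit value.
max_u32=$(( (1 << 32) - 1 ))
echo "$max_u32"

# -1 truncated to 32 bits and read as unsigned gives the same number.
minus_one_u32=$(( -1 & 0xFFFFFFFF ))
echo "$minus_one_u32"
```

Both lines print 4294967295, far above the reported database size of 8247802, which is consistent with the lookup failing rather than with a real sequence id.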