Segmentation fault in linclust

manock commented 4 years ago

Hello,

Expected Behavior

Output clustering results.

Current Behavior

Segmentation fault in linclust.sh

Steps to Reproduce (for bugs)

mmseqs createdb seq.fa db/dbclust
mmseqs linclust db/dbclust clust_result tmp --max-seq-len 30000000

MMseqs Output (for bugs)


kmermatcher db/dbclust tmp/16437734971973434362/pref --sub-mat nucl:nucleotide.out,aa:blosum62.out --alph-size 13 --min-seq-id 0.9 --kmer-per-seq 21 --spaced-kmer-mode 0 --kmer-per-seq-scale 0 --adjust-kmer-len 0 --mask 0 --mask-lower-case 0 --cov-mode 0 -k 0 -c 0.8 --max-seq-len 30000000 --hash-shift 67 --split-memory-limit 0 --include-only-extendable 0 --ignore-multi-kmer 0 --threads 36 --compressed 0 -v 3

Database size: 140204 type: Nucleotide

Generate k-mers list for 1 split
[=================================================================] 140.20K 1m 19s 398ms

Adjusted k-mer length 17
Sort kmer 0h 0m 0s 95ms
Sort by rep. sequence 0h 0m 0s 17ms
Time for fill: 0h 0m 0s 29ms
Time for merging to pref: 0h 0m 0s 21ms
Time for processing: 0h 1m 20s 543ms
rescorediagonal db/dbclust db/dbclust tmp/16437734971973434362/pref tmp/16437734971973434362/pref_rescore1 --sub-mat nucl:nucleotide.out,aa:blosum62.out --rescore-mode 0 --wrapped-scoring 0 --filter-hits 0 -e 0.001 -c 0.8 -a 0 --cov-mode 0 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --add-self-matches 0 --sort-results 0 --db-load-mode 0 --threads 36 --compressed 0 -v 3

[=================================================================] 140.20K 2m 37s 793ms
Time for merging to pref_rescore1: 0h 0m 0s 35ms
Time for processing: 0h 2m 48s 60ms
clust db/dbclust tmp/16437734971973434362/pref_rescore1 tmp/16437734971973434362/pre_clust --cluster-mode 0 --max-iterations 1000 --similarity-type 2 --threads 36 --compressed 0 -v 3

Clustering mode: Set Cover
[=================================================================] 140.20K 0s 7ms
Sort entries
Find missing connections
Found 44 new connections.
Reconstruct initial order
[=================================================================] 140.20K 0s 7ms
Add missing connections
[=================================================================] 140.20K 0s 3ms

Time for read in: 0h 0m 0s 42ms
Total time: 0h 0m 0s 64ms

Size of the sequence database: 140204
Size of the alignment database: 140204
Number of clusters: 140160

Writing results 0h 0m 0s 28ms
Time for merging to pre_clust: 0h 0m 0s 22ms
Time for processing: 0h 0m 0s 144ms
createsubdb tmp/16437734971973434362/order_redundancy db/dbclust tmp/16437734971973434362/input_step_redundancy -v 3 --subdb-mode 1

Time for merging to input_step_redundancy: 0h 0m 0s 16ms
Time for processing: 0h 0m 0s 46ms
createsubdb tmp/16437734971973434362/order_redundancy tmp/16437734971973434362/pref tmp/16437734971973434362/pref_filter1 -v 3 --subdb-mode 1

Time for merging to pref_filter1: 0h 0m 0s 12ms
Time for processing: 0h 0m 0s 36ms
filterdb tmp/16437734971973434362/pref_filter1 tmp/16437734971973434362/pref_filter2 --filter-file tmp/16437734971973434362/order_redundancy

Filtering using file(s)
[=================================================================] 140.16K 0s 15ms
Time for merging to pref_filter2: 0h 0m 0s 35ms
Time for processing: 0h 0m 0s 92ms
align tmp/16437734971973434362/input_step_redundancy tmp/16437734971973434362/input_step_redundancy tmp/16437734971973434362/pref_filter2 tmp/16437734971973434362/aln --sub-mat nucl:nucleotide.out,aa:blosum62.out -a 0 --alignment-mode 2 --wrapped-scoring 0 -e 0.001 --min-seq-id 0.9 --min-aln-len 0 --seq-id-mode 0 --alt-ali 0 -c 0.8 --cov-mode 0 --max-seq-len 30000000 --comp-bias-corr 1 --realign 0 --max-rejected 2147483647 --max-accept 2147483647 --add-self-matches 0 --db-load-mode 0 --pca 1 --pcb 1.5 --score-bias 0 --gap-open 11 --gap-extend 1 --zdrop 40 --threads 36 --compressed 0 -v 3

Compute score and coverage
Query database size: 140160 type: Nucleotide
Target database size: 140160 type: Nucleotide
Calculation of alignments
[============
tmp/16437734971973434362/linclust.sh: line 75: 22654 Segmentation fault      $RUNNER "$MMSEQS" "${ALIGN_MODULE}" "$INPUT" "$INPUT" "$RESULTDB" "${TMP_PATH}/aln" ${ALIGNMENT_PAR}
Error: Alignment step died

Context

I have a FASTA file with about 140,000 sequences ranging from a few thousand nucleotides to about 20 million. Memory consumption is fine throughout the mmseqs steps, but at some point during the align phase a segmentation fault is thrown. It doesn't seem to be a memory problem. I also tried the easy-clust workflow and the cluster module; both fail at the same point.
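
For anyone trying to reproduce this, a quick look at the length distribution of the input can help pick out the very long sequences. The following is a minimal Python sketch, not part of MMseqs2; the function name `fasta_lengths` and the two example records are hypothetical.

```python
import io


def fasta_lengths(handle):
    """Yield (header, length) for each record in a FASTA stream.

    Reads line by line, so multi-megabase sequences are never held
    in memory as a whole.
    """
    header, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, length
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:
        yield header, length


# Hypothetical toy input standing in for seq.fa.
example = io.StringIO(">a\nACGT\nAC\n>b\nGGG\n")
lengths = dict(fasta_lengths(example))
longest = max(lengths, key=lengths.get)
print(lengths, longest)
```

On a real file, `open("seq.fa")` would replace the `io.StringIO` object.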

Your Environment


Git commit used (The string after "MMseqs Version:" when you execute MMseqs without any parameters): 11.e1a1c
Which MMseqs version was used (Statically-compiled, self-compiled, Homebrew, etc.): conda latest version.
Server specifications (especially CPU support for AVX2/SSE and amount of system memory): Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz (x36), Memory: 72 GB
Operating system and version: Amazon Linux 2

martin-steinegger commented 4 years ago

I never tried to cluster such long sequences. Can you isolate the issue?

manock commented 4 years ago

The error happens in the call to ksw_extz2_sse in BandedNucleotideAligner::align.

I ran a few tests with an increasing number of sequences in the database. Up to 50,000 sequences, everything ran fine.

I also ran a test with the longest sequence plus about 5,000 other sequences, and it ran fine as well.
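
Subset tests like these could be scripted along the following lines. This is only a sketch in Python under stated assumptions: the helper names `read_fasta`, `longest_plus_others`, and `to_fasta` are hypothetical, the example records are made up, and the whole file is loaded into memory, which may be too much for very large inputs.

```python
import io


def read_fasta(handle):
    """Return a list of (header, sequence) tuples from a FASTA stream."""
    records, header, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def longest_plus_others(records, n_others):
    """Return the longest record followed by the first n_others remaining ones."""
    if not records:
        return []
    longest_idx = max(range(len(records)), key=lambda i: len(records[i][1]))
    others = [r for i, r in enumerate(records) if i != longest_idx][:n_others]
    return [records[longest_idx]] + others


def to_fasta(records):
    """Serialize (header, sequence) tuples back to FASTA text."""
    return "".join(f">{h}\n{s}\n" for h, s in records)


# Hypothetical toy input; a real run would read seq.fa instead.
records = read_fasta(io.StringIO(">a\nAC\n>b\nACGT\nACGT\n>c\nA\n"))
subset = longest_plus_others(records, 1)
print(to_fasta(subset), end="")
```

The resulting text can be written to a new FASTA file and passed through the same `mmseqs createdb` and `mmseqs linclust` commands as in the report. Repeating this with different subset sizes narrows down which sequences trigger the crash.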

jianye00 commented 1 year ago

I also ran into a segmentation fault when clustering long nucleotide sequences (up to 99 million bases). Has anyone had any luck with long sequences?

==========Invalid database read for id=4294967295, database index=dump/9317603370475534640/input_step_redundancy.index
getSeqLen: local id (4294967295) >= db size (8247802)
=====================Error: Offset step died
[===dump/16153251853230858118/linclust/13629425479186879042/linclust.sh: line 76: 195145 Segmentation fault (core dumped) $RUNNER "$MMSEQS" "${ALIGN_MODULE}" "$INPUT" "$INPUT" "$RESULTDB" "${TMP_PATH}/aln" ${ALIGNMENT_PAR}
Error: Alignment step died
Error: linclust died
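
One detail in this log may be worth noting: the reported id 4294967295 equals 2^32 - 1, the largest unsigned 32-bit value, which is the bit pattern of -1 when read as a signed 32-bit integer. The short Python check below only confirms that arithmetic. Reading it as a failed lookup or a wrapped index rather than a real record id is an assumption, not something this thread confirms.

```python
import struct

reported_id = 4294967295  # id from the "Invalid database read" message
db_size = 8247802         # db size from the same message

UINT32_MAX = 2**32 - 1

# Reinterpret the unsigned 32-bit pattern as a signed 32-bit integer.
as_signed = struct.unpack("<i", struct.pack("<I", reported_id))[0]

is_uint32_max = reported_id == UINT32_MAX
out_of_range = reported_id >= db_size

print(is_uint32_max, as_signed, out_of_range)
```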
