Prefilter step died with easy-cluster

ivandatasci commented 6 months ago

Expected Behavior

easy-cluster should finish execution without errors

Current Behavior

mmseqs easy-cluster errors and crashes with:

Error: Prefilter step died
Error: Search died

Steps to Reproduce (for bugs)

a) Get the input sequences, which I have split into 3 files to fit within GitHub's upload limits:

my_seqs.1of3.fasta.gz my_seqs.2of3.fasta.gz my_seqs.3of3.fasta.gz

b) Consolidate the 3 chunks:

zcat my_seqs.*.fasta.gz > /tmp/my_seqs.fasta

c) Execute and expose the bug:

/opt/mmseqs/bin/mmseqs easy-cluster \
/tmp/my_seqs.fasta /tmp/my_seqs/result /tmp/my_seqs/tmp \
--dbtype 2 --threads 8 --local-tmp /tmp \
--cluster-reassign -s 7.5 --cov-mode 0 -c 0.98 --cluster-mode 2 --min-seq-id 0.99 -v 1

and the bug is shown below

MMseqs Output (for bugs)

/tmp/my_seqs/tmp/5280277461515018798/clu_tmp/18196956704942050314/nucleotide_clustering.sh: line 48:  4723 Segmentation fault      (core dumped) $RUNNER "$MMSEQS" prefilter "$QUERY" "$INPUT" "${TMP_PATH}/pref" ${PREFILTER_PAR}
Error: Prefilter step died
Error: Search died

Context

In my hands, this bug appears only when the number of nucleotide sequences is on the order of millions. For small sets (thousands), the execution completes uneventfully. I have tried the precompiled AVX2 and SSE4.1 versions, my own compilation of the latest release (15-6f452, Oct 31 2023), the latest GitHub version (f6c9880), and other variations. All attempts led to the exact same bug.

I have also tried three other input datasets. All four crash in the same way, and all four contain on the order of 3 to 4 million nucleotide sequences.

When I subset the sequences to about 200K sequences, easy-cluster runs to completion.
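
The thread does not say how the 200K subset was drawn. One simple way to produce such a subset is to keep the first N records of the FASTA file. The sketch below does that with awk; `fasta_head` is a hypothetical helper name (not an MMseqs2 command) and the paths in the usage comment are illustrative.

```shell
# Hypothetical helper, not part of MMseqs2: print the first N records of a FASTA file.
# A record is a '>' header line plus all sequence lines up to the next header.
# Usage: fasta_head N input.fasta > subset.fasta
fasta_head() {
    awk -v n="$1" '
        /^>/ { if (++count > n) exit }   # stop once header number N+1 is reached
        { print }                        # otherwise pass the line through
    ' "$2"
}

# Example (illustrative paths):
#   fasta_head 200000 /tmp/my_seqs.fasta > /tmp/my_seqs.200k.fasta
```

Taking the first N records is only one choice; a random sample may behave differently if similar sequences are grouped together in the input.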

Your Environment

I am running this on an AWS EC2 instance of type g4dn (128GB RAM). Here you go:

$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  16
    Socket(s):           1
    Stepping:            7
    BogoMIPS:            4999.98
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht
                         syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid
                         aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt
                         tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch pti fsgsbase
                         tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
Virtualization features: 
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   512 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    16 MiB (16 instances)
  L3:                    35.8 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-31
Vulnerabilities:         
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Vulnerable
  Spec rstack overflow:  Not affected
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

milot-mirdita commented 6 months ago

@martin-steinegger

* thread #5, stop reason = EXC_BAD_ACCESS (code=1, address=0x5a7684002)
    frame #0: 0x0000000100169b58 mmseqs`CacheFriendlyOperations<2u>::findDuplicates(this=0x0000600000c08090, output=0x00000005a72a2336, outputSize=580749, computeTotalScore=true) at CacheFriendlyOperations.cpp:229:50
   226                  const unsigned int element = tmpElementBuffer[n].id;
   227                  const unsigned int hashBinElement = element >> (MASK_0_5_BIT);
   228                  output[doubleElementCount].id    = element;
-> 229                  output[doubleElementCount].count = duplicateBitArray[hashBinElement];
   230                  output[doubleElementCount].diagonal = tmpElementBuffer[n].diagonal;
   231

(lldb) p hashBinElement
(const unsigned int) 742456
(lldb) p duplicateBitArray
(unsigned char *) 0x00000005b8008000 ""
(lldb) p doubleElementCount
(size_t) 581514
(lldb) p duplicateBitArray
(unsigned char *) 0x00000005b8008000 ""
(lldb) p output[doubleElementCount]
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory
(lldb) p output
(CounterResult *) 0x00000005a72a2336
(lldb) p duplicateBitArray[hashBinElement]
(unsigned char) '\x01'
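
The numbers in the debugger session above can be cross-checked with a little arithmetic. The sketch below uses only values printed in the trace, plus one labeled assumption: that each `CounterResult` element occupies 7 bytes (the struct layout is not shown in this thread). Under that assumption, the faulting address falls inside element `doubleElementCount` of `output`, and that index is past `outputSize`, which is consistent with a write beyond the end of the output buffer.

```shell
# Values copied from the lldb session above.
fault_addr=$(( 0x5a7684002 ))   # address reported with EXC_BAD_ACCESS
output_ptr=$(( 0x5a72a2336 ))   # value of `output`
index=581514                    # doubleElementCount
output_size=580749              # outputSize argument of findDuplicates

# Distance in bytes from the start of `output` to the faulting address.
byte_offset=$(( fault_addr - output_ptr ))

# How many elements past the declared output size the write landed.
overshoot=$(( index - output_size ))

# ASSUMPTION (not confirmed in this thread): one element is 7 bytes.
elem_size=7
offset_in_elem=$(( byte_offset - index * elem_size ))

echo "byte offset from output:   $byte_offset"
echo "elements past outputSize:  $overshoot"
echo "offset inside that element: $offset_in_elem (assuming $elem_size-byte elements)"
```

If the 7-byte assumption holds, the in-element offset comes out between 0 and 6, which is what one would expect if the fault is the store to a field of `output[doubleElementCount]`. This does not say why the index exceeded `outputSize`, only that the trace is consistent with it doing so.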

milot-mirdita commented 6 months ago

Also interesting: a lot of overrepresented k-mers (same prefix/suffix?)

Query database size: 3083342 type: Nucleotide
Estimated memory consumption: 12G
Target database size: 1541671 type: Nucleotide
Index table k-mer threshold: 0 at k-mer size 15
Index table: counting k-mers
[=================================================================] 100.00% 1.54M 2m 38s 193ms
Index table: Masked residues: 141067
Index table: fill
[=================================================================] 100.00% 1.54M 1m 10s 152ms
Index statistics
Entries:          516344842
DB size:          11146 MB
Avg k-mer size:   0.480884
Top 10 k-mers
    GGGCTCAGGATTCTG 1282098
    CTGCTCTGGGCGCGT 1167098
    TGAGCTGGGCATGAG 1134437
    AAGTTCCTCACTCGG 1086133
    CTGTAAGCTGCTCGT 966085
    AGCTACATTGATCGC 943599
    CAGCGACACTGCTTG 913837
    CCTCGCACGCCTGAG 883990
    CCTCTGCACTCGCTG 827574
    GAGCTGGAAGCTGGT 791516

ivandatasci commented 6 months ago

> Also interesting, a lot of overrepresented k-mers (same prefix/suffix?)

@milot-mirdita

That is correct: these millions of sequences are derived from a small set of common ancestor sequences. In short, they are very similar to one another in some portions.
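
To see this kind of redundancy without running MMseqs2, one can count the most frequent k-mers in a sample of the input. The sketch below is a naive illustration, not how MMseqs2 builds its index table: `top_kmers` is a hypothetical helper name, it counts contiguous k-mers per line (so k-mers spanning line breaks in multi-line FASTA records are missed), and it holds all counts in memory, so it should only be run on a small sample.

```shell
# Hypothetical helper, not part of MMseqs2: list the N most frequent k-mers in a FASTA file.
# Usage: top_kmers K N input.fasta
top_kmers() {
    awk -v k="$1" '
        !/^>/ {                                   # skip header lines
            s = toupper($0)
            for (i = 1; i + k - 1 <= length(s); i++)
                count[substr(s, i, k)]++          # tally every k-mer on this line
        }
        END { for (kmer in count) print count[kmer], kmer }
    ' "$3" | sort -k1,1nr -k2,2 | head -n "$2"    # highest count first, ties by k-mer
}

# Example (illustrative path, small sample only):
#   top_kmers 15 10 /tmp/my_seqs.sample.fasta
```

If a handful of k-mers occur in a large fraction of the sampled sequences, that matches the picture of many sequences sharing the same ancestral regions.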

milot-mirdita commented 6 months ago

We have observed before that the prefilter can crash when given many very similar sequences. We will have to investigate how to deal with this; for now, we don't have a solution or workaround.

soedinglab / MMseqs2