Open ivandatasci opened 6 months ago
@martin-steinegger
* thread #5, stop reason = EXC_BAD_ACCESS (code=1, address=0x5a7684002)
frame #0: 0x0000000100169b58 mmseqs`CacheFriendlyOperations<2u>::findDuplicates(this=0x0000600000c08090, output=0x00000005a72a2336, outputSize=580749, computeTotalScore=true) at CacheFriendlyOperations.cpp:229:50
226 const unsigned int element = tmpElementBuffer[n].id;
227 const unsigned int hashBinElement = element >> (MASK_0_5_BIT);
228 output[doubleElementCount].id = element;
-> 229 output[doubleElementCount].count = duplicateBitArray[hashBinElement];
230 output[doubleElementCount].diagonal = tmpElementBuffer[n].diagonal;
231
(lldb) p hashBinElement
(const unsigned int) 742456
(lldb) p duplicateBitArray
(unsigned char *) 0x00000005b8008000 ""
(lldb) p doubleElementCount
(size_t) 581514
(lldb) p duplicateBitArray
(unsigned char *) 0x00000005b8008000 ""
(lldb) p output[doubleElementCount]
error: Couldn't apply expression side effects : Couldn't dematerialize a result variable: couldn't read its memory
(lldb) p output
(CounterResult *) 0x00000005a72a2336
(lldb) p duplicateBitArray[hashBinElement]
(unsigned char) '\x01'
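A quick sanity check on the numbers in this session: `doubleElementCount` (581514) is larger than `outputSize` (580749), so if `outputSize` is the capacity of `output`, as its name suggests, the write on line 229 lands past the end of the buffer. The sketch below redoes the address arithmetic. The `CounterResultSketch` layout (packed, 7 bytes, `count` at offset 6) is an assumption, since the struct definition is not shown in the trace, and all helper names are made up for illustration. Under that assumption, the address of `output[doubleElementCount].count` comes out equal to the faulting address 0x5a7684002.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUMED layout of CounterResult (not shown in the trace): a packed
// 7-byte record. The packed attribute is GCC/Clang-specific.
struct __attribute__((__packed__)) CounterResultSketch {
    std::uint32_t id;        // 4 bytes
    std::uint16_t diagonal;  // 2 bytes
    std::uint8_t count;      // 1 byte
};
static_assert(sizeof(CounterResultSketch) == 7,
              "sketch assumes a packed 7-byte record");

// Values copied from the lldb session.
constexpr std::uint64_t kOutputBase = 0x00000005a72a2336ULL;
constexpr std::uint64_t kOutputSize = 580749;
constexpr std::uint64_t kDoubleElementCount = 581514;
constexpr std::uint64_t kFaultAddress = 0x5a7684002ULL;

// Address of output[index].count under the assumed layout.
constexpr std::uint64_t countFieldAddress(std::uint64_t base, std::uint64_t index) {
    return base + index * sizeof(CounterResultSketch)
         + offsetof(CounterResultSketch, count);
}

// True when index is a valid position in a buffer of `size` elements.
constexpr bool inBounds(std::uint64_t index, std::uint64_t size) {
    return index < size;
}

// Re-checks the trace: the index is out of bounds, and the computed
// address of the count field equals the reported fault address.
inline void checkTraceArithmetic() {
    assert(!inBounds(kDoubleElementCount, kOutputSize));
    assert(countFieldAddress(kOutputBase, kDoubleElementCount) == kFaultAddress);
}
```

If the assumed layout is right, the index overshoots `outputSize` by 765 elements, which would explain why `p output[doubleElementCount]` cannot read memory while `p duplicateBitArray[hashBinElement]` succeeds.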
Also interesting: there are a lot of overrepresented k-mers (same prefix/suffix?)
Query database size: 3083342 type: Nucleotide
Estimated memory consumption: 12G
Target database size: 1541671 type: Nucleotide
Index table k-mer threshold: 0 at k-mer size 15
Index table: counting k-mers
[=================================================================] 100.00% 1.54M 2m 38s 193ms
Index table: Masked residues: 141067
Index table: fill
[=================================================================] 100.00% 1.54M 1m 10s 152ms
Index statistics
Entries: 516344842
DB size: 11146 MB
Avg k-mer size: 0.480884
Top 10 k-mers
GGGCTCAGGATTCTG 1282098
CTGCTCTGGGCGCGT 1167098
TGAGCTGGGCATGAG 1134437
AAGTTCCTCACTCGG 1086133
CTGTAAGCTGCTCGT 966085
AGCTACATTGATCGC 943599
CAGCGACACTGCTTG 913837
CCTCGCACGCCTGAG 883990
CCTCTGCACTCGCTG 827574
GAGCTGGAAGCTGGT 791516
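To put these statistics in perspective, here is a small sketch of the arithmetic. It assumes that the reported "Avg k-mer size" is the number of index entries divided by the number of possible nucleotide k-mers (4^15); the log does not say so, but the numbers are consistent with that reading. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of distinct nucleotide k-mers over {A, C, G, T}: 4^k = 2^(2k).
constexpr std::uint64_t kmerSpace(unsigned k) {
    return 1ULL << (2 * k);
}

// Average number of index entries per possible k-mer.
constexpr double averageEntriesPerKmer(std::uint64_t entries, unsigned k) {
    return static_cast<double>(entries) / static_cast<double>(kmerSpace(k));
}

// Occurrences of one k-mer divided by the number of target sequences.
// A k-mer may occur more than once in a sequence, so this is an
// average per sequence, not the fraction of sequences containing it.
constexpr double occurrencesPerSequence(std::uint64_t occurrences,
                                        std::uint64_t sequences) {
    return static_cast<double>(occurrences) / static_cast<double>(sequences);
}

// Values copied from the log above.
inline void checkIndexStatistics() {
    const double avg = averageEntriesPerKmer(516344842ULL, 15);
    assert(avg > 0.48088 && avg < 0.48089);  // log reports 0.480884
    const double top = occurrencesPerSequence(1282098ULL, 1541671ULL);
    assert(top > 0.83 && top < 0.84);        // top k-mer, GGGCTCAGGATTCTG
}
```

On this reading, the most frequent k-mer occurs roughly 0.83 times per target sequence on average, while a typical k-mer slot holds fewer than one entry, which fits the observation that a few k-mers are heavily overrepresented.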
@milot-mirdita
That is correct: these millions of sequences are derived from a small set of common ancestor sequences. In short, they are very similar to one another in some portions.
We have observed before that the prefilter can crash when given many very similar sequences. We will have to investigate how to deal with this; for now, we don't have a solution or workaround.
Expected Behavior
easy-cluster should finish execution without errors
Current Behavior
mmseqs easy-cluster errors and crashes with:
Steps to Reproduce (for bugs)
a) Get the input sequences, which I have split into 3 files to fit within GitHub's upload limits:
my_seqs.1of3.fasta.gz my_seqs.2of3.fasta.gz my_seqs.3of3.fasta.gz
b) Consolidate the 3 chunks:
c) Execute and expose the bug:
and the bug is shown below
MMseqs Output (for bugs)
Context
In my hands, this bug appears only when the number of nucleotide sequences is on the order of millions. For small sets (thousands), execution completes uneventfully. I have tried the precompiled AVX2 version, the SSE4.1 version, my own compilation of the latest release (15-6f452, Oct 31 2023), the latest GitHub version (f6c9880), and other variations. All attempts led to the exact same bug.
I have also tried three other input datasets. All four crash in the same way. All four contain on the order of 3 to 4 million nucleotide sequences.
When I subset the sequences to about 200K sequences, easy-cluster runs to completion.
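For anyone who wants to reproduce the size dependence, here is a minimal sketch of one way to cut a FASTA stream down to its first N records. How the subset above was actually made (first records, random sample, or something else) is not stated, so taking the first N is an assumption, and `copyFirstFastaRecords` is a made-up helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copies the first maxRecords FASTA records from `in` to `out` and
// returns the number of records copied. A record starts at a line
// beginning with '>' and runs until the next such line.
inline std::size_t copyFirstFastaRecords(std::istream &in, std::ostream &out,
                                         std::size_t maxRecords) {
    std::size_t records = 0;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line[0] == '>') {
            if (records == maxRecords) {
                break;  // the next record would exceed the limit
            }
            ++records;
        }
        if (records > 0) {
            out << line << '\n';  // lines before the first header are skipped
        }
    }
    return records;
}

// Small in-memory demonstration with three records, keeping two.
inline void demoCopyFirstFastaRecords() {
    std::istringstream in(">a\nACGT\n>b\nGG\nTT\n>c\nAA\n");
    std::ostringstream out;
    const std::size_t copied = copyFirstFastaRecords(in, out, 2);
    assert(copied == 2);
    assert(out.str() == ">a\nACGT\n>b\nGG\nTT\n");
}
```

Running the same helper with increasing values of N (for example with `std::ifstream` and `std::ofstream` on the consolidated file) would be one way to narrow down the input size at which the crash first appears.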
Your Environment
I am running this on an AWS EC2 instance of type g4dn (128GB RAM). Here you go: