Open ktmeaton opened 1 week ago
The default k-mer size for nucleotides is 15, which indeed requires more than 8GB of RAM. You can reduce the k-mer size to 13 (`-k 13`) so that the k-mer data structures fit in less than 4GB of RAM.

Additionally (unrelated to memory use), I recommend disabling spaced k-mers for nucleotides (`--spaced-kmer-mode 0`). This relates to a sensitivity issue we have discovered; we are currently reworking it.
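For intuition, here is a rough back-of-envelope sketch of why `k` dominates memory: a nucleotide k-mer index has up to 4^k possible k-mers. The ~8 bytes per table entry below is an illustrative assumption, not MMseqs2's exact internal layout.

```python
def kmer_table_gib(k: int, bytes_per_entry: int = 8) -> float:
    """Approximate size of a dense nucleotide k-mer table in GiB.

    Assumes one entry per possible k-mer (4**k of them) at an
    illustrative 8 bytes each -- not MMseqs2's real data structure.
    """
    return 4**k * bytes_per_entry / 2**30

print(f"k=15: ~{kmer_table_gib(15):.1f} GiB")  # ~8.0 GiB
print(f"k=13: ~{kmer_table_gib(13):.1f} GiB")  # ~0.5 GiB
```

Under this toy model, dropping from k=15 to k=13 shrinks the table by a factor of 16, which lines up with the 8GB-vs-4GB figures above.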
Thank you so much for your help! I understand now what's happening with the memory, and `-k 13` fixes it!

```
mmseqs easy-cluster test.txt mmseqs tmp --split-memory-limit 2G --threads 1 -k 13
```
Regarding `--spaced-kmer-mode 0`, I'm finding that setting is fragmenting my clusters. I wonder if this is at all related to #489?

In this example data, the following 4 sequences are identical except at positions 10 and 20: seq1 has T and A there, while the others have A and T.
```
>seq1
CGACGTCAGTGCAGTCGCTAACGTGGCAG
>seq2
CGACGTCAGAGCAGTCGCTTACGTGGCAG
>seq3
CGACGTCAGAGCAGTCGCTTACGTGGCAG
>seq4
CGACGTCAGAGCAGTCGCTTACGTGGCAG
```
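A quick check in plain Python of where the example sequences actually differ (positions reported 1-based):

```python
# Compare seq1 against seq2 from the example FASTA above and report
# the 1-based positions at which they differ.
seq1 = "CGACGTCAGTGCAGTCGCTAACGTGGCAG"
seq2 = "CGACGTCAGAGCAGTCGCTTACGTGGCAG"

mismatches = [i + 1 for i, (a, b) in enumerate(zip(seq1, seq2)) if a != b]
print(mismatches)  # [10, 20]
```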
When I run with `--spaced-kmer-mode 1`, I get the desired clustering result (all are grouped together in one cluster).

```
mmseqs easy-cluster snp_example.fasta spaced_1 tmp --spaced-kmer-mode 1
```

```
# spaced_1_cluster.tsv
seq2 seq2
seq2 seq1
seq2 seq3
seq2 seq4
```
When I run with `--spaced-kmer-mode 0`, the sequences are split into two clusters based on those sequence differences.

```
mmseqs easy-cluster snp_example.fasta spaced_0 tmp --spaced-kmer-mode 0
```

```
# spaced_0_cluster.tsv
seq1 seq1
seq2 seq2
seq2 seq3
seq2 seq4
```
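A toy sketch of why spaced k-mers can tolerate a SNP (the seed patterns below are made up for illustration and are not MMseqs2's actual patterns): a spaced seed only compares its `1` positions, so a mismatch that lands on a `0` ("don't care") position does not break the seed match, whereas it breaks every contiguous k-mer that covers it.

```python
def seed_matches(a: str, b: str, offset: int, pattern: str) -> bool:
    """True if a and b agree at every '1' position of the seed pattern,
    anchored at the given offset. '0' positions are ignored."""
    return all(a[offset + j] == b[offset + j]
               for j, p in enumerate(pattern) if p == "1")

s1 = "CGACGTCAGTGCAGT"  # seq1 prefix from the example above
s2 = "CGACGTCAGAGCAGT"  # seq2 prefix: mismatch at 0-based index 9

contiguous = "1" * 11        # plain contiguous 11-mer
spaced = "11111111101"       # hypothetical spaced seed: don't-care at index 9

print(seed_matches(s1, s2, 0, contiguous))  # False: mismatch inside the 11-mer
print(seed_matches(s1, s2, 0, spaced))      # True: mismatch falls on a '0'
```

This is consistent with the observation above: with spaced seeds on (`--spaced-kmer-mode 1`) the SNP-bearing sequence still shares seeds with the others, while with contiguous k-mers (`--spaced-kmer-mode 0`) it can fail to.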
I can't seem to find any other parameters that will group them all back together again. I am still reading through the manual, but just wanted to document this example data in the meantime.
Expected Behavior
I'm running cluster unit tests on a very small file. I'm trying to control the memory usage with `--split-memory-limit`, but it errors unless I give it at least 9GB of memory. This seems like a very disproportionate amount of memory. I haven't been able to create a test dataset that will run with less than 9G of memory, which suggests to me this might be a bug?

Current Behavior

When I try to cluster a very small number of sequences with less than 9G of memory, I get the error: `Please use a computer with more main memory`. I've tried `easy-cluster` and `createdb` + `cluster`. I've tried running through Docker and conda, and I've tried the latest Docker image from master. So far they all raise this error.

Steps to Reproduce (for bugs)

The following commands raise the error. It can only be fixed by using at least 9G of memory (`--split-memory-limit 9G`).

MMseqs Output (for bugs)
Context
Your Environment
Installed via conda and the Docker image from the GitHub package registry.