soedinglab / MMseqs2

MMseqs2: ultra fast and sensitive search and clustering suite
https://mmseqs.com
MIT License

Error with very small inputs: Please use a computer with more main memory #901

Open ktmeaton opened 1 week ago

ktmeaton commented 1 week ago

Expected Behavior

I'm running cluster unit tests on a very small file. I'm trying to control the memory usage with --split-memory-limit, but it errors unless I give it at least 9GB of memory. This seems like a very disproportionate amount of memory. I haven't been able to create a test dataset that will run with less than 9G of memory, which suggests to me this might be a bug?

Current Behavior

When I try to cluster a very small number of sequences with less than 9G of memory, I get the error: Please use a computer with more main memory.

>seq1
GTTTATTTTCTCCTGTTAAATTGTCAGGCCAGAACGGCCAGTTTTCACGGGGTTCAGATA
>seq2
GTTTATTTTCTCCTGTTAAATTGTCAGGCCAGAACGGCCAGTTTTCACGGGGTTCAGATA
>seq3
TATCTGAACCCCGTGAAAACTGGCCGTTCTGGCCTGACAATTTAACAGGAGAAAATAAAC

I've tried easy-cluster and createdb + cluster. I've tried running through docker and conda, and I've tried the latest docker image from master. So far they all raise this error.

Steps to Reproduce (for bugs)

The following commands raise the error. It can only be fixed by using at least 9G of memory (--split-memory-limit 9G).

# Docker
docker run --rm -v $(pwd):/data ghcr.io/soedinglab/mmseqs2:15-6f452 easy-cluster /data/test.txt /data/mmseqs tmp --split-memory-limit 8G --threads 1

# Conda
micromamba create -n mmseqs2 bioconda::mmseqs2=15.6f452
micromamba run -n mmseqs2 mmseqs easy-cluster test.txt mmseqs tmp --split-memory-limit 8G --threads 1

milot-mirdita commented 5 days ago

The default k-mer size for nucleotides is 15, which indeed requires more than 8GB of RAM. You can reduce the k-mer size to 13 (-k 13) so that the k-mer data structures fit in less than 4GB of RAM.
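
To see why dropping two k-mer positions makes such a difference, here is a rough back-of-the-envelope sketch. It assumes a lookup table with one slot per possible nucleotide k-mer (4^k slots) and a purely illustrative 8 bytes per slot. The real MMseqs2 index layout and its overhead are not described in this thread, so treat the numbers as orders of magnitude only.

```python
# Rough estimate of a direct-addressed k-mer table for a 4-letter alphabet.
# BYTES_PER_SLOT is a hypothetical value chosen for illustration; it is not
# taken from the MMseqs2 source code.
ALPHABET_SIZE = 4
BYTES_PER_SLOT = 8


def kmer_table_slots(k: int) -> int:
    """Number of distinct nucleotide k-mers of length k."""
    return ALPHABET_SIZE ** k


def kmer_table_gib(k: int, bytes_per_slot: int = BYTES_PER_SLOT) -> float:
    """Approximate table size in GiB under the assumed slot width."""
    return kmer_table_slots(k) * bytes_per_slot / 2**30


for k in (13, 15):
    print(f"k={k}: {kmer_table_slots(k):,} slots, ~{kmer_table_gib(k):.1f} GiB")
```

Each step down in k shrinks the number of possible k-mers by a factor of 4, so going from 15 to 13 is a 16-fold reduction regardless of the assumed slot width.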

Additionally (unrelated to memory use), I recommend disabling spaced k-mers for nucleotides (--spaced-kmer-mode 0). We have discovered a sensitivity issue with them and are currently reworking this.

ktmeaton commented 3 days ago

Memory

Thank you so much for your help! I understand now what's happening with the memory, and -k 13 fixes it!

mmseqs easy-cluster test.txt mmseqs tmp --split-memory-limit 2G --threads 1 -k 13

Spaced Kmers

Regarding --spaced-kmer-mode 0, I'm finding that setting is fragmenting my clusters. I wonder if this is at all related to #489?

In this example data, seq2, seq3, and seq4 are identical, and seq1 differs from them at position 10 (T instead of A). As pasted here, seq1 also differs at position 20 (A instead of T).

>seq1
CGACGTCAGTGCAGTCGCTAACGTGGCAG
>seq2
CGACGTCAGAGCAGTCGCTTACGTGGCAG
>seq3
CGACGTCAGAGCAGTCGCTTACGTGGCAG
>seq4
CGACGTCAGAGCAGTCGCTTACGTGGCAG
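
A quick way to check how these sequences relate at the k-mer level is to compare them directly. The sketch below is plain Python, not MMseqs2 code, and the helper names are made up for illustration. It lists the 1-based positions where seq1 and seq2 differ as pasted above, and counts the contiguous k-mers they share exactly for k=15, the nucleotide default mentioned earlier in this thread.

```python
# Sequences copied verbatim from the example above.
seq1 = "CGACGTCAGTGCAGTCGCTAACGTGGCAG"
seq2 = "CGACGTCAGAGCAGTCGCTTACGTGGCAG"


def mismatch_positions(a: str, b: str) -> list[int]:
    """1-based positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


def contiguous_kmers(seq: str, k: int) -> set[str]:
    """All distinct contiguous substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a: str, b: str, k: int) -> set[str]:
    """Contiguous k-mers that occur in both sequences."""
    return contiguous_kmers(a, k) & contiguous_kmers(b, k)


print(mismatch_positions(seq1, seq2))      # [10, 20]
print(len(shared_kmers(seq1, seq2, 15)))   # 0
```

As pasted, the two sequences differ at positions 10 and 20, so no contiguous 15-mer is common to both. That may be part of why contiguous k-mers fail to link seq1 to the others, but this is an assumption, not something confirmed in this thread.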

When I run with --spaced-kmer-mode 1, I get the desired clustering result (all are grouped together in one cluster).

mmseqs easy-cluster snp_example.fasta spaced_1 tmp --spaced-kmer-mode 1

# spaced_1_cluster.tsv
seq2    seq2
seq2    seq1
seq2    seq3
seq2    seq4

When I run with --spaced-kmer-mode 0, the sequences are split into two clusters based on that SNP.

mmseqs easy-cluster snp_example.fasta spaced_0 tmp --spaced-kmer-mode 0

# spaced_0_cluster.tsv
seq1    seq1
seq2    seq2
seq2    seq3
seq2    seq4
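
For comparing runs like these, it can help to group the two-column cluster output (representative, then member) by representative. This is a minimal sketch in plain Python with the two results above pasted in as strings. It splits on any whitespace, which assumes sequence IDs contain no spaces.

```python
from collections import defaultdict

# Cluster tables copied from the two runs above (representative, member).
SPACED_1 = """\
seq2 seq2
seq2 seq1
seq2 seq3
seq2 seq4
"""

SPACED_0 = """\
seq1 seq1
seq2 seq2
seq2 seq3
seq2 seq4
"""


def group_clusters(tsv_text: str) -> dict[str, list[str]]:
    """Map each representative to the list of its members, in file order."""
    clusters: dict[str, list[str]] = defaultdict(list)
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        representative, member = line.split()
        clusters[representative].append(member)
    return dict(clusters)


for name, text in (("spaced_1", SPACED_1), ("spaced_0", SPACED_0)):
    clusters = group_clusters(text)
    print(name, len(clusters), "cluster(s):", clusters)
```

To run it on a real result, the text of a file such as spaced_0_cluster.tsv can be read and passed to group_clusters instead of the pasted strings.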

I can't seem to find any other parameters that will group them all back together again. I am still reading through the manual, but just wanted to document this example data in the meantime.