ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

map overflow - large reference db #285

Closed: alejandrorgijon closed this issue 1 year ago

alejandrorgijon commented 1 year ago

Hi! I am creating an index for a 62 GB reference file containing >20k genomes. I am running the job on Expanse as follows:

#!/bin/bash

#SBATCH -p shared
#SBATCH -A get144 
#SBATCH -J Index
#SBATCH --mail-user alejandro.gijon@su.se
#SBATCH -o Index.out
#SBATCH -e Index.err
#SBATCH -t 20:00:00
#SBATCH --mem=0
#SBATCH -N 1
#SBATCH -n 90

conda activate strobealign
strobealign --create-index all_genomes_rep.fna -r 150 --threads 90

However, indexing fails with a map overflow error:

This is strobealign 0.9.0
Time reading reference: 105.52 s
Reference size: 65222.60 Mbp (4989301 contigs; largest: 8.91 Mbp)
Indexing ...
strobealign: robin_hood::map overflow

Do you have a hint for me how to make it work?

Thanks a lot for your support, Alejandro Rodríguez-Gijón

marcelm commented 1 year ago

Hi, the issue should be fixed now, and it should be possible to use strobealign with your 62 Gbp reference (if you compile from source or wait for the next release). Memory usage was also reduced considerably, but note that there is still a factor of 5 between the size of the reference and the amount of RAM required, so in your case you would need approximately 300 GB of RAM. This will likely be reduced further in future strobealign versions.
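
For anyone sizing a job for a different reference, the rule of thumb above can be sketched as a small shell helper. This is only a back-of-the-envelope estimate: `estimate_ram_gb` is a hypothetical function, not part of strobealign, and the default factor of 5 is just the figure quoted in this comment, which may change between versions.

```shell
#!/bin/bash
# Rough RAM estimate for indexing: reference size (Gbp) times a fixed factor.
# estimate_ram_gb is a hypothetical helper, not a strobealign command.
estimate_ram_gb() {
    local ref_gbp="$1"       # reference size in Gbp
    local factor="${2:-5}"   # RAM-to-reference ratio; 5 is the figure quoted in this thread
    # awk handles fractional sizes; the result is rounded to a whole number of GB
    awk -v s="$ref_gbp" -v f="$factor" 'BEGIN { printf "%.0f\n", s * f }'
}

estimate_ram_gb 62      # the 62 Gbp figure used in this thread
estimate_ram_gb 65.2    # the size strobealign reported (65222.60 Mbp)
```

For the 62 Gbp case this prints 310, in line with the approximately 300 GB mentioned above. The second argument can be lowered if a later release reduces the ratio.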

alejandrorgijon commented 1 year ago

It worked now (and very fast), thanks!