BWA indexing of human genome very long runtime

dkioroglou commented 3 years ago

Issue

Based on the BWA documentation (source):

Indexing the human genome sequences takes 3 hours with bwtsw algorithm.

However, on our machine, BWA indexing of the human genome has not completed even after 2 days of runtime.

Before committing to even longer runtimes, could you please tell me whether this behaviour is expected?

Commands related to issue

I have tried the following commands for indexing:

bwa index $REFERENCE

bwa index -a bwtsw $REFERENCE

I have tried the reference both in its .gz compressed form and uncompressed.

Each command was executed by SLURM with the following options:

--ntasks=1
--cpus-per-task=1
--mem-per-cpu=60G

General info

BWA version:

0.7.17-r1188

BWA installation:

conda install -c bioconda bwa

Unmasked human reference used:

Homo_sapiens.GRCh38.dna.toplevel.fa.gz
file size: 1.1GB (compressed)

Operating system:

CentOS 7

dkioroglou commented 3 years ago

Update

The following command:

bwa index -a bwtsw -b 375000000 $REFERENCE

indexes the human genome in 19 hours, with RAM usage topping out at 100GB.

Although I could close the issue, I would like to keep it open, as I'm curious to know under what conditions the following lines of the bwa documentation hold true:

Indexing the human genome sequences takes 3 hours with bwtsw algorithm.
With bwtsw algorithm, 5GB memory is required for indexing the complete human genome sequences.

lh3 commented 3 years ago

Never ever use human toplevel fasta files. See http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use.
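
For anyone landing here with the same symptom: one quick check before indexing is to count how many records and how many total bases the FASTA actually contains, since that is what drives indexing time and memory. The snippet below is only a rough sketch. `fasta_stats` is a hypothetical helper name (not part of BWA), and it assumes a standard awk is available.

```shell
# Hypothetical helper (not part of BWA): read a FASTA on stdin and print
# "<number of sequences> <total bases>".
fasta_stats() {
    awk '
        /^>/ { n++; next }            # header line: count one more sequence
        { bases += length($0) }       # sequence line: add its length
        END { printf "%d %.0f\n", n, bases }
    '
}

# Usage with the file from the report above (assumes zcat is installed):
#   zcat Homo_sapiens.GRCh38.dna.toplevel.fa.gz | fasta_stats
```

Running the same check on a primary-assembly FASTA and comparing the two totals should show how much extra sequence the toplevel file asks bwa to index.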
