lskatz / mashtree

:deciduous_tree: Create a tree using Mash distances
GNU General Public License v3.0

Bootstrapping requires a lot of resources? #81

Open goeckeritz opened 1 year ago

goeckeritz commented 1 year ago

Hi Lee,

Thanks so much for making this handy tool. I'm trying to use it to calculate genetic differences for some very heterozygous plant species. I am running it in accurate mode (--mindepth 0) with 100 bootstrap replicates, but it seems to be taking quite a long time. So I wanted to do a sanity check and ask whether I'm fundamentally misunderstanding how the bootstrapping is done.

According to the log file, mashtree successfully identifies the valleys of very rare kmers of several sizes and is able to calculate the distances within about five hours (at least 48 threads, 4G - 8G per thread). But when it gets to bootstrapping... it takes more than a day or two. In fact I haven't seen it finish yet. I guess I was surprised by this because I thought the sketches were created first and random subsampling then occurs with the already-made sketches to create the bootstrapped tree; I wasn't expecting this process to be especially CPU time intensive.

For some context, I'm using the conda-installed version 1.4.5: https://anaconda.org/bioconda/mashtree When I initially installed it, the mashtree_bootstrap.pl script complained that List::MoreUtils was not installed, so I installed that in the same environment with this conda recipe: https://anaconda.org/bioconda/perl-list-moreutils

I mention this because it seems bootstrapping does its multi-threading using perl, and I wonder if there is an issue in my installation. I know you don't oversee the conda or docker installations but I think if I could at least understand the fundamentals of how it is bootstrapping that may give me an idea on how to fix/handle this.

I'm identifying kmers in about 250Gb of compressed short read (Illumina) fastq.gz data. Here's the script I'm running, in case that's helpful:

#!/bin/sh --login
#SBATCH -J case0
#SBATCH --nodes=1
#SBATCH --ntasks=52
#SBATCH --mem-per-cpu=8g
#SBATCH --time=48:00:00
#SBATCH -o /mnt/scratch/goeckeri/mashtree/mashtree_original_cov_per_hap%j

module purge
conda activate mashtree

cd /mnt/scratch/goeckeri/mashtree/

mashtree_bootstrap.pl --outmatrix case0_dist --reps 100 --numcpus 52 --file-of-files case0_files.txt -- --sort-order random \
  --genomesize 750000000 --mindepth 0 --kmerlength 25 --sketch-size 10000 > case0_bs_tree.dnd

Thanks so much for your time -- I appreciate any help/coaching you might be able to give!

Kindly, Charity

lskatz commented 1 year ago

Hi, I am sorry for the frustration. I don't immediately see anything wrong with how you are running it. Your command looks good and it looks like it's in the right framework. That said, it is possible that you are hitting some disk I/O bandwidth issues if you are running 52 CPUs. Even though you are on the scratch drive and even though the way it runs is embarrassingly parallel, you could be maxing out how much your disk can handle. I would recommend seeing what happens if you run it with 8 CPUs and/or if you can set --tempdir /dev/shm (if you have enough RAM).

Let me know how it goes.
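
For reference, here is a minimal sketch of the rerun suggested above: the same command as in the original post, with --numcpus lowered to 8 and --tempdir pointed at /dev/shm. It only prints the command (a dry run) and reports free space in the temp directory. The names NUMCPUS, TEMPDIR, avail_kb and build_cmd are made up for illustration, and placing --tempdir before the `--` separator is an assumption; check `mashtree_bootstrap.pl --help` for where the option belongs in your version.

```shell
#!/bin/sh
# Hypothetical dry-run helper; variable and function names are illustrative only.
NUMCPUS=8          # fewer workers, to reduce concurrent disk reads
TEMPDIR=/dev/shm   # RAM-backed temp dir suggested in the reply above

# Print the free space (in KiB) of the filesystem that holds directory $1.
avail_kb() {
  df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Assemble the command from the original post with the two changes applied.
# Assumption: --tempdir is a wrapper option and goes before the "--" separator.
build_cmd() {
  echo "mashtree_bootstrap.pl --outmatrix case0_dist --reps 100" \
       "--numcpus $NUMCPUS --tempdir $TEMPDIR" \
       "--file-of-files case0_files.txt" \
       "-- --sort-order random --genomesize 750000000" \
       "--mindepth 0 --kmerlength 25 --sketch-size 10000"
}

# Report how much room the temp dir has before committing RAM to it.
if [ -d "$TEMPDIR" ]; then
  echo "Free space in $TEMPDIR: $(avail_kb "$TEMPDIR") KiB"
fi

# Dry run: show the command instead of executing it.
echo "Would run:"
build_cmd
```

To actually run it, paste the printed command into the job script and redirect its output as before (`> case0_bs_tree.dnd`). If NUMCPUS is lowered, the `#SBATCH --ntasks` request should probably be lowered to match so the job does not reserve idle cores.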