Hi Lee,

Thanks so much for making this handy tool. I'm using it to calculate genetic differences among some very heterozygous plant species. I'm running it in accurate mode (--mindepth 0) with 100 bootstrap replicates, but it is taking quite a long time, so I wanted to do a sanity check and ask whether I'm fundamentally misunderstanding how the bootstrapping is done.

According to the log file, mashtree successfully identifies the valleys of very rare kmers of several sizes and calculates the distances within about five hours (at least 48 threads, 4G - 8G per thread). But once it gets to bootstrapping, it runs for more than a day or two; in fact, I haven't seen it finish yet. This surprised me because I thought the sketches were created first and the bootstrapped tree was then built by randomly subsampling the already-made sketches, so I wasn't expecting this step to be especially CPU-intensive.
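A rough back-of-envelope sketch of why this matters, under the unconfirmed assumption that each bootstrap replicate repeats a full run instead of reusing the existing sketches. The helper name `estimate_hours` and the concurrency figure of four are hypothetical; only the 100 replicates and the roughly five-hour run time come from the description above.

```shell
#!/bin/sh
# Hypothetical helper: estimated wall-clock hours if every replicate
# repeats a full run (an assumption, not confirmed behaviour of mashtree).
# Arguments: replicates, hours per replicate, replicates running concurrently.
estimate_hours() {
    reps=$1
    hours_per_rep=$2
    concurrent=$3
    # Number of sequential batches, rounded up (integer ceiling division).
    batches=$(( (reps + concurrent - 1) / concurrent ))
    echo $(( batches * hours_per_rep ))
}

# 100 replicates at about 5 hours each, one at a time
estimate_hours 100 5 1
# The same replicates if, hypothetically, four could run side by side
estimate_hours 100 5 4
```

Under that assumption, either figure is well beyond a 48-hour job limit, which would be consistent with the run never finishing.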

For some context, I'm using the conda-installed version 1.4.5 -- https://anaconda.org/bioconda/mashtree. When I first install it, the mashtree_bootstrap.pl script complains that List::MoreUtils is not installed, so I install that module in the same environment with this conda recipe -- https://anaconda.org/bioconda/perl-list-moreutils

I mention this because it seems bootstrapping does its multi-threading using perl, and I wonder if there is an issue in my installation. I know you don't oversee the conda or docker installations but I think if I could at least understand the fundamentals of how it is bootstrapping that may give me an idea on how to fix/handle this.
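One way to rule out the installation is to check, inside the activated conda environment, that the Perl on the PATH can load the relevant modules and was built with thread support. This is a minimal sketch: the function name `perl_has_module` is hypothetical, and checking the `threads` module assumes the bootstrap script relies on Perl interpreter threads, which is not confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the active perl can load the named module.
perl_has_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if ! command -v perl >/dev/null 2>&1; then
    echo "perl not found on PATH"
else
    # Show which perl is being used (it should live inside the conda env)
    command -v perl

    # List::MoreUtils is the module the script complained about;
    # 'threads' is assumed to be what the multi-threading depends on.
    for module in List::MoreUtils threads; do
        if perl_has_module "$module"; then
            echo "$module: OK"
        else
            echo "$module: MISSING"
        fi
    done

    # Print the build-time setting for interpreter threads
    perl -V:useithreads
fi
```

If the reported perl path is outside the environment, or `threads` shows as missing, the slowdown may be an installation problem instead of the bootstrapping method itself.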

I'm identifying kmers in about 250Gb of compressed short read (Illumina) fastq.gz data. Here's the script I'm running, in case that's helpful:

#!/bin/sh --login
#SBATCH -J case0
#SBATCH --nodes=1
#SBATCH --ntasks=52
#SBATCH --mem-per-cpu=8g
#SBATCH --time=48:00:00
#SBATCH -o /mnt/scratch/goeckeri/mashtree/mashtree_original_cov_per_hap%j

module purge
conda activate mashtree

cd /mnt/scratch/goeckeri/mashtree/

mashtree_bootstrap.pl --outmatrix case0_dist --reps 100 --numcpus 52 --file-of-files case0_files.txt -- --sort-order random \
  --genomesize 750000000 --mindepth 0 --kmerlength 25 --sketch-size 10000 > case0_bs_tree.dnd

Thanks so much for your time -- I appreciate any help/coaching you might be able to give!

Kindly, Charity

lskatz / mashtree

Bootstrapping requires a lot of resources? #81