ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers

vCPU Utilization ~33% (and GIAB Concordance Work) #435

iamh2o opened 3 weeks ago

iamh2o commented 3 weeks ago

Ahoy!

I am quite excited to explore in more detail how strobealign behaves in b37 & hg38 across all of the GIAB samples (at least for NOVA reads, others if budget allows).

I have built the tool from source, and am working with:

./resources/strobealign/bin/strobealign --version
0.13.0-25-g3a97f6b

This is running well for me, but not optimally. When I monitor the vCPU utilization with -t set to $(nproc) for the machine I'm using, I can only ever coerce the tool to occupy ~1/3 of the indicated threads/CPUs. Increasing the number wildly has little effect; reducing it has a smaller effect than I'd have expected. Do you have guidance on whether this is expected behavior (and if not, what I might try to boost CPU utilization)?
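For concreteness, the kind of monitoring I mean looks like this (a minimal sketch; assumes sysstat's pidstat is installed, and the process lookup is illustrative):

# Sample strobealign's CPU usage every 5 seconds while it runs.
# pidstat reports multithreaded processes above 100%, so on a
# 128-vCPU box, ~4200% would mean ~42 vCPUs busy (i.e. ~1/3).
pidstat -u -h -p "$(pgrep -n strobealign)" 5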

Even with this inefficiency, I am getting runtimes that are roughly 30% of the paired bwa mem2 runtimes for the same sample. At the moment, I am waiting for DeepVariant and concordance numbers to be generated, which I'll happily share.

I have a few other observations / questions:

That's all. I'm really very impressed with how quickly the docs helped me get rolling, and how well the tool has behaved out of the gate. Thanks again!

John Major

ksahlin commented 3 weeks ago

Hi John,

Thanks for the kind words and feedback! @marcelm and I have a meeting tomorrow and will discuss this.

Best, Kristoffer

iamh2o commented 3 weeks ago

Great!

Add'l Deets

It might be helpful to know that I'm running on 128- and 192-core (>360 GB memory) Intel EC2 instances running Ubuntu 22.04.

GIAB b37 Runtime Data Ready

I should mention that I am not tracking strobealign's runtime in isolation, but rather the combined strobealign -> samtools sort -> samtools view pipeline writing {out.bam, out.bam.bai}.

Results are in for b37 for all 7 GIAB samples, aligning the public NOVA ~30x data. Align+sort+write-BAM averages 29 min (range 27-32 min), which on spot instances costs ~$1.50/sample in compute (pretty awesome); if CPU utilization could improve, this could theoretically clock in at ~$0.50/sample (which is great, and all open source!).
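The rough arithmetic behind those figures (assuming the ~$3.07/hr 192-vCPU spot price detailed below): 29 min ≈ 0.48 hr, and 0.48 hr × $3.07/hr ≈ $1.48/sample. If the ~33% CPU utilization could be pushed toward 100% and runtime shrank proportionally, that would be roughly 10 min per sample, or ≈ $0.50/sample.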

My Full Command

not a fully optimized incantation


# Note: the single-quoted --rg string leaves $epocsec unexpanded;
# double-quote it if shell expansion of $epocsec is intended.
OMP_NUM_THREADS=64 OMP_PROC_BIND=close OMP_PLACES=threads OMP_DYNAMIC=TRUE \
OMP_MAX_ACTIVE_LEVELS=1 OMP_SCHEDULE=dynamic OMP_WAIT_POLICY=ACTIVE \
avx2_compiled/strobealign \
    -t 128 \
    -v \
    --rg '@RG\tID:x_$epocsec\tSM:x\tLB:RIH0_ANA0-HG00119_DBC0_0_presumedNoAmpWGS\tPL:presumedILLUMINA\tPU:presumedCombinedLanes\tCN:CenterName\tPG:strobealigner' \
    --use-index \
    /fsx/data/genomic_data/organism_references/H_sapiens/b37/human_1kg_v37/human_1kg_v37.fasta \
    <(unpigz -c -q -- results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/RIH0_ANA0-HG001-19_DBC0_0.R1.fastq.gz) \
    <(unpigz -c -q -- results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/RIH0_ANA0-HG001-19_DBC0_0.R2.fastq.gz) \
    | samtools sort -l 0 -m 2G -@ 64 -T $tdir -O SAM - \
    | samtools view -b -@ 0 -O BAM --write-index \
        -o results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/align/strobe/RIH0_ANA0-HG001-19_DBC0_0.strobe.sort.bam##idx##results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/align/strobe/RIH0_ANA0-HG001-19_DBC0_0.strobe.sort.bam.bai -

note: paired reads, length 151 bp

Variant Calling w/DeepVariant (in progress)

One sample is complete, and compared with the bwa mem2 calls, the concordance is quite favorable. For the remaining six, I'll drop a table here tomorrow.

iamh2o commented 3 weeks ago

Quite favorable is an understatement, even :-) Given that DeepVariant is certainly biased towards bwa alignments, I'm really excited to see what a bit of optimization and tuning will yield.

marcelm commented 3 weeks ago

Hi, thanks for the feedback!

We’ll be able to say more tomorrow, but here are a couple of comments already (not directly in regard to your main question).

  • I do not believe the -v flag changes verbosity.

It should; it may just not be that noticeable because it doesn’t print that much more info. With -v, for example, you get the build type, whether AVX2 is enabled, and also index statistics. We’d be happy to add more helpful output; are you interested in anything in particular (apart from rate metrics, see below)?

  • The stdout behaviour is very nice for running manually and watching runs; however, the logging output does not play nicely with pipeline wrappers. Specifically, the single-line overwriting behavior will frequently not appear in logs at all, which is frustrating, as I'd like to know the rate metrics presented there. Could this behavior be made togglable: overwriting for manual stdout watching, and plain (non-overwriting) lines for batch output?

Yes, strobealign prints the metrics only when stderr is connected to a terminal. It would make a lot of sense to always print the rate metrics once after all reads have been mapped. Would that be sufficient or would you like to be able to see the rate metrics while strobealign is running and even when stderr is not a terminal? Maybe we could print the metrics once a minute or so if stderr is a file (without the line overwriting behavior).

  • I do not see a flag similar to the -Y which bwa and others provide. I use this flag so that I can save the complete read info in the BAM, which allows me to delete the input FASTQs (because I can regenerate the FASTQs from the BAM as needed, or, more commonly, stream them out to tools directly from the BAM). Both unmapped reads and clipped alignments are needed to do this. Being unable to reconstitute FASTQs from BAM would be a blocker for me in advocating wide adoption, as being able to delete the FASTQs saves a substantial amount of $.

Do you really need -Y for reconstructing the FASTQ? BWA-MEM’s -Y option says: "use soft clipping for supplementary alignments" (I assume the sentence should be continued "... instead of hard-clipping"). There will always be a primary alignment record (one that is neither supplementary nor secondary) that contains the full sequence, so isn’t it the case that -Y just adds redundant info?
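As an illustration of that point, FASTQs can typically be regenerated from the primary records alone, which samtools fastq keeps by default (a sketch, with illustrative file names):

# Group mates together, then emit R1/R2 FASTQs from primary records.
# samtools fastq skips secondary/supplementary records by default, so
# hard-clipped supplementary alignments do not lose any read bases.
samtools collate -u -O input.bam \
    | samtools fastq -n -1 reads.R1.fastq.gz -2 reads.R2.fastq.gz \
          -0 /dev/null -s /dev/null -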

Results are in for b37 for all 7 GIAB samples, aligning the public NOVA ~30x data,

Which dataset is that exactly? I’d like to test running this myself. I’m only aware of the datasets at https://github.com/genome-in-a-bottle/giab_data_indexes.

My Full Command

not a fully optimized incantation

A couple of comments regarding your command:
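One candidate (an assumption on my part; footnote ^1 below mentions a suggested change to the samtools sort and samtools view steps, but the exact suggestion isn't spelled out here) would be to let samtools sort emit the indexed BAM directly and drop the extra samtools view pass:

# Sketch: sort straight to an indexed, compressed BAM, skipping the
# intermediate uncompressed SAM and the separate samtools view pass.
# Paths and thread counts are illustrative placeholders.
strobealign -t 128 --use-index ref.fasta reads.R1.fastq.gz reads.R2.fastq.gz \
    | samtools sort -@ 16 -m 2G -T "$tdir" -O BAM --write-index \
        -o out.strobe.sort.bam##idx##out.strobe.sort.bam.bai -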

iamh2o commented 2 weeks ago

(I will absolutely respond to the open items above, but it might take me a few days to get back to you.)

In the Meantime

Initial Results!

I'm planning to publish these results in some way, and will share the nitty-gritty details when I do. For now, however:

strobealign is running 2.14x faster than bwa mem2.

DeepVariant call concordance from strobealign BAMs is comparable to bwa mem2 call concordance: mem2 wins for recall & F-score; strobealign (^2, ^3) wins for everything else.

Sample Data

The NovaSeq ~30x no-amp WGS 2x150 bp reads for HG001 through HG007. High-confidence regions and variant calls from GIAB/GA4GH.

Reference

b37

Compute Situation

AWS ParallelCluster and a Snakemake pipeline orchestrator (but any orchestrator will do; they all do the same thing). I developed the framework that I use for this kind of work (note: I'm presently tearing it apart and cleaning it up after some time away). For the following results, I effectively used 192-vCPU spot instances (r7i.metal-48xl) throughout because the pricing worked out that way yesterday, at ~$3.07/hr (192-vCPU spot prices can range up to ~$4.50/hr). ParallelCluster dynamically manages all of the instance price optimization and spins instances up / kills them as needed, which is all pretty awesome IMO.

Pipeline Stages

Aligners

Dedup

Variant Calling

Concordance

Detailed Summary Results

| Aligner | F-score | Sensitivity (Recall) | Specificity | FDR | PPV | Precision | Pipeline Runtime (wall, hr) | Cost To Align+Sort ($, spot 192 vCPU) (^1) | Cost To Dedup ($, spot 192 vCPU) | DeepVariant Cost ($, spot 192 vCPU) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| bwa mem2 | 0.996 | 0.994 | 0.9999979257 | 0.0014 | 0.9986 | 0.9986 | 1.78 | 3.67 | 0.64 | 8.89 |
| strobealign | 0.995 | 0.991 | 0.9999980400 | 0.0013 | 0.9987 | 0.9987 | 1.05 | 1.53 | 0.57 | 8.86 |

^1 bwa mem2 uses all 192 cores maximally; strobealign only uses ~33%. This suggests a substantial further speedup is still possible for strobealign. These numbers do not include the suggested change to the samtools sort and samtools view (BAM-writing) steps, which could save additional time.

^2 DeepVariant was trained on alignments largely produced by bwa mem; I imagine that if it were retrained on strobealign data, these numbers could improve significantly for strobealign. It might also be informative to try running Octopus on the strobealign BAMs, but using a strobealign random forest model first.

^3 I have not yet looked at the overlap in membership between the bwa mem2 and strobealign variant call sets. I'm curious to see what the three populations of calls (shared, bwa mem2-only, strobealign-only) look like.

TO DO

1 Further Suggestion

marcelm commented 2 weeks ago

Hi, we have never tested running strobealign on more than 20 cores before and only recently got access to machines with up to 256 cores, so this is a good time to do some measurements.

I ran strobealign on a dataset with 100 million 150bp paired-end reads. The command was

strobealign -t ${threads} CHM13.fa in.1.fastq.gz in.2.fastq.gz > /dev/null

Neither samtools view nor samtools sort is involved, to ensure that these programs are not the bottleneck.
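A minimal harness for this kind of sweep might look as follows (a sketch; assumes GNU time is available at /usr/bin/time and that the input file names match the command above):

# Sweep thread counts; GNU time records wall-clock time and
# "Percent of CPU this job got" for each run. strobealign's own
# log lines land in the same file; the grep pulls out the CPU figure.
for t in 1 2 3 4 6 8 12 16 18 20 24 32 36 48 64; do
    /usr/bin/time -v strobealign -t "$t" CHM13.fa in.1.fastq.gz in.2.fastq.gz \
        > /dev/null 2> "timing_t${t}.log"
done
grep -H "Percent of CPU" timing_t*.log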

| threads (-t) | wall-clock time [s] | CPU time [s] | CPU usage | mapping-only CPU usage |
|---:|---:|---:|---:|---:|
| 1 | 899 | 893 | 1.0 | 1.0 |
| 2 | 461 | 898 | 1.9 | 2.0 |
| 3 | 350 | 998 | 2.9 | 3.0 |
| 4 | 254 | 941 | 3.7 | 4.0 |
| 6 | 181 | 964 | 5.3 | 6.1 |
| 8 | 138 | 948 | 6.9 | 8.1 |
| 12 | 111 | 1056 | 9.5 | 12.3 |
| 16 | 88 | 1055 | 12.0 | 16.3 |
| 18 | 88 | 1214 | 13.7 | 18.7 |
| 20 | 70 | 1007 | 14.3 | 20.4 |
| 24 | 66 | 1024 | 15.5 | 24.1 |
| 32 | 62 | 1253 | 20.2 | 32.3 |
| 36 | 58 | 1208 | 21.0 | 35.6 |
| 48 | 57 | 1168 | 20.6 | 42.0 |
| 64 | 59 | 1197 | 20.3 | 40.0 |

The total wall-clock time includes a part at the beginning where strobealign reads the reference and indexes it; this is not fully parallelized. The CPU usage at that stage is at most around 600% due to the serial bits. This has a significant impact on the average CPU usage because I mapped relatively few reads in this experiment.

To compensate, I added the 'mapping-only CPU usage' column. This is an estimate; the more reads the dataset contains, the closer the overall CPU usage should get to that number.

I think the numbers are quite ok up to about 36 cores.

The problem with more cores is likely that the gzip decompression is just too slow:

marcelm commented 2 weeks ago

And here are some measurements with the following differences:

| threads | wall-clock time [s] | CPU time [s] | CPU usage | est. mapping-only CPU usage |
|---:|---:|---:|---:|---:|
| 16 | 1005 | 15642 | 15.6 | 15.9 |
| 32 | 522 | 15771 | 30.2 | 31.6 |
| 64 | 271 | 15349 | 56.6 | 62.2 |
| 128 | 235 | 16325 | 69.5 | 77.9 |

So my suggestion would be to use igzip. And even then, I guess it would be more cost-effective to only use an instance with at most 64 cores.
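Wiring that in would look much like the unpigz process substitutions in the command earlier in this thread (a sketch; assumes the igzip binary from ISA-L is installed):

# igzip decompresses gzip considerably faster than zlib-based
# tools, which keeps the mapping threads fed.
strobealign -t 64 ref.fasta \
    <(igzip -dc reads.R1.fastq.gz) \
    <(igzip -dc reads.R2.fastq.gz) > /dev/null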

marcelm commented 1 week ago

With the changes from PR #418, the numbers look like this:

| threads | wall-clock time [s] | CPU time [s] | CPU usage | est. mapping-only CPU usage |
|---:|---:|---:|---:|---:|
| 32 | 616.6 | 18900 | 30.70 | 32.10 |
| 64 | 288.5 | 16724 | 57.97 | 63.96 |
| 128 | 168.5 | 17607 | 104.48 | 125.92 |

And with those changes, it may make sense to use an instance with 128 cores (unless adding samtools sort makes the overall CPU usage go down again).

I’m quite happy with this, TBH; this should make strobealign quite usable in settings with many cores.