ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers

vCPU Utilization ~33% (and GIAB Concordance Work) #435

iamh2o opened 3 weeks ago

iamh2o commented 3 weeks ago

Ahoy!

I am quite excited to explore in more detail how strobealign behaves in b37 & hg38 across all of the GIAB samples (at least for NOVA reads, others if budget allows).

I have built the tool from source, and am working with:

./resources/strobealign/bin/strobealign --version
0.13.0-25-g3a97f6b

This is running well for me, but not optimally. When I monitor the vCPU utilization with -t set to $(nproc) for the machine I'm using, I can only ever coerce the tool to occupy ~1/3 of the indicated threads/CPUs. Increasing the number wildly has little effect; reducing it has a smaller effect than I'd have expected. Do you have guidance on whether this is expected behavior (and if not, what I might try to boost CPU utilization)?
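For concreteness, the kind of monitoring I mean looks like this (a minimal sketch; assumes sysstat's pidstat is installed, and the process lookup is illustrative):

# Sample strobealign's CPU usage every 5 seconds while it runs.
# pidstat reports multithreaded processes above 100%, so on a
# 128-vCPU box, ~4200% would mean ~42 vCPUs busy (i.e. ~1/3).
pidstat -u -h -p "$(pgrep -n strobealign)" 5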

Even with this inefficiency, I am getting runtimes that are roughly 30% of the paired bwa mem2 runtimes for the same sample. At the moment, I am waiting for DeepVariant and concordance numbers to be generated, which I'll happily share.

I have a few other observations / questions:

That's all. I'm really very impressed with how quickly the docs helped me get rolling, and how well the tool has behaved out of the gate. Thanks again!

John Major

ksahlin commented 3 weeks ago

Hi John,

Thanks for the kind words and feedback! @marcelm and I have a meeting tomorrow and will discuss this.

Best, Kristoffer

iamh2o commented 3 weeks ago

Great!

Add'l Deets

It might be helpful to know that I'm running on 128- and 192-core (>360 GB memory) Intel EC2 instances running Ubuntu 22.04.

GIAB b37 Runtime Data Ready

I should mention that I am not tracking strobealign's runtime in isolation, but rather the combined strobealign -> samtools sort -> samtools view pipeline writing {out.bam, out.bam.bai}.

Results are in for b37 for all 7 GIAB samples, aligning the public NOVA ~30x data. Align+sort+write-BAM averages 29 min (range 27-32 min), which on spot instances costs ~$1.50/sample in compute (pretty awesome); if CPU utilization could improve, this could theoretically clock in at ~$0.50/sample (which is great, and all open source!).
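The rough arithmetic behind those figures (assuming the ~$3.07/hr 192-vCPU spot price detailed below): 29 min ≈ 0.48 hr, and 0.48 hr × $3.07/hr ≈ $1.48/sample. If the ~33% CPU utilization could be pushed toward 100% and runtime shrank proportionally, that would be roughly 10 min per sample, or ≈ $0.50/sample.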

My Full Command

not a fully optimized incantation


# Note: the single-quoted --rg string leaves $epocsec unexpanded;
# double-quote it if shell expansion of $epocsec is intended.
OMP_NUM_THREADS=64 OMP_PROC_BIND=close OMP_PLACES=threads OMP_DYNAMIC=TRUE \
OMP_MAX_ACTIVE_LEVELS=1 OMP_SCHEDULE=dynamic OMP_WAIT_POLICY=ACTIVE \
avx2_compiled/strobealign \
    -t 128 \
    -v \
    --rg '@RG\tID:x_$epocsec\tSM:x\tLB:RIH0_ANA0-HG00119_DBC0_0_presumedNoAmpWGS\tPL:presumedILLUMINA\tPU:presumedCombinedLanes\tCN:CenterName\tPG:strobealigner' \
    --use-index \
    /fsx/data/genomic_data/organism_references/H_sapiens/b37/human_1kg_v37/human_1kg_v37.fasta \
    <(unpigz -c -q -- results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/RIH0_ANA0-HG001-19_DBC0_0.R1.fastq.gz) \
    <(unpigz -c -q -- results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/RIH0_ANA0-HG001-19_DBC0_0.R2.fastq.gz) \
    | samtools sort -l 0 -m 2G -@ 64 -T $tdir -O SAM - \
    | samtools view -b -@ 0 -O BAM --write-index \
        -o results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/align/strobe/RIH0_ANA0-HG001-19_DBC0_0.strobe.sort.bam##idx##results/day/b37/RIH0_ANA0-HG001-19_DBC0_0/align/strobe/RIH0_ANA0-HG001-19_DBC0_0.strobe.sort.bam.bai -

note: paired reads, length 151 bp

Variant Calling w/DeepVariant (in progress)

One sample is complete, and compared with the bwa mem2 calls, the concordance is quite favorable. For the remaining six, I'll drop a table here tomorrow.

iamh2o commented 3 weeks ago

Quite favorable is an understatement, even :-) Given that DeepVariant is certainly biased towards bwa alignments, I'm really excited to see what a bit of optimization and tuning will yield.

marcelm commented 3 weeks ago

Hi, thanks for the feedback!

We’ll be able to say more tomorrow, but here are a couple of comments already (not directly in regard to your main question).

  • I do not believe the -v flag changes verbosity.

It should; it may just not be that noticeable because it doesn’t print that much more info. With -v, for example, you get the build type, whether AVX2 is enabled, and also index statistics. We’d be happy to add more helpful output; are you interested in anything in particular (apart from rate metrics, see below)?

  • The stdout behaviour is very nice for running manually and watching runs; however, the logging output does not play nicely with pipeline wrappers. Specifically, the single-line overwriting behavior will frequently not appear in logs at all, which is frustrating, as I'd like to know the rate metrics presented there. Could this behavior be made togglable: overwriting for manual stdout watching, and plain (non-overwriting) lines for batch output?

Yes, strobealign prints the metrics only when stderr is connected to a terminal. It would make a lot of sense to always print the rate metrics once after all reads have been mapped. Would that be sufficient or would you like to be able to see the rate metrics while strobealign is running and even when stderr is not a terminal? Maybe we could print the metrics once a minute or so if stderr is a file (without the line overwriting behavior).

  • I do not see a flag similar to the -Y which bwa and others provide. I use this flag so that I can save the complete read info in the BAM, which allows me to delete the input FASTQs (because I can regenerate the FASTQs from the BAM as needed, or, more commonly, stream them out to tools directly from the BAM). Both unmapped reads and clipped alignments are needed to do this. Being unable to reconstitute FASTQs from BAM would be a blocker for me in advocating wide adoption, as being able to delete the FASTQs saves a substantial amount of $.

Do you really need -Y for reconstructing the FASTQ? BWA-MEM’s -Y option says: "use soft clipping for supplementary alignments" (I assume the sentence should be continued "... instead of hard-clipping"). There will always be a primary alignment record (one that is neither supplementary nor secondary) that contains the full sequence, so isn’t it the case that -Y just adds redundant info?
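As an illustration of that point, FASTQs can typically be regenerated from the primary records alone, which samtools fastq keeps by default (a sketch, with illustrative file names):

# Group mates together, then emit R1/R2 FASTQs from primary records.
# samtools fastq skips secondary/supplementary records by default, so
# hard-clipped supplementary alignments do not lose any read bases.
samtools collate -u -O input.bam \
    | samtools fastq -n -1 reads.R1.fastq.gz -2 reads.R2.fastq.gz \
          -0 /dev/null -s /dev/null -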

Results are in for b37 for all 7 GIAB samples, aligning the public NOVA ~30x data,

Which dataset is that exactly? I’d like to test running this myself. I’m only aware of the datasets at https://github.com/genome-in-a-bottle/giab_data_indexes.

My Full Command

not a fully optimized incantation

A couple of comments regarding your command:
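One candidate (an assumption on my part; footnote ^1 below mentions a suggested change to the samtools sort and samtools view steps, but the exact suggestion isn't spelled out here) would be to let samtools sort emit the indexed BAM directly and drop the extra samtools view pass:

# Sketch: sort straight to an indexed, compressed BAM, skipping the
# intermediate uncompressed SAM and the separate samtools view pass.
# Paths and thread counts are illustrative placeholders.
strobealign -t 128 --use-index ref.fasta reads.R1.fastq.gz reads.R2.fastq.gz \
    | samtools sort -@ 16 -m 2G -T "$tdir" -O BAM --write-index \
        -o out.strobe.sort.bam##idx##out.strobe.sort.bam.bai -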

iamh2o commented 2 weeks ago

(I will absolutely respond to the open items above, but it might take me a few days to get back to you.)

In the Meantime

Initial Results!

I'm planning to publish these results in some way, and will share the nitty-gritty details when I do. For now, however:

strobealign is running 2.14x faster than bwa mem2.

DeepVariant call concordance from strobealign BAMs is comparable to bwa mem2 call concordance: mem2 wins for recall & F-score; strobealign (^2, ^3) wins for everything else.

Sample Data

The NovaSeq ~30x no-amp WGS 2x150 bp reads for HG001 through HG007. High-confidence regions and variant calls from GIAB/GA4GH.

Reference

b37

Compute Situation

AWS ParallelCluster and a Snakemake pipeline orchestrator (but any orchestrator will do; they all do the same thing). I developed the framework that I use for this kind of work (note: I'm presently tearing it apart and cleaning it up after some time away). For the following results, I effectively used 192-vCPU spot instances (r7i.metal-48xl) throughout because the pricing worked out that way yesterday, at ~$3.07/hr (192-vCPU spot prices can range up to ~$4.50/hr). ParallelCluster dynamically manages all of the instance price optimization and spins instances up / kills them as needed, which is all pretty awesome IMO.

Pipeline Stages

Aligners

Dedup

Variant Calling

Concordance

Detailed Summary Results

| Aligner | F-score | Sensitivity (Recall) | Specificity | FDR | PPV | Precision | Pipeline Runtime (wall, hr) | Cost To Align+Sort ($, spot 192 vCPU) (^1) | Cost To Dedup ($, spot 192 vCPU) | DeepVariant Cost ($, spot 192 vCPU) |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| bwa mem2 | 0.996 | 0.994 | 0.9999979257 | 0.0014 | 0.9986 | 0.9986 | 1.78 | 3.67 | 0.64 | 8.89 |
| strobealign | 0.995 | 0.991 | 0.9999980400 | 0.0013 | 0.9987 | 0.9987 | 1.05 | 1.53 | 0.57 | 8.86 |

^1 bwa mem2 uses all 192 cores maximally; strobealign only uses ~33%. This suggests a substantial further speedup is still possible for strobealign. These numbers do not include the suggested change to the samtools sort and samtools view (BAM-writing) steps, which could save additional time.

^2 DeepVariant was trained on alignments largely produced by bwa mem; I imagine that if it were retrained on strobealign data, these numbers could improve significantly for strobealign. It might also be informative to try running Octopus on the strobealign BAMs, but using a strobealign random forest model first.

^3 I have not yet looked at the overlap in membership between the bwa mem2 and strobealign variant call sets. I'm curious to see what the three populations of calls (shared, bwa mem2-only, strobealign-only) look like.

TO DO

1 Further Suggestion

marcelm commented 2 weeks ago

Hi, we have never tested running strobealign on more than 20 cores before and only recently got access to machines with up to 256 cores, so this is a good time to do some measurements.

I ran strobealign on a dataset with 100 million 150bp paired-end reads. The command was

strobealign -t ${threads} CHM13.fa in.1.fastq.gz in.2.fastq.gz > /dev/null

Neither samtools view nor samtools sort is involved, to ensure that these programs are not the bottleneck.
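A minimal harness for this kind of sweep might look as follows (a sketch; assumes GNU time is available at /usr/bin/time and that the input file names match the command above):

# Sweep thread counts; GNU time records wall-clock time and
# "Percent of CPU this job got" for each run. strobealign's own
# log lines land in the same file; the grep pulls out the CPU figure.
for t in 1 2 3 4 6 8 12 16 18 20 24 32 36 48 64; do
    /usr/bin/time -v strobealign -t "$t" CHM13.fa in.1.fastq.gz in.2.fastq.gz \
        > /dev/null 2> "timing_t${t}.log"
done
grep -H "Percent of CPU" timing_t*.log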

| threads (-t) | wall-clock time [s] | CPU time [s] | CPU usage | mapping-only CPU usage |
|---:|---:|---:|---:|---:|
| 1 | 899 | 893 | 1.0 | 1.0 |
| 2 | 461 | 898 | 1.9 | 2.0 |
| 3 | 350 | 998 | 2.9 | 3.0 |
| 4 | 254 | 941 | 3.7 | 4.0 |
| 6 | 181 | 964 | 5.3 | 6.1 |
| 8 | 138 | 948 | 6.9 | 8.1 |
| 12 | 111 | 1056 | 9.5 | 12.3 |
| 16 | 88 | 1055 | 12.0 | 16.3 |
| 18 | 88 | 1214 | 13.7 | 18.7 |
| 20 | 70 | 1007 | 14.3 | 20.4 |
| 24 | 66 | 1024 | 15.5 | 24.1 |
| 32 | 62 | 1253 | 20.2 | 32.3 |
| 36 | 58 | 1208 | 21.0 | 35.6 |
| 48 | 57 | 1168 | 20.6 | 42.0 |
| 64 | 59 | 1197 | 20.3 | 40.0 |

The total wall-clock time includes a part at the beginning where strobealign reads the reference and indexes it; this is not fully parallelized. The CPU usage at that stage is at most around 600% due to the serial bits. This has a significant impact on the average CPU usage because I mapped relatively few reads in this experiment.

To compensate, I added the 'mapping-only CPU usage' column. This is an estimate; the more reads the dataset contains, the closer the overall CPU usage should get to that number.

I think the numbers are quite ok up to about 36 cores.

The problem with more cores is likely that the gzip decompression is just too slow:

marcelm commented 2 weeks ago

And here are some measurements with the following differences:

| threads | wall-clock time [s] | CPU time [s] | CPU usage | est. mapping-only CPU usage |
|---:|---:|---:|---:|---:|
| 16 | 1005 | 15642 | 15.6 | 15.9 |
| 32 | 522 | 15771 | 30.2 | 31.6 |
| 64 | 271 | 15349 | 56.6 | 62.2 |
| 128 | 235 | 16325 | 69.5 | 77.9 |

So my suggestion would be to use igzip. And even then, I guess it would be more cost-effective to only use an instance with at most 64 cores.
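Wiring that in would look much like the unpigz process substitutions in the command earlier in this thread (a sketch; assumes the igzip binary from ISA-L is installed):

# igzip decompresses gzip considerably faster than zlib-based
# tools, which keeps the mapping threads fed.
strobealign -t 64 ref.fasta \
    <(igzip -dc reads.R1.fastq.gz) \
    <(igzip -dc reads.R2.fastq.gz) > /dev/null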

marcelm commented 1 week ago

With the changes from PR #418, the numbers look like this:

| threads | wall-clock time [s] | CPU time [s] | CPU usage | est. mapping-only CPU usage |
|---:|---:|---:|---:|---:|
| 32 | 616.6 | 18900 | 30.70 | 32.10 |
| 64 | 288.5 | 16724 | 57.97 | 63.96 |
| 128 | 168.5 | 17607 | 104.48 | 125.92 |

And with those changes, it may make sense to use an instance with 128 cores (unless adding samtools sort makes the overall CPU usage go down again).

I’m quite happy with this, TBH; this should make strobealign quite usable in settings with many cores.