samtools / htslib

C library for high-throughput sequencing data formats

Demo / sample code update #1666

Open vasudeva8 opened 10 months ago

vasudeva8 commented 10 months ago

Samples updated with fasta/fastq index usage, tabix index, and thread pool queueing of tasks.

jkbonfield commented 8 months ago

I ran a spell checker over the DEMO and README files. Highlighted in bold. Some were there before, hence not easy to add comments to lines here, but it's trivial to search and fix.

Lines in README.md: "length equal or greather to given input" "This application shocases the use of mulitple queues,"

DEMO.md: "Read_multireg - This application showcases the usage of mulitple regionn " "Qtask_thread - This application shocases the use of mulitple queues, scheduling of different tasks and getting results in orderered and unordered fashion." "Pileup shows the transposed view of the SAM alignment data, i.e. it shows the the reference positions"

vasudeva8 commented 3 months ago

Spell check and other corrections made.

jkbonfield commented 2 months ago

I have a rebased version here that we may want to use before doing further changes. It fixes that bizarre github induced merge commit.

I haven't squashed anything yet, so all commits are the same content, but we can do that when we're ready to merge.

vasudeva8 commented 2 months ago

Thanks James, I have used this rebased version and pushed it. The thread pool is now also shared with the input and output files, to make it a better sample as mentioned.

jkbonfield commented 2 months ago

I'm still finding qtask_ordered to be considerably slower than it needs to be.

I reimplemented the qtask_ordered problem using my own logic, mostly copied from the way htslib does multi-threading of SAM: generating blocks of BAM records to operate on, with separate reading and writing threads to keep things running.

My code for this is here: https://github.com/jkbonfield/htslib/blob/PR_1666d/tv2.c
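The block-based pattern described above can be boiled down to a pure-pthreads sketch (hypothetical names, not the actual tv2.c or htslib code): workers grab numbered blocks of work, and a single drain loop plays the role of the writer thread, emitting results strictly in serial order regardless of which worker finishes first.

```c
#include <pthread.h>

#define NBLOCKS  16   // stands in for batches of BAM records
#define NTHREADS 4

static int results[NBLOCKS];
static int done[NBLOCKS];
static int next_in;   // next block serial to hand to a worker
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int serial = next_in < NBLOCKS ? next_in++ : -1;
        pthread_mutex_unlock(&lock);
        if (serial < 0) return NULL;

        int value = serial * serial;  // placeholder for real per-block work

        pthread_mutex_lock(&lock);
        results[serial] = value;
        done[serial] = 1;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
    }
}

// "Writer": drain results strictly in serial order, however the
// workers happen to finish.
static void run_pipeline(int *out) {
    next_in = 0;
    for (int i = 0; i < NBLOCKS; i++) done[i] = 0;
    pthread_t th[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&th[i], NULL, worker, NULL);
    for (int i = 0; i < NBLOCKS; i++) {
        pthread_mutex_lock(&lock);
        while (!done[i]) pthread_cond_wait(&cond, &lock);
        out[i] = results[i];
        pthread_mutex_unlock(&lock);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(th[i], NULL);
}
```

In htslib itself this ordering logic is what the hts_tpool ordered process queue provides; the sketch only shows why a dedicated in-order drain keeps the workers busy.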

```
$ for i in `seq 1 5`; do taskset -c 0-15 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ./qtask_ordered ~/scratch/data/novaseq.10m.bam 16 /tmp; done
35.51 user  7.23 system 10.69 elapsed   399% %CPU
37.74 user  7.68 system 10.43 elapsed   435% %CPU
36.11 user  7.90 system 11.44 elapsed   384% %CPU
38.03 user  7.74 system 10.60 elapsed   431% %CPU
38.71 user  8.28 system 12.48 elapsed   376% %CPU
```

So the original is typically 10-11s and maybe around 400% CPU utilisation.

```
$ for i in `seq 1 5`; do taskset -c 0-15 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ../tv2 -@16 ~/scratch/data/novaseq.10m.bam /tmp/_.sam > /dev/null; done
18.33 user  3.06 system 2.33 elapsed    916% %CPU
19.42 user  3.43 system 2.67 elapsed    855% %CPU
19.27 user  3.56 system 2.64 elapsed    862% %CPU
18.40 user  3.05 system 2.40 elapsed    890% %CPU
19.43 user  3.54 system 2.70 elapsed    848% %CPU
```

This is 850-900% CPU utilisation and ~2.5s. So around 4x quicker. It's clear there is also a considerable difference in CPU usage, which I can't explain yet, but possibly it's the optimisations in the core counting algorithm. (I should probably remove those as it's not good at demonstrating the benefits of threading when the task we are performing is so quick.)

The outputs out.sam and _.sam are identical, although I had to change my code a bit because we counted differently. I think the GC content in this PR is subtly wrong: it is the number of Gs and Cs over the sequence length, not over the number of A, C, G and T bases. It only differs when Ns are present (or ambiguity codes, but those didn't occur). We don't know what an N base is, so it shouldn't be counted in the maths.
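As a hypothetical helper (not code from this PR), counting G and C against A+C+G+T only would look like this; N and ambiguity codes drop out of the denominator instead of diluting the ratio:

```c
// GC fraction over called bases only: G+C divided by A+C+G+T.
// N and IUPAC ambiguity codes are skipped entirely.
static double gc_fraction(const char *seq) {
    long gc = 0, acgt = 0;
    for (; *seq; seq++) {
        switch (*seq) {
        case 'G': case 'g': case 'C': case 'c': gc++; acgt++; break;
        case 'A': case 'a': case 'T': case 't': acgt++; break;
        default: break; // N / ambiguity codes: not counted at all
        }
    }
    return acgt ? (double)gc / acgt : 0.0;
}
```

For "GCGCN" this gives 1.0, whereas dividing by the sequence length would give 0.8, which is the discrepancy described above.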

Anyway I digress. If we want a demonstration of threading then it needs to be performant.

I also have a different version of this (tv1) which is purely counting the frequency of each base type and reporting summary statistics. That doesn't write anything back. However it means the order in which we do our sums is irrelevant as A + B + C is the same as A + C + B. Hence this becomes a trivial change to unordered threads instead. With hindsight I think that's a simpler example, but the one here may still be relevant too and has practical purpose.
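The commutativity argument above is the whole trick: if each job only produces a per-chunk tally, the tallies can be merged in whatever order the jobs complete. A minimal sketch of that shape (hypothetical names, not the tv1 code itself):

```c
#include <stddef.h>

// Per-chunk base tallies; merging them is commutative and associative,
// so unordered job completion yields the same totals.
typedef struct { long a, c, g, t, other; } base_counts;

static void count_chunk(const char *seq, size_t len, base_counts *bc) {
    for (size_t i = 0; i < len; i++) {
        switch (seq[i]) {
        case 'A': bc->a++; break;
        case 'C': bc->c++; break;
        case 'G': bc->g++; break;
        case 'T': bc->t++; break;
        default:  bc->other++; break;
        }
    }
}

static void merge_counts(base_counts *total, const base_counts *part) {
    total->a += part->a; total->c += part->c;
    total->g += part->g; total->t += part->t;
    total->other += part->other;
}
```

Because merge order doesn't matter, an unordered thread-pool queue suffices and the writer-side reordering machinery disappears entirely.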

jkbonfield commented 2 months ago

I tried changing qtask_ordered to use "wb" to write to "out.bam" instead. It obviously then uses more CPU up as compression of the output becomes significant, but it's still behind tv2's BAM output.

```
#qtask_ordered
100.16 user 4.93 system 10.18 elapsed   1031% %CPU
100.79 user 5.13 system 10.42 elapsed   1016% %CPU
100.58 user 4.86 system 10.18 elapsed   1035% %CPU
100.91 user 5.17 system 10.43 elapsed   1016% %CPU
100.39 user 5.15 system 10.44 elapsed   1010% %CPU

#tv2
84.84 user  2.06 system 6.31 elapsed    1375% %CPU
84.44 user  1.72 system 5.93 elapsed    1452% %CPU
84.55 user  1.98 system 5.93 elapsed    1458% %CPU
84.77 user  1.73 system 5.94 elapsed    1454% %CPU
84.38 user  1.81 system 6.02 elapsed    1431% %CPU
```

It's interesting that we still see a significant "system" time difference, which could be something to do with malloc/free and zeroing pages, or it could be something else such as pthread interfaces. I haven't profiled it to find out.

On a larger file this seems to be more pronounced too:

```
#qtask_ordered
1226.54 user    290.74 system   316.69 elapsed  479% %CPU

#tv2
710.26 user 17.12 system    49.46 elapsed   1470% %CPU
715.45 user 15.22 system    50.37 elapsed   1450% %CPU
```

System time here is vastly bigger, which is a big hint. Note it's possible to profile the main thread, and the main thread only, to see where it's spending its CPU time. That in turn may explain why it can't parallelise as well as it ought to. E.g.:

```
./qtask_ordered /local/scratch01/jkb/SRR6882909.name-.bam 16 /tmp & perf record -t $!
[ Wait and then control-C ]
```

perf report then shows:
```
  50.93%  qtask_ordered  [kernel]            [k] 0xffffffff99800190
  15.16%  qtask_ordered  libc-2.27.so        [.] _int_malloc
   9.10%  qtask_ordered  libc-2.27.so        [.] calloc
   6.76%  qtask_ordered  libc-2.27.so        [.] __libc_malloc
   4.54%  qtask_ordered  qtask_ordered       [.] bam_read1
   2.64%  qtask_ordered  libc-2.27.so        [.] __memmove_sse2_unaligned_erms
   1.75%  qtask_ordered  qtask_ordered       [.] bgzf_read
   1.67%  qtask_ordered  libc-2.27.so        [.] __lll_lock_wait_private
   1.11%  qtask_ordered  qtask_ordered       [.] sam_realloc_bam_data
   1.07%  qtask_ordered  qtask_ordered       [.] sam_read1
   0.82%  qtask_ordered  libc-2.27.so        [.] memcpy@GLIBC_2.2.5
   0.74%  qtask_ordered  qtask_ordered       [.] main
   0.63%  qtask_ordered  qtask_ordered       [.] bam_tag2cigar
```

It's somewhere in the kernel, but that's probably memory page clearing, given the other high-CPU functions. Looking at it, it seems to be spending most of its time allocating bam objects. This is one way my implementation differs majorly: I don't use bam_init1 and I don't free the bams between successive calls. Instead I have a queue of bam arrays that I push onto and pull from, so I can reuse my bam array from a previous job. Don't give the memory back to the system until program exit and it'll be MUCH faster!
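The reuse idea sketched above, reduced to its essence (a hypothetical free-list, not the htslib API or the tv2.c code): pop a previously allocated buffer for the next job instead of malloc/free per record, and only hand memory back to the system at exit. A real multi-threaded version would guard the list with a mutex.

```c
#include <stdlib.h>

// A trivial single-threaded free-list of reusable buffers.
typedef struct pool_item {
    struct pool_item *next;
    size_t size;
    unsigned char data[];   // payload, e.g. a batch of bam records
} pool_item;

typedef struct { pool_item *head; } pool;

static pool_item *pool_get(pool *p, size_t size) {
    if (p->head && p->head->size >= size) {
        pool_item *it = p->head;   // reuse a previous allocation:
        p->head = it->next;        // no malloc, no page zeroing
        return it;
    }
    pool_item *it = malloc(sizeof(*it) + size);
    if (it) it->size = size;
    return it;
}

static void pool_put(pool *p, pool_item *it) {
    it->next = p->head;            // push back for reuse; never free()
    p->head = it;
}
```

This is why the kernel time collapses: steady-state jobs recycle warm pages instead of asking the kernel for freshly zeroed ones on every record.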

vasudeva8 commented 2 months ago

Thanks for the sample James.

The aim of the sample was just to demonstrate the usage of queues and the hts thread pool, specifically the two different ways of using it, made as basic as possible for that purpose. It was not intended to show the user how best to use threading in general for maximum performance.

Based on earlier reviews, I have incorporated the thread pool sharing between input and output files, as it gives some direction on hts thread pool usage. Taking it to the next level with separate read/write threads may make the sample complex and shift its objective.

But if we need such a sample, maybe we should add it as a separate one? Or just refer the user to the samtools functionality?

vasudeva8 commented 3 days ago

Thanks James and Rob. I have merged and updated the samples. They now process alignments in chunks and reuse memory. The lookup/decoding of bases has also been moved out of the loop. The unordered processing now just dumps the count of bases and the GC ratio.

jkbonfield commented 1 day ago

I've not checked the code yet, but I can verify the speed is now similar to my own implementations.

I think there are some minor niceties with this directory to fix too.