Open vasudeva8 opened 10 months ago
I ran a spell checker over the DEMO and README files; the errors are highlighted in bold. Some were there before this PR, so it's not easy to add line comments here, but they are trivial to search for and fix.
Lines in README.md:

- "length equal or greather to given input"
- "This application shocases the use of mulitple queues,"

Lines in DEMO.md:

- "Read_multireg - This application showcases the usage of mulitple regionn"
- "Qtask_thread - This application shocases the use of mulitple queues, scheduling of different tasks and getting results in orderered and unordered fashion."
- "Pileup shows the transposed view of the SAM alignment data, i.e. it shows the the reference positions"
Spell check and other corrections made.
I have a rebased version here that we may want to use before making further changes. It fixes that bizarre GitHub-induced merge commit.
I haven't squashed anything yet, so all commits are the same content, but we can do that when we're ready to merge.
Thanks James, I have used this rebased version and pushed it. The thread pool is also now shared with the input and output files, to make it a better sample as mentioned.
I'm still finding qtask_ordered to be considerably slower than it needs to be.
I reimplemented the qtask_ordered problem using my own logic, mostly copied from the way htslib does multi-threading of SAM by generating blocks of BAM records to operate on and separate reading and writing threads to keep things running.
My code for this is here: https://github.com/jkbonfield/htslib/blob/PR_1666d/tv2.c
```
$ for i in `seq 1 5`;do taskset -c 0-15 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ./qtask_ordered ~/scratch/data/novaseq.10m.bam 16 /tmp;done
35.51 user  7.23 system  10.69 elapsed  399% %CPU
37.74 user  7.68 system  10.43 elapsed  435% %CPU
36.11 user  7.90 system  11.44 elapsed  384% %CPU
38.03 user  7.74 system  10.60 elapsed  431% %CPU
38.71 user  8.28 system  12.48 elapsed  376% %CPU
```
So the original is typically 10-11s and maybe around 400% CPU utilisation.
```
$ for i in `seq 1 5`;do taskset -c 0-15 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ../tv2 -@16 ~/scratch/data/novaseq.10m.bam /tmp/_.sam > /dev/null;done
18.33 user  3.06 system  2.33 elapsed  916% %CPU
19.42 user  3.43 system  2.67 elapsed  855% %CPU
19.27 user  3.56 system  2.64 elapsed  862% %CPU
18.40 user  3.05 system  2.40 elapsed  890% %CPU
19.43 user  3.54 system  2.70 elapsed  848% %CPU
```
This is 850-900% CPU utilisation and ~2.5s. So around 4x quicker. It's clear there is also a considerable difference in CPU usage, which I can't explain yet, but possibly it's the optimisations in the core counting algorithm. (I should probably remove those as it's not good at demonstrating the benefits of threading when the task we are performing is so quick.)
The outputs of out.sam and _.sam are identical, although I had to change my code a bit because we counted differently. I think the GC content in this PR is subtly wrong, in that it is the number of G and C bases over the sequence length, not over the number of A, C, G and T bases. It only differs when Ns are present (or ambiguity codes, but they didn't occur). We don't know what the N base is, so it shouldn't be counted in the maths.
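The counting distinction can be sketched in plain C (a standalone illustration, not code from this PR):

```c
/* GC fraction over called bases only: G+C divided by A+C+G+T,
 * so that N (and any ambiguity codes) are excluded from the maths.
 * Dividing by the full sequence length instead would under-report
 * GC whenever Ns are present. */
static double gc_fraction(const char *seq) {
    long gc = 0, acgt = 0;
    for (const char *p = seq; *p; p++) {
        switch (*p) {
        case 'G': case 'g': case 'C': case 'c':
            gc++; acgt++; break;
        case 'A': case 'a': case 'T': case 't':
            acgt++; break;
        default:
            break;  /* N and ambiguity codes: not counted */
        }
    }
    return acgt ? (double)gc / acgt : 0.0;
}
```

For "GCNNAT" this gives 2/4 = 0.5, whereas dividing by the sequence length would give 2/6.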
Anyway I digress. If we want a demonstration of threading then it needs to be performant.
I also have a different version of this (tv1) which is purely counting the frequency of each base type and reporting summary statistics. That doesn't write anything back. However it means the order in which we do our sums is irrelevant as A + B + C is the same as A + C + B. Hence this becomes a trivial change to unordered threads instead. With hindsight I think that's a simpler example, but the one here may still be relevant too and has practical purpose.
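The reason unordered completion suffices for the counting-only case can be shown with a minimal standalone sketch (illustrative C, not tv1 itself): per-chunk partial counts merge to the same totals in any order, because addition is commutative.

```c
/* Per-chunk base tallies; merging is order-independent, so worker
 * results can be collected as they complete rather than in input order. */
typedef struct { long a, c, g, t; } base_counts;

static void count_chunk(const char *seq, base_counts *bc) {
    for (; *seq; seq++) {
        switch (*seq) {
        case 'A': bc->a++; break;
        case 'C': bc->c++; break;
        case 'G': bc->g++; break;
        case 'T': bc->t++; break;
        }
    }
}

static void merge(base_counts *tot, const base_counts *bc) {
    tot->a += bc->a; tot->c += bc->c;
    tot->g += bc->g; tot->t += bc->t;
}
```

Merging chunk results as A + B or B + A yields identical totals, which is exactly why the ordered queue machinery is unnecessary when nothing is written back.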
I tried changing qtask_ordered to use "wb" to write to "out.bam" instead. It obviously then uses up more CPU, as compression of the output becomes significant, but it's still behind tv2's BAM output.
```
#qtask_ordered
100.16 user  4.93 system  10.18 elapsed  1031% %CPU
100.79 user  5.13 system  10.42 elapsed  1016% %CPU
100.58 user  4.86 system  10.18 elapsed  1035% %CPU
100.91 user  5.17 system  10.43 elapsed  1016% %CPU
100.39 user  5.15 system  10.44 elapsed  1010% %CPU

#tv2
84.84 user  2.06 system  6.31 elapsed  1375% %CPU
84.44 user  1.72 system  5.93 elapsed  1452% %CPU
84.55 user  1.98 system  5.93 elapsed  1458% %CPU
84.77 user  1.73 system  5.94 elapsed  1454% %CPU
84.38 user  1.81 system  6.02 elapsed  1431% %CPU
```
It's interesting that we still see a significant "system" time difference, which could be something to do with malloc/free and zeroing pages, or it could be something else such as pthread interfaces. I haven't profiled it to find out.
On a larger file this seems to be more pronounced too:
```
#qtask_ordered
1226.54 user  290.74 system  316.69 elapsed  479% %CPU

#tv2
710.26 user  17.12 system  49.46 elapsed  1470% %CPU
715.45 user  15.22 system  50.37 elapsed  1450% %CPU
```
System time here is vastly bigger, which is a big hint. Note it's possible to profile the main thread, and the main thread only, to see where it's spending the CPU time. That in turn may explain why it can't parallelise as well as it ought to. E.g.:
```
./qtask_ordered /local/scratch01/jkb/SRR6882909.name-.bam 16 /tmp & perf record -t $!
[ Wait and then control-C ]
```
perf report then shows:
```
 50.93%  qtask_ordered  [kernel]       [k] 0xffffffff99800190
 15.16%  qtask_ordered  libc-2.27.so   [.] _int_malloc
  9.10%  qtask_ordered  libc-2.27.so   [.] calloc
  6.76%  qtask_ordered  libc-2.27.so   [.] __libc_malloc
  4.54%  qtask_ordered  qtask_ordered  [.] bam_read1
  2.64%  qtask_ordered  libc-2.27.so   [.] __memmove_sse2_unaligned_erms
  1.75%  qtask_ordered  qtask_ordered  [.] bgzf_read
  1.67%  qtask_ordered  libc-2.27.so   [.] __lll_lock_wait_private
  1.11%  qtask_ordered  qtask_ordered  [.] sam_realloc_bam_data
  1.07%  qtask_ordered  qtask_ordered  [.] sam_read1
  0.82%  qtask_ordered  libc-2.27.so   [.] memcpy@GLIBC_2.2.5
  0.74%  qtask_ordered  qtask_ordered  [.] main
  0.63%  qtask_ordered  qtask_ordered  [.] bam_tag2cigar
```
It's somewhere in the kernel, but that's probably memory page clearing given the other high CPU functions. Looking at it, it seems to be spending most of its time allocating bam objects. This is one way my implementation majorly differs. I don't use bam_init1
and I don't free the bams between successive calls. I have a queue of bam arrays that I push onto and pull from, so I can reuse my bam-array from a previous job. Don't give the memory back to the system until program exit and it'll be MUCH faster!
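The reuse idea can be sketched generically (a hypothetical illustration in plain C with made-up names, not htslib's actual API; a real threaded version would guard the pool with a mutex):

```c
#include <stdlib.h>

/* A trivial buffer pool: keep a stack of previously allocated record
 * buffers and recycle them between jobs, instead of malloc/free per
 * record. Memory goes back to the system only at pool_destroy(). */
typedef struct {
    void **bufs;
    size_t n, cap, buf_size;
} buf_pool;

static void pool_init(buf_pool *p, size_t cap, size_t buf_size) {
    p->bufs = malloc(cap * sizeof *p->bufs);
    p->n = 0; p->cap = cap; p->buf_size = buf_size;
}

/* Get a buffer: reuse one from a previous job if available,
 * otherwise allocate fresh. */
static void *pool_get(buf_pool *p) {
    return p->n ? p->bufs[--p->n] : malloc(p->buf_size);
}

/* Return a buffer to the pool; only free it when the pool is full. */
static void pool_put(buf_pool *p, void *b) {
    if (p->n < p->cap) p->bufs[p->n++] = b;
    else free(b);
}

static void pool_destroy(buf_pool *p) {
    while (p->n) free(p->bufs[--p->n]);
    free(p->bufs);
}
```

In steady state every `pool_get` is served from the pool, so the kernel page-zeroing and `_int_malloc` time seen in the profile above largely disappears.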
Thanks for the sample James.
The aim of the sample was just to demonstrate the usage of queues and the hts thread pool, specifically the two different ways of using it, made as basic as possible for that purpose. It was not intended to show the user how best to use threading in general for maximum performance.
Based on earlier reviews, I have incorporated the sharing of the thread pool with the input and output files, as it gives some direction on hts thread pool usage. Taking it to the next level with separate read/write threads may make the sample complex and shift its objective.
But if we need such a sample, maybe we should add it as a separate one? Or just refer the user to samtools functionality?
Thanks James and Rob. I have merged and updated the samples. They now process alignments in chunks and reuse memory. The lookup/decoding of bases has also been moved from inside the loop to outside it. The unordered processing now just dumps the count of bases and the GC ratio.
I've not checked the code yet, but I can verify the speed is now similar to my own implementations.
I think there are some minor niceties with this directory to fix too.
```
flags: flags.c
```
Samples updated with fasta/fastq index usage, tabix index and thread pool queueing of tasks.