walaj / VariantBam

Filtering and profiling of next-generational sequencing data using region-specific rules

Options for parallelization/multi-core runs #13

Open chapmanb opened 6 years ago

chapmanb commented 6 years ago

Jeremiah; Now that we have VariantBam maximum coverage downsampling in bcbio (thank you again), we've been working on benchmarking runs with and without downsampling. Because we need to run it on sorted BAMs, it ends up running essentially on its own, consuming the streaming output from bamsormadup. Since it's single threaded, this adds a noticeable amount of time to processing whole genome BAMs (where the downsampling is most useful). I wanted to pick your brain to see if there are options to run multicore or otherwise parallelize maximum coverage downsampling to improve runtimes. My only thought was to try to parallelize by region, but then we start running into IO and merging bottlenecks which may not provide much of an improvement. Thanks for any suggestions or tips on speeding up the downsampling.
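For concreteness, the "parallelize by region" idea mentioned above would look roughly like the sketch below: plain htslib plus pthreads, one worker per chromosome, each with its own reader, iterator, and per-region output file, followed by a separate merge step. This is only an illustrative sketch, not VariantBam/bcbio code; the chromosome list and output names are stand-ins, the downsampling itself is omitted, and the final recombination (e.g. with samtools merge) is exactly the extra IO/merging overhead in question.

/* Hypothetical per-region parallelization sketch (plain htslib + pthreads). */
#include <pthread.h>
#include <stdio.h>
#include <htslib/sam.h>

typedef struct { const char *in_bam; const char *region; } job_t;

static void *copy_region(void *arg) {
    job_t *job = (job_t *)arg;
    samFile *in = sam_open(job->in_bam, "r");          /* each worker gets its own handle */
    bam_hdr_t *hdr = sam_hdr_read(in);
    hts_idx_t *idx = sam_index_load(in, job->in_bam);  /* requires a .bai index */

    char out_name[256];
    snprintf(out_name, sizeof out_name, "part.%s.bam", job->region);
    samFile *out = sam_open(out_name, "wb");
    sam_hdr_write(out, hdr);

    hts_itr_t *itr = sam_itr_querys(idx, hdr, job->region);
    bam1_t *b = bam_init1();
    while (sam_itr_next(in, itr, b) >= 0)
        sam_write1(out, hdr, b);        /* max-coverage downsampling would go here */

    bam_destroy1(b);
    hts_itr_destroy(itr);
    hts_idx_destroy(idx);
    sam_close(out);
    bam_hdr_destroy(hdr);
    sam_close(in);
    return NULL;
}

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s sorted_indexed.bam\n", argv[0]); return 1; }
    const char *regions[] = { "chr1", "chr2" };   /* a real run would cover every contig */
    pthread_t threads[2];
    job_t jobs[2];

    for (int i = 0; i < 2; i++) {
        jobs[i].in_bam = argv[1];
        jobs[i].region = regions[i];
        pthread_create(&threads[i], NULL, copy_region, &jobs[i]);
    }
    for (int i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);
    /* the per-region outputs still have to be recombined afterwards
     * (e.g. with samtools merge), which is the extra IO/merging step */
    return 0;
}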

walaj commented 6 years ago

Hi Brad, I've thought about this before, but I don't really see a good solution. I just used valgrind to profile, and as expected the majority of the compute is just reading and writing the reads. Given that even samtools doesn't do multi-core reading/writing, I think implementing this efficiently for VariantBam, without messing with the order of the reads or having to merge, would be very difficult.

Quick benchmarks:

samtools view BAM to SAM for 1m reads: 5.3s
variant BAM to SAM (no filter) for 1m reads: 6.2s
variant -m 100 BAM to SAM (max coverage 100) for 1m reads: 8.6s

So it looks like there's a small penalty for the coverage calcs / downsampling, but the bottleneck remains the reading/writing steps. One caveat is that the SeqAn library has a way to do multicore reading/writing, since most of the BAM read/write compute is in the BGZF compression/decompression rather than raw I/O.

chapmanb commented 6 years ago

Jeremiah; Thanks so much for this discussion and the practical examples. I explored this more with some examples and agree that the extra cost is primarily read/write, but I think there might be some room for optimization based on samtools view's multithreaded output, which eliminates most of this extra read/write overhead.

Essentially what I'm running is bamsormadup followed by VariantBam, and if I look at a subset of 2 million reads, VariantBam performs similarly to samtools view or bamsormadup:

samtools view -h | head -2000000 | bamsormadup        # 0m59.826s
samtools view -h | head -2000000 | samtools view -b   # 1m2.150s
samtools view -h | head -2000000 | variant - -b       # 1m0.361s

When I chain samtools with VariantBam, it is also fine:

samtools view -h | head -2000000 | samtools view -u | variant - -b   # 0m55.402s

But when I add VariantBam after bamsormadup, I end up with a pretty large runtime increase:

samtools view -h | head -2000000 | bamsormadup level=0 | variant - -b   # 1m27.716s

To be fair, this is also true when adding a samtools view afterward:

samtools view -h | head -2000000 | bamsormadup level=0 | samtools view -b   # 1m27.001s

However, if I parallelize samtools view with 2 threads (-@ 2), this removes most of the overhead:

samtools view -h | head -2000000 | bamsormadup level=0 | samtools view -b -@ 2   # 1m4.795s

FYI, I also don't think the maximum coverage calculations play much of a role; there is some overhead, but not nearly as much as the read/write costs:

samtools view -h | head -2000000 | bamsormadup level=0 | variant - -b --max-coverage 60000 --mark-as-qc-fail   # 1m38.585s

So there might be some nice improvement to be had if we can take advantage of multithreading support. I know VariantBam uses SeqLib, which uses htslib, but I don't know the internals well enough to tell whether it's feasible to leverage it similarly to how samtools does. Do you think this is worth exploring?

Thanks again for the discussion and help.

walaj commented 6 years ago

Ah, I hadn't appreciated that samtools could use parallelization for reading/writing (I thought it was just for sorting). That makes a big difference. To that end, it would be a matter of reviewing how samtools does this and incorporating it into SeqLib.

Full disclosure: my bandwidth for adding new features to SeqLib is really low at the moment. I'm happy to leave this issue open and address it when I get the chance -- it would be a fun project, but it might take some time for me to get to it. Thanks for pointing this out and for the suggestion; I hadn't realized that multicore might work for SeqLib without (hopefully) too much refactoring.

walaj commented 6 years ago

I ended up just taking a look at how samtools does this, and it's as simple as associating their thread-pool infrastructure with the hts file objects in SeqLib. Got it to work pretty quickly:

variant $na1 -b -t 5 -m 10 | head -n1000000 > tmp.bam  # 0m16.845s
variant $na1 -b        -m 10 | head -n1000000 > tmp1.bam # 1m1.350s

And converting the BAMs back to SAM and doing a diff shows they are the same. I'm excited this worked so easily; it could now translate to any SeqLib project!

When you update, just make sure to update SeqLib as well.
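For anyone curious what "associating their thread-pool infrastructure with the hts file objects" looks like at the htslib level, here is a minimal standalone sketch (plain htslib, not the actual SeqLib patch; the thread count and stdin/stdout handling are only illustrative): one thread pool shared by the reader and the writer, so both BGZF decompression and compression get worker threads.

/* Illustrative sketch of sharing an htslib thread pool between input and output. */
#include <stdio.h>
#include <htslib/hts.h>
#include <htslib/sam.h>
#include <htslib/thread_pool.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s in.bam > out.bam\n", argv[0]); return 1; }

    samFile *in  = sam_open(argv[1], "r");
    samFile *out = sam_open("-", "wb");              /* BAM to stdout */

    htsThreadPool p = { NULL, 0 };
    p.pool = hts_tpool_init(4);                      /* e.g. 4 worker threads */
    hts_set_opt(in,  HTS_OPT_THREAD_POOL, &p);       /* threaded BGZF decompression */
    hts_set_opt(out, HTS_OPT_THREAD_POOL, &p);       /* threaded BGZF compression */

    bam_hdr_t *hdr = sam_hdr_read(in);
    sam_hdr_write(out, hdr);

    bam1_t *b = bam_init1();
    while (sam_read1(in, hdr, b) >= 0)
        sam_write1(out, hdr, b);                     /* read filtering would go here */

    bam_destroy1(b);
    bam_hdr_destroy(hdr);
    sam_close(out);                                  /* flushes pending compression jobs */
    sam_close(in);
    hts_tpool_destroy(p.pool);                       /* destroy the pool only after closing */
    return 0;
}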

chapmanb commented 6 years ago

Jeremiah; Wow, thank you for looking at and implementing this so quickly. I did some testing and am seeing intermittent deadlocks/hangs with multiple threads. -t 0 always works, but with more threads I get regular hangs on both standard and streaming inputs. I'm not an expert at debugging these, but I hooked gdb up to a run, and when I ctrl-C the stuck job it errors out in the read:

(gdb) run Test1-sort.bam -b -t 2 > /dev/null
Starting program: /usr/local/bin/variant Test1-sort.bam -b -t 2 > /dev/null
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff676d700 (LWP 7383)]
[New Thread 0x7ffff5f6c700 (LWP 7384)]
[New Thread 0x7ffff576b700 (LWP 7385)]
^C
Program received signal SIGINT, Interrupt.
pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
238     ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S: No such file or directory.
(gdb) bt
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x000000000046ca7c in hts_tpool_next_result_wait (q=0x6e5120)
#2  0x000000000049e909 in bgzf_read_block (fp=fp@entry=0x6e4f60)
#3  0x000000000049f9ef in bgzf_read (fp=fp@entry=0x6e4f60, data=data@entry=0x7fffffffc6f0, length=length@entry=4)
#4  0x0000000000460524 in bam_hdr_read (fp=0x6e4f60)
#5  0x0000000000462c71 in sam_hdr_read (fp=0x6e4ef0)
#6  0x0000000000447d3a in SeqLib::_Bam::open_BAM_for_reading (this=this@entry=0x7fffffffc8b0, t=...)
#7  0x0000000000449bca in SeqLib::BamReader::Open (this=this@entry=0x7fffffffd180, bam=...)
#8  0x0000000000407420 in main (argc=<optimized out>, argv=<optimized out>)

Thank you again for looking at this and let me know if this helps provide any pointers I can use for debugging more.

ohofmann commented 6 years ago

Just a +1 from me. Being able to run this at scale would help immensely to streamline our current workflow.

walaj commented 6 years ago

Hi Brad, Hmm, I'm really not sure what is happening. I believe the problem exists on your build, but I am not able to recreate it here. Perhaps using helgrind would give more information?

http://valgrind.org/docs/manual/hg-manual.html

I've not used helgrind much, but I use other valgrind components extensively. If I get more time I can take a look with helgrind as well, but it might take a bit to get to it.

chapmanb commented 6 years ago

Jeremiah; Thanks for the ideas for debugging this and for helping keep us moving. I'm not sure why we're seeing different behavior, but it locks up consistently for me on both my local machine (Ubuntu 14.04) and AWS instances (Ubuntu 16.04). It seems worse on AWS, so I wonder if there is some relation between SSD disk speed and the lockups, but that is pure speculation. If you'd like to try the same build I have, you can install it into an isolated conda virtual env with:

wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh -b -p venv
venv/bin/conda install -c conda-forge -c bioconda libgcc xz bzip2 zlib
wget https://s3.amazonaws.com/cloudbiolinux/bioconda/variantbam-1.4.4a-2.tar.bz2
venv/bin/conda install variantbam-1.4.4a-2.tar.bz2
venv/bin/variant --help

I tried using helgrind, but to my eyes it doesn't give any extra useful information over the gdb trace above. Here are logs from -t 0, which does not deadlock, and -t 2, which does:

https://gist.github.com/chapmanb/9e2fb7d6e79f9c1380375ed4f218982d

When ctrl-C-ing out of the deadlock I get a similar stack trace to what we saw with gdb.

I'm still stuck on what to try next, but hopefully either reproducing it or some info in the helgrind output gives clues about how to proceed. Thank you again for all the help with this.

brainstorm commented 6 years ago

@chapmanb @walaj I can reproduce this issue, compiling by hand as shown in the README on Ubuntu zesty running on AWS:

# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 17.04
Release:        17.04
Codename:       zesty

Prefixing the run with strace (strace variant Test1-sort.bam -t12 -b -o /dev/null) shows it gets stuck right after reading the BAM:

open("Test1-sort.bam", O_RDONLY)        = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=3883039, ...}) = 0
read(3, "\37\213\10\4\0\0\0\0\0\377\6\0BC\2\0e\2\335\226]o\2330\24\206\331\244Ms\177\205"..., 4096) = 4096
mmap(NULL, 135168, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd1bbd19000
mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fd1ae743000
mprotect(0x7fd1ae743000, 4096, PROT_NONE) = 0
clone(child_stack=0x7fd1aef42fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fd1aef439d0, tls=0x7fd1aef43700, child_tidptr=0x7fd1aef439d0) = 82767
futex(0x556b0269b76c, FUTEX_WAIT_PRIVATE, 1, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x556b0269b740, FUTEX_WAKE_PRIVATE, 1) = 0
gettimeofday({1503144444, 23068}, NULL) = 0
futex(0x556b0269b814, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1503144454, 23068000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1503144454, 23330}, NULL) = 0
futex(0x556b02699258, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x556b0269b814, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 3, {1503144464, 23330000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1503144464, 23566}, NULL) = 0
futex(0x556b02699258, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x556b0269b814, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 5, {1503144474, 23566000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)

It's easy to reproduce the above trace with tiny BAM files and a fresh master clone of VariantBam from yesterday:

$ git clone https://github.com/roryk/tiny-test-data
$ bowtie2 -x tiny-test-data/genomes/Hsapiens/hg19/bowtie2/hg19 -U tiny-test-data/wgs/mt_1.fq.gz -S example.sam
$ samtools view -b -S -o example.bam example.sam
$ strace variant example.bam -t12 -b -o /dev/null

The issue can also be seen with another tool, mutextrace:

root@ip-172-31-13-76:/home/ubuntu# mutextrace variant example.bam -t12 -b -o /dev/null
[1] mutex_init(1, FAST)
[1] mutex_init(2, FAST)
[1] mutex_init(3, FAST)
[1] mutex_init(4, RECURSIVE)
[1] mutex_lock(4)
[2] started
[2] mutex_lock(4) <waiting for thread 1> ...
[3] started
[3] mutex_lock(4) <waiting for thread 1> ...
[4] started
[4] mutex_lock(4) <waiting for thread 1> ...
[5] started
[5] mutex_lock(4) <waiting for thread 1> ...
[6] started
[6] mutex_lock(4) <waiting for thread 1> ...
[7] started
[7] mutex_lock(4) <waiting for thread 1> ...
[8] started
[8] mutex_lock(4) <waiting for thread 1> ...
[9] started
[9] mutex_lock(4) <waiting for thread 1> ...
[10] started
[10] mutex_lock(4) <waiting for thread 1> ...
[11] started
[11] mutex_lock(4) <waiting for thread 1> ...
[12] started
[12] mutex_lock(4) <waiting for thread 1> ...
[13] started
[13] mutex_lock(4)
[1] mutex_unlock(4)
[2] ... mutex_lock(4)
[1] mutex_lock(4) <waiting for thread 2> ...
(no more output...)
^C
brainstorm commented 6 years ago

I hate to introduce more entropy into this issue, but a quick threading stress test on OSX does not lock up, regardless of the number of threads:

$ for threads in `seq 1 100`; do time variant Test1-sort.bam -t $threads -b -o /dev/null; done

It prints the timings, and all executions come back in 1.4 seconds :-S

If I remove -o /dev/null, the BAM files look well formed too, so no errors there either.