haotianteng / Chiron

A basecaller for Oxford Nanopore Technologies' sequencers

Optimized basecalling speed ~4x #45

Closed nmiculinic closed 6 years ago

nmiculinic commented 6 years ago

It works by parallelizing the logit decoding step with logit generation. The logit decoding step (beam/greedy search) is a poor fit for the GPU, so it is performed on the CPU.

To get into more detail: the previous code first computed logits from the raw signal on the GPU, copied the data to RAM, copied the logits into Python space (a numpy array), copied them back into TensorFlow space, and then executed the decoding step on the CPU, all of it sequentially. I noticed poor CPU and GPU utilization with this approach.

This speedup works as follows. It sets up TensorFlow logits and decoding queues, which hold the results of the raw signal -> logits (GPU bound) and logits -> decoded (CPU bound) steps. By decoupling these operations into a pipeline instead of running them serially, both the GPU and CPU reach better saturation and utilization. If the logit queue keeps growing (shown as logits_q in the "signal processing" progress bar), you're CPU bound (or don't have enough threads for the decode queue). Otherwise, you're probably GPU bound and should check with the usual tools; for my Linux workstation and NVIDIA GPUs I use the following commands:

htop     # CPU utilization
iostat -x  # disk utilization
nvidia-smi  # GPU utilization 
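To illustrate the queue-based decoupling described above, here is a minimal producer/consumer sketch in plain Python (threading + queue). It is not Chiron's actual code: run_network_on_gpu and decode_on_cpu are hypothetical stand-ins for the raw signal -> logits and logits -> bases steps.

```python
import queue
import threading
import time

# Hypothetical stand-ins for the two pipeline stages (illustration only):
def run_network_on_gpu(batch):
    """Raw signal -> logits; GPU-bound step in the real basecaller."""
    time.sleep(0.01)
    return batch

def decode_on_cpu(logits):
    """Logits -> decoded bases (beam/greedy); CPU-bound step."""
    time.sleep(0.01)
    return logits

logits_q = queue.Queue(maxsize=64)  # bounded queue connecting the two stages
SENTINEL = None

def producer(batches):
    # Fill the queue as fast as the GPU produces logits.
    for batch in batches:
        logits_q.put(run_network_on_gpu(batch))
    logits_q.put(SENTINEL)          # tell the consumer there is no more work

def consumer(out):
    # Drain the queue and decode on the CPU, concurrently with the producer.
    while True:
        logits = logits_q.get()
        if logits is SENTINEL:
            break
        out.append(decode_on_cpu(logits))

decoded = []
t_prod = threading.Thread(target=producer, args=(range(100),))
t_cons = threading.Thread(target=consumer, args=(decoded,))
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()
```

Watching logits_q.qsize() in this sketch plays the same role as the logits_q counter in the progress bar: a queue that stays full suggests the CPU decoding stage is the bottleneck.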

Here are the results for tests on a Titan Xp for some reads I had:

GPU --> ~23 s per iteration (with this improvement)
GPU --> ~96 s per iteration (previous code)

Test plan: I ran the example data through the basecaller before/after:

docker run --rm -it -v $(pwd)/thread:/tmp nmiculinic/chiron:thread chiron call --batch_size 1000 -i /opt/chiron/chiron/example_data -o /tmp

And compared the sha256 sums:

find thread -type f -exec sha256sum {} \;
2830b95651d7043efb265644ad4a4423a86cf1e70e57210fe50e7b3f20b33cee  thread/meta/all.meta
6855666454fca7390ba372f30e18fba8abc2d9209283177d2f87340fe3e17bc7  thread/meta/read1.meta
ca52ed0b52109d175de94b14380dbf2e382e4ef7be2a0f3f9f2d3860cc6d8ae6  thread/meta/read2.meta
b6b78b1ab307108a1372d9b73c19182c49381ecde6b418f8646bf4dd2d354c9d  thread/meta/read3.meta
30e451f7952b9e73d8baa375da0ca7e8e5da5eca7dab2e79dadf00551c1e0c2e  thread/meta/read4.meta
17f860021bab1ddb37f67811439f19569b72b8272b8f8092a90148161f68a260  thread/meta/read5.meta
2a6d1c639570f782adc180c6b3af7eefb7c3d56ea69b7e130d52ac9580176878  thread/result/read1.fastq
d2ee2f7d7f0bc6121b68777744b64574228ae0096371db53479ccae7700dc714  thread/result/read2.fastq
29060f3bef7e40b1ef73a9f6e51c2da6683956cb3bd06636b80e30f04d2cb17d  thread/result/read3.fastq
92788630f260fb530c8ebd32d860d4ef3ff7b310438809b41dac9194049975a5  thread/result/read4.fastq
7970d1bbfc6779b3c2bfe6ea8e0149434469fdea9f5a4aa4b6693a2c38ce9bbe  thread/result/read5.fastq
746942a0010fdecc9bdf7c713087af06e64df21335a2a47a2f1e800c3e261972  thread/segments/read1.fastq
1d5615185e5c28cd8404ceb95b6b2e4d0a5fb68b265e1ef508a9b63562bc770e  thread/segments/read2.fastq
643e85354720c484d1152b0713845ac273ec4ff0a173e225b2901b43fa53aa94  thread/segments/read3.fastq
cdbf07ab8543deff521e7114c3942074e6327ae15e08e290140cb3572a83ca31  thread/segments/read4.fastq
de641d78f210bac01dbb2f67bb19947731deab882699d4eff3a4d12a8174fb13  thread/segments/read5.fastq
0c724c2a5bb2ff5768b4fa2eecc36c47f9e1c650230478d2f8896de755810eea  thread/raw/read5.signal
ad32825e9581cd3443db24b82645a64832c7856e24213933adb4cddfb7fdf157  thread/raw/read4.signal
45e96cae583e328b3b9e7049a6cf754deccfc1dab0a876c261c73f6ca97dee31  thread/raw/read3.signal
e4134ab8dd756395b5ca80edecacb22767ded7f2c486c11995133d239a4d1005  thread/raw/read2.signal
c6e825fb2819aa7698abc3db949f4ea52f9b0f9e3a1089ce30205166abe8bf9e  thread/raw/read1.signal
find original -type f -exec sha256sum {} \;
e74b02ec89c7e41cfec0fb39e24b369eaa1efb31a452aae8b36049a40d2b3973  original/meta/all.meta
8bd10dd7310873a16cf417c8fd056460904c3ca78c19547e928fcd050227e748  original/meta/read1.meta
59ea7edeab35a834b800105cb0b16ddd4f4dff427c6de8d7852a9e5944a206f8  original/meta/read2.meta
dfc374c81aa777fa312c1eb5a7a2785e09c7ff084fab93eb9c4ab94ae46afa71  original/meta/read3.meta
cfcdcf435fe31ffce6666ffe878fdd7299de08cacc9db5887d1f11896713ee88  original/meta/read4.meta
f8eed2c687c24ff905020d6d8533beb31b576aff7243d69b0a664576f956d550  original/meta/read5.meta
2a6d1c639570f782adc180c6b3af7eefb7c3d56ea69b7e130d52ac9580176878  original/result/read1.fastq
d2ee2f7d7f0bc6121b68777744b64574228ae0096371db53479ccae7700dc714  original/result/read2.fastq
29060f3bef7e40b1ef73a9f6e51c2da6683956cb3bd06636b80e30f04d2cb17d  original/result/read3.fastq
92788630f260fb530c8ebd32d860d4ef3ff7b310438809b41dac9194049975a5  original/result/read4.fastq
7970d1bbfc6779b3c2bfe6ea8e0149434469fdea9f5a4aa4b6693a2c38ce9bbe  original/result/read5.fastq
746942a0010fdecc9bdf7c713087af06e64df21335a2a47a2f1e800c3e261972  original/segments/read1.fastq
1d5615185e5c28cd8404ceb95b6b2e4d0a5fb68b265e1ef508a9b63562bc770e  original/segments/read2.fastq
643e85354720c484d1152b0713845ac273ec4ff0a173e225b2901b43fa53aa94  original/segments/read3.fastq
cdbf07ab8543deff521e7114c3942074e6327ae15e08e290140cb3572a83ca31  original/segments/read4.fastq
de641d78f210bac01dbb2f67bb19947731deab882699d4eff3a4d12a8174fb13  original/segments/read5.fastq
0c724c2a5bb2ff5768b4fa2eecc36c47f9e1c650230478d2f8896de755810eea  original/raw/read5.signal
ad32825e9581cd3443db24b82645a64832c7856e24213933adb4cddfb7fdf157  original/raw/read4.signal
45e96cae583e328b3b9e7049a6cf754deccfc1dab0a876c261c73f6ca97dee31  original/raw/read3.signal
e4134ab8dd756395b5ca80edecacb22767ded7f2c486c11995133d239a4d1005  original/raw/read2.signal
c6e825fb2819aa7698abc3db949f4ea52f9b0f9e3a1089ce30205166abe8bf9e  original/raw/read1.signal
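As an aside (not part of the PR itself): the result/, segments/ and raw/ hashes match between the two runs, and only the meta/ files differ. Below is a small hypothetical Python helper that does the same per-file comparison directly, assuming the two output directories original/ and thread/ produced by the commands above.

```python
import hashlib
from pathlib import Path

def sha256_tree(root):
    """Map relative path -> sha256 hex digest for every file under root."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*")) if p.is_file()
    }

before = sha256_tree("original")
after = sha256_tree("thread")
for rel in sorted(set(before) | set(after)):
    if before.get(rel) != after.get(rel):
        print("DIFFERS:", rel)
```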
nmiculinic commented 6 years ago

For some reason, merging master branch commit 114bc25 breaks the code... investigating.

nmiculinic commented 6 years ago

After investigation I've concluded:

haotianteng commented 6 years ago

Thanks a lot for the help! I will check it for merging. We had been trying to do this parallelization before; glad you made it work. I have done some benchmarks:

Speed test for the beam search decoder:

beam   0      1      2      3      5      10
mean   1.522  1.737  1.982  2.212  2.581  3.461
std    0.028  0.152  0.157  0.161  0.159  0.160

~0.194 s per beam width per batch (3000*512) on 1 CPU
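For reference (not part of this PR), here is a rough sketch of how such a beam-width timing could be reproduced with tf.nn.ctc_beam_search_decoder on random inputs, assuming a TF 1.x environment; the shapes and beam widths below are illustrative, not the exact benchmark setup above.

```python
import time
import numpy as np
import tensorflow as tf  # assumes a TF 1.x environment

max_time, batch_size, num_classes = 300, 32, 5   # illustrative shapes only
logits = tf.constant(
    np.random.randn(max_time, batch_size, num_classes).astype(np.float32))
seq_len = tf.fill([batch_size], max_time)

with tf.Session() as sess:
    for beam_width in (1, 2, 3, 5, 10):
        # ctc_beam_search_decoder expects [max_time, batch_size, num_classes]
        decoded, _ = tf.nn.ctc_beam_search_decoder(
            logits, seq_len, beam_width=beam_width, top_paths=1)
        start = time.time()
        sess.run(decoded)
        print("beam_width=%d: %.3f s" % (beam_width, time.time() - start))
```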

Solution: use the GPU to run the neural network and the CPU to run the beam search (on a 1080Ti, roughly 8 beam width per CPU).

An ideal setting would be a 1080Ti + 4 CPUs with a beam width of 30.

However, the beam search decoder in TensorFlow does not support multithreading: https://github.com/tensorflow/tensorflow/issues/17136 So I am still waiting for TF to enable it, but it's awesome that you have made it work; really appreciate it.