vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

vg giraffe always uses one more thread than specified #3108

Open glennhickey opened 3 years ago

glennhickey commented 3 years ago

command

vg giraffe -t 4 ...

top

29251 hickey    20   0 7251000 4.939g  17136 R 499.7 15.8  13:40.99 vg giraffe -o gaf -p -t 4 --rescue-algorithm ...

is that why it's so fast?

jeizenga commented 3 years ago

This happens in mpmap too. I thought maybe it was a quirk of libvgio?

jltsiren commented 3 years ago

Apparently the option -t in any mapper caps the number of OpenMP threads, while vg::io::StreamMultiplexer uses a single std::thread for writing the output. The multiplexer runs in a loop, polling the buffers for each mapper thread during each iteration. If there is nothing to write, the multiplexer yields at the end of the iteration. If there are free CPU cores available, this is effectively a busy loop, and the multiplexer occupies the entire core regardless of the amount of data it has to write.
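To make the polling pattern described above concrete, here is a minimal sketch. It is not the actual vg::io::StreamMultiplexer code: `Buffer`, `polling_writer`, and `run_polling_demo` are hypothetical names made up for illustration, and the "mappers" just sleep briefly and emit one byte each. The writer counts how many loop iterations found nothing to write, which is the work that keeps a core busy.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-mapper-thread output buffer (assumption, not the vg::io type).
struct Buffer {
    std::mutex lock;
    std::string pending;
};

// Polls every buffer in a loop and yields when an iteration wrote nothing.
// Returns the number of iterations that found no data, i.e. wasted polls.
std::size_t polling_writer(std::vector<Buffer>& buffers,
                           std::atomic<bool>& done,
                           std::string& sink) {
    std::size_t idle_iterations = 0;
    while (true) {
        // Read the flag before polling so data written before it was set is not missed.
        bool finished = done.load();
        bool wrote = false;
        for (Buffer& b : buffers) {
            std::lock_guard<std::mutex> guard(b.lock);
            if (!b.pending.empty()) {
                sink += b.pending;
                b.pending.clear();
                wrote = true;
            }
        }
        if (!wrote) {
            if (finished) break;
            ++idle_iterations;
            // Only a scheduler hint: with a free core this typically returns
            // immediately, so the loop spins.
            std::this_thread::yield();
        }
    }
    return idle_iterations;
}

// Runs one writer thread against num_mappers slow producers.
// Returns the writer's idle poll count; the collected output goes to sink.
std::size_t run_polling_demo(std::size_t num_mappers, std::string& sink) {
    std::vector<Buffer> buffers(num_mappers);
    std::atomic<bool> done{false};
    std::size_t idle_iterations = 0;

    std::thread writer([&] { idle_iterations = polling_writer(buffers, done, sink); });

    std::vector<std::thread> mappers;
    for (std::size_t i = 0; i < num_mappers; ++i) {
        mappers.emplace_back([&buffers, i] {
            // Stand-in for mapping work, during which the writer has nothing to do.
            std::this_thread::sleep_for(std::chrono::milliseconds(20));
            std::lock_guard<std::mutex> guard(buffers[i].lock);
            buffers[i].pending += 'x';
        });
    }
    for (std::thread& t : mappers) t.join();
    done.store(true);
    writer.join();
    return idle_iterations;
}
```

Calling `run_polling_demo(4, sink)` on a machine with a spare core will usually report a large idle count for only four bytes of output, although the exact number depends on the scheduler.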

Depending on the scheduler, the multiplexer may also effectively run in a busy loop even when there are no free CPU cores available. At least on my local Ubuntu VM with 8 cores, the speed difference between running 7 and 8 Giraffe mapping threads is negligible. It could be a bit more efficient to have the multiplexer thread sleep with std::condition_variable::wait() until a mapper thread notifies it.
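A rough sketch of the condition-variable idea, under simplifying assumptions: it uses a single shared queue rather than per-thread buffers, and `OutputQueue`, `push_chunk`, `finish`, `blocking_writer`, and `run_blocking_demo` are hypothetical names, not part of libvgio. The writer sleeps in `std::condition_variable::wait()` and only wakes when a mapper has queued data or signalled completion.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical shared output queue (assumption, not the libvgio layout).
struct OutputQueue {
    std::mutex lock;
    std::condition_variable ready;
    std::queue<std::string> chunks;
    bool done = false;
};

// Called by a mapper thread: queue a chunk, then wake the writer.
void push_chunk(OutputQueue& q, std::string chunk) {
    {
        std::lock_guard<std::mutex> guard(q.lock);
        q.chunks.push(std::move(chunk));
    }
    q.ready.notify_one();
}

// Called once after all mappers have finished.
void finish(OutputQueue& q) {
    {
        std::lock_guard<std::mutex> guard(q.lock);
        q.done = true;
    }
    q.ready.notify_one();
}

// Sleeps until there is data or the queue is finished.
// Returns how many times the writer woke up with something to do.
std::size_t blocking_writer(OutputQueue& q, std::string& sink) {
    std::size_t wakeups = 0;
    while (true) {
        std::queue<std::string> batch;
        bool finished = false;
        {
            std::unique_lock<std::mutex> guard(q.lock);
            // The predicate form of wait() handles spurious wakeups.
            q.ready.wait(guard, [&q] { return q.done || !q.chunks.empty(); });
            batch.swap(q.chunks);  // take everything queued so far
            finished = q.done;
        }
        ++wakeups;
        // Write outside the lock so mappers are not blocked by output.
        while (!batch.empty()) {
            sink += batch.front();
            batch.pop();
        }
        if (finished) break;
    }
    return wakeups;
}

// Runs one writer against num_mappers producers, each pushing chunks_per_mapper chunks.
std::size_t run_blocking_demo(std::size_t num_mappers,
                              std::size_t chunks_per_mapper,
                              std::string& sink) {
    OutputQueue q;
    std::size_t wakeups = 0;
    std::thread writer([&] { wakeups = blocking_writer(q, sink); });

    std::vector<std::thread> mappers;
    for (std::size_t i = 0; i < num_mappers; ++i) {
        mappers.emplace_back([&q, chunks_per_mapper] {
            for (std::size_t j = 0; j < chunks_per_mapper; ++j) push_chunk(q, "x");
        });
    }
    for (std::thread& t : mappers) t.join();
    finish(q);
    writer.join();
    return wakeups;
}
```

In this sketch the writer wakes at most once per queued chunk plus once for the finish signal, so it consumes no CPU while the mappers are busy. The cost is a lock and a notification on every push from a mapper thread, which might matter if chunks are very small; batching output per thread before pushing would be one way to reduce that.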