seqan / seqan3

The modern C++ library for sequence analysis. Contains version 3 of the library and API docs.
https://www.seqan.de

single-thread mode #3239

Open notestaff opened 7 months ago

notestaff commented 7 months ago

Is it possible for seqan3-based programs to use only one CPU? I tried setting seqan3::contrib::bgzf_thread_count to 1, but the BAM-reading program still uses 200% CPU according to GNU time: one main thread and one for seqan3's decompression. Looking at the code, setting seqan3::contrib::bgzf_thread_count to 0 would not be supported, correct?

I'm trying to make a CLI like that of samtools: using one CPU by default, with an option to specify additional CPUs. Is there a way to do that? Thanks! @eseiler

eseiler commented 7 months ago

The bgzf handling is built around using a threadpool, so it always spawns at least one thread.

It should be possible to construct the input from a stream instead of a filename.

If only one thread is wanted, the gz_stream could be used for that stream; it should also work for bgzf-compressed files.

So:

Haven't tried it yet, but this would be my hacky workaround.

As for our code: It should be possible to just use gz (for input) if there is one thread requested. Not sure about performance implications for decompressing (probably none?). We can't really do it for output, because we would then write a gz file instead of bgzf.

rrahn commented 5 months ago

This seems to me like a recurring issue, and I am wondering whether the mechanism for switching to gz-decompression instead of bgzf-decompression should be more straightforward to handle in the API.

eseiler commented 5 months ago

I agree.

Another issue we had is that we used to write bgzf files when gz output was requested. bgzf is faster because it can be parallelised. However, bgzf is not the same as gz, even though it is compatible: the binary representation is different and the file size differs (I think I had a case where a bgzf-compressed FASTA file was 20% bigger than its gz-compressed counterpart).
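To illustrate the difference in binary representation: per the SAM/BAM specification, a bgzf block is a regular gzip member whose header has the FEXTRA flag set and carries an extra subfield tagged 'B','C' of length 2 (the block size). A plain gz file usually lacks that subfield. The following is a minimal sketch of a check based on this; the function name looks_like_bgzf is hypothetical and not part of seqan3.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a seqan3 API): returns true if `buf` starts with a
// gzip member header that carries the BGZF "BC" extra subfield.
bool looks_like_bgzf(std::vector<std::uint8_t> const & buf)
{
    // Fixed gzip header: ID1, ID2, CM, FLG, MTIME(4), XFL, OS = 10 bytes,
    // followed by XLEN (2 bytes, little-endian) when FEXTRA is set.
    if (buf.size() < 12)
        return false;
    if (buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8) // gzip magic + deflate
        return false;
    if ((buf[3] & 0x04) == 0) // FEXTRA not set: plain gzip without extra field
        return false;

    std::size_t const xlen = buf[10] | (static_cast<std::size_t>(buf[11]) << 8);
    std::size_t pos = 12;
    std::size_t const end = pos + xlen;
    if (end > buf.size())
        return false;

    // Walk the extra subfields: SI1, SI2, SLEN (little-endian), payload.
    while (pos + 4 <= end)
    {
        std::size_t const slen = buf[pos + 2] | (static_cast<std::size_t>(buf[pos + 3]) << 8);
        if (buf[pos] == 'B' && buf[pos + 1] == 'C' && slen == 2)
            return true;
        pos += 4 + slen;
    }
    return false;
}
```

Because every bgzf block is still a valid gzip member, a serial gz reader can consume the file, while a bgzf reader depends on this subfield being present.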

rrahn commented 5 months ago

True. Following this, I can make out four possible decisions that the user could make:

On output

  1. Use bgzf for output compression
    • default by spec
    • random access support
    • serial (no separate compression thread) or parallel (at least two threads: 1 main, >= 1 compression worker)
    • Which mode is default? If parallel how many threads are default?
  2. Allow user to explicitly switch to gz-compression
    • no random access support
    • always single-threaded

On input

  1. Use bgzf-decompression if bgzf-compressed
    • default by spec
    • always parallel
    • serial (no separate decompression thread) or parallel (at least two threads: 1 main, >= 1 decompression worker)
  2. Allow user to explicitly use gz-decompression
    • always serial
    • independent of bgzf or gz-compression
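
The four decisions above could be sketched as a small planning step that maps a requested total thread count to a codec and a number of worker threads. This is only an illustration: the names codec, io_plan, plan_input and plan_output are hypothetical and do not exist in seqan3, and the serial bgzf output path is assumed here even though the current implementation always spawns at least one thread.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical types for illustration only; none of this is seqan3 API.
enum class codec { gz, bgzf };

struct io_plan
{
    codec format;               // which (de)compressor to use
    std::size_t worker_threads; // threads in addition to the main thread
};

// Input: gz-decompression can read both gz and bgzf data, so a single-thread
// request (or a plain gz file) maps to the serial gz path.
io_plan plan_input(bool file_is_bgzf, std::size_t total_threads)
{
    if (total_threads == 0)
        throw std::invalid_argument{"at least one thread is required"};
    if (!file_is_bgzf || total_threads == 1)
        return {codec::gz, 0};
    return {codec::bgzf, total_threads - 1}; // 1 main + (n - 1) workers
}

// Output: the codec is the user's choice, because gz and bgzf files differ
// on disk. gz is always serial; bgzf is serial only if one thread is requested.
io_plan plan_output(codec requested, std::size_t total_threads)
{
    if (total_threads == 0)
        throw std::invalid_argument{"at least one thread is required"};
    if (requested == codec::gz)
        return {codec::gz, 0};
    return {codec::bgzf, total_threads - 1};
}
```

With such a mapping, a samtools-like CLI could default to one thread in total and pass any additional threads through as workers.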