Open notestaff opened 7 months ago
The bgzf handling is built around using a threadpool, so it always spawns at least one thread.
It should be possible to use the constructor that takes a stream for the input.
If there should only be one thread, the gz_stream could be used, which should work for bgzf compressed files.
So:
Haven't tried it yet, but this would be my hacky workaround.
As for our code: It should be possible to just use gz (for input) if there is one thread requested. Not sure about performance implications for decompressing (probably none?). We can't really do it for output, because we would then write a gz file instead of bgzf.
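A minimal sketch of that decision, assuming a total thread budget where 1 means "main thread only". The names input_decoder, io_plan and plan_input are hypothetical and are not part of seqan3; the sketch only shows the selection logic, not the actual stream setup.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration only; none of this is seqan3 API.
enum class input_decoder { plain_gz, parallel_bgzf };

struct io_plan
{
    input_decoder decoder;      // which reader to use for the input
    std::size_t worker_threads; // extra threads spawned for decompression
};

// requested_threads: total CPUs the user allows (1 = main thread only).
inline io_plan plan_input(std::size_t requested_threads, bool input_is_bgzf)
{
    assert(requested_threads >= 1); // the main thread always exists

    // bgzf is valid gzip, so a serial gz reader can decode it without helpers.
    if (!input_is_bgzf || requested_threads == 1)
        return {input_decoder::plain_gz, 0};

    // Assumption: keep one CPU for the main thread, give the rest to the pool.
    return {input_decoder::parallel_bgzf, requested_threads - 1};
}
```

With this, plan_input(1, true) selects the serial gz reader with no worker threads, while plan_input(4, true) selects the bgzf reader with three workers. Output is deliberately left out, since switching the writer to gz would change the file format.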
This seems to me like a recurring issue and I am wondering if the mechanism to switch to gz-decompression in favor of bgzf-compression should be more straightforward to handle in the API.
I agree.
Another thing we had is that we used to write bgzf files when gz output was requested. bgzf is faster because it can be parallelised. However, bgzf is not the same as gz, though it's compatible. The binary representation is different and the file size differs (I think I had a case where a bgzf compressed FASTA file was 20% bigger than the gz compressed counterpart).
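To illustrate the "compatible but not the same" point: a bgzf file is a series of ordinary gzip members, each carrying an extra subfield with the identifier bytes 'B', 'C' (as described in the SAM/BAM specification). Below is a rough sketch of telling the two apart from the first bytes of a file. The names gz_kind and classify are made up for this example and are not seqan3 API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper for illustration; not part of seqan3.
enum class gz_kind { not_gzip, plain_gzip, bgzf };

inline gz_kind classify(std::vector<std::uint8_t> const & buf)
{
    // gzip member header (RFC 1952): ID1 = 0x1f, ID2 = 0x8b, CM = 8 (deflate).
    if (buf.size() < 10 || buf[0] != 0x1f || buf[1] != 0x8b || buf[2] != 8)
        return gz_kind::not_gzip;

    // FLG bit 2 (FEXTRA) announces an extra field; bgzf always sets it.
    if ((buf[3] & 0x04) == 0 || buf.size() < 12)
        return gz_kind::plain_gzip;

    // XLEN is a little-endian 16-bit length of the extra field.
    std::size_t const xlen = buf[10] | (buf[11] << 8);
    std::size_t const end = std::min(buf.size(), 12 + xlen);

    // Walk the subfields: SI1, SI2, SLEN (2 bytes), then SLEN bytes of payload.
    for (std::size_t pos = 12; pos + 4 <= end;)
    {
        std::size_t const slen = buf[pos + 2] | (buf[pos + 3] << 8);
        if (buf[pos] == 'B' && buf[pos + 1] == 'C' && slen == 2)
            return gz_kind::bgzf; // the block-size subfield that marks bgzf
        pos += 4 + slen;
    }
    return gz_kind::plain_gzip;
}
```

The per-block headers are also one reason a bgzf file tends to be larger than a single-stream gz file of the same data.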
True. Following this, I could make out the following four possible decisions that could be made by the user:
Is it possible for seqan3-based programs to use only one CPU? I tried setting seqan3::contrib::bgzf_thread_count to 1, but the BAM-reading program still uses 200% CPU according to GNU time: one main thread and one for seqan3's decompression. Looking at the code, setting seqan3::contrib::bgzf_thread_count to 0 would not be supported, correct?
I'm trying to make a CLI like that of samtools: using one CPU by default, with an option to specify additional CPUs. Is there a way to do that? Thanks! @eseiler