samtools / htslib

C library for high-throughput sequencing data formats
Other
812 stars 447 forks source link

a patch to fix some broken 'wbu' case #1101

Open wangyugui opened 4 years ago

wangyugui commented 4 years ago

Hi,

We still need "wbu" mode for pipe output, such as bwa | samtools fixmate | samtools sort

but now 'samtools view -u ' is changed from 'wbu' to 'wb0', and the 'wbu' support in the source is broken.

'wb0' does NOT compress data, but there is still a slow CRC function.

We need NOT CRC fuctions in some case. 1)for pipe output, 2) file system such as XFS and btrfs already have CRC checksum.

diff --git a/bgzf.c b/bgzf.c
index 0a76676..4fe4592 100644
--- a/bgzf.c
+++ b/bgzf.c
@@ -1779,6 +1779,7 @@ static inline int lazy_flush(BGZF *fp)
 int bgzf_flush(BGZF *fp)
 {
     if (!fp->is_write) return 0;
+    if (!fp->is_compressed ) return 0;
 #ifdef BGZF_MT
     if (fp->mt) {
         int ret = 0;

This is a dirty patch that fixed some broken path. But I'm not sure whether it is right.

Best Regards Wang Yugui

pd3 commented 4 years ago

Out of curiosity, have you done any benchmarks? When you say slow CRC, I am wondering how slow it really is.

wangyugui commented 4 years ago

@pd3

Yet no official benchmarks, but there are some cases.

case1: In a pipline such as 'bwa | samtools fixmate | samtools sort |samtools markdup', we notice that sometime 'crc' of some process( 'wb0' ) is become the bottleneck. that process is 100% CPU and pstack show it is running 'CRC'.

our server have 40/80 CPU/core.

case2: some small tool I wrote in 'wb0' mode, perf tool report

  15.55%  kgutil   libz.so.1.2.7       [.] crc32
  12.53%  kgutil   kgutil              [.] fai_retrieve.isra.4
   9.46%  kgutil   kgutil              [.] bgzf_getc
wangyugui commented 4 years ago

And we need 'wbu' mode for btrfs transparent compress too?

daviesrob commented 4 years ago

The BAM specification says that BAM files are wrapped in BGZF blocks, and BGZF always includes a CRC checksum. Uncompressed BAM file is still BGZF-wrapped, the difference is that it uses a deflate option which stores the data without compressing it. BGZF blocks without a valid CRC will be rejected by some tools (including ones using HTSlib) even if uncompressed.

What you may have found is that HTSlib will happily read BAM data without the BGZF wrapper (and so avoids the CRC overhead). Currently there's no way to write data like this though, although I do occasionally entertain the idea specifically for the case of writing to pipes. If we did enable this, the data would not be strictly valid BAM and may be rejected by tools that can't detect the unwrapped format. It would also remove some of the data integrity protection (the CRC blocks and the EOF block used to indicate the the file is complete). It would always need to be enabled as an option; we could never make it the default behaviour even if we detected that the output is a pipe.

If you're finding the CRC checks slow, for now you can build HTSlib with libdeflate support. On modern Debian/Ubuntu systems you can apt install libdeflate-dev and HTSlib's configure script should find and use it; otherwise see HTSlib's INSTALL file. Libdeflate includes much faster CRC and deflate implementations, and considerably speeds up HTSlib (see benchmarks here).

If you're storing files on disk, you should do so using CRAM instead of BAM. CRAM files can be as small as half the size of the equivalent BAM (examples here) so save a considerable amount of space.

wangyugui commented 4 years ago

we need both solution for these two cases 1) fast access for intermediate file. 2) small space for final file.

ssd/nvme become faster, and more cpu/core is coming, we need a solution for '1) fast access for intermediate file.'