samtools / htslib

C library for high-throughput sequencing data formats

bgzip performance drop from v1.16 #1767

Open koriege opened 2 months ago

koriege commented 2 months ago

Hi,

I stumbled across an issue while testing multiple versions of htslib/bgzip. With the 2023 release of a blazingly fast decompressor, rapidgzip (https://github.com/mxmlnkn/rapidgzip), which achieves a throughput of 1.5 GB/s on my machine, I observed a massive difference in compression performance between bgzip versions <= 1.16 and >= 1.17, which is not attributable to the change of the default compression level (as of v1.16) or to cold-cache effects.

v1.13 - v1.15

rapidgzip -P 8 -kdc --import-index fastq.gzri fastq.gz | bgzip -l 6 -kc -@ 128 | pv > /dev/null
4.40GiB 0:00:12 [ 355MiB/s]

v1.16 (with adapted compression level)

rapidgzip -P 8 -kdc --import-index fastq.gzri fastq.gz | bgzip -l 5 -kc -@ 128 | pv > /dev/null
4.40GiB 0:00:12 [ 351MiB/s]

v1.17 - v1.19 (with adapted compression level)

rapidgzip -P 8 -kdc --import-index fastq.gzri fastq.gz | bgzip -l 5 -kc -@ 128 | pv > /dev/null
4.40GiB 0:00:17 [ 260MiB/s]
jkbonfield commented 2 months ago

Did you build all of them with libdeflate? samtools --version should report the libraries found ("libdeflate=yes"), assuming you have the entire ecosystem installed. If not, running strings on the bgzip binary and grepping the output would tell you.

I can't see much difference here:

$ v=1.19.1;for i in 1 2 3;do /software/common-apps/samtools-team-managed/htslib/$v/bin/bgzip -@64 -c -l5 < /tmp/_10g | pv > /dev/null;done
3.16GiB 0:00:03 [ 850MiB/s] [     <=>                                                                  ]
3.16GiB 0:00:04 [ 757MiB/s] [       <=>                                                                ]
3.16GiB 0:00:04 [ 747MiB/s] [       <=>                                                                ]

$ v=1.16;for i in 1 2 3;do /software/common-apps/samtools-team-managed/htslib/$v/bin/bgzip -@64 -c -l5 < /tmp/_10g | pv > /dev/null;done
3.16GiB 0:00:03 [ 847MiB/s] [     <=>                                                                  ]
3.16GiB 0:00:04 [ 714MiB/s] [       <=>                                                                ]
3.16GiB 0:00:04 [ 727MiB/s] [       <=>                                                                ]

That said, it's not gaining much beyond 32 threads (this is a 64-core EPYC).

koriege commented 2 months ago

Thank you for the reply. I can confirm the peak performance at 32 threads on our EPYC too. In the meantime I have been investigating this issue further. I compiled htslib with and without libdeflate, turned off NUMA and SMT, fixed the CPU frequencies, and tested from an NVMe SSD and from /dev/shm. Unfortunately, in all of these test cases I made the same observation as stated above when using rapidgzip to decompress and then bgzip to compress a fastq data stream (either directly or within a longer pipeline). When compressing an uncompressed file, or piping it in via cat, there is no observable difference in throughput. Now I am a little puzzled as to what actually causes the behavior: bgzip or rapidgzip. Can you confirm a difference in performance on your system when using rapidgzip? The version I used was the latest installed via pip (v0.13.1).

jkbonfield commented 2 months ago

I haven't tried rapidgzip yet, but you could try adding mbuffer in there too (i.e. uncompress | mbuffer | compress), as that will provide a bigger buffer and help decouple the threading. Otherwise you're bound by the small pipe buffer size.

jkbonfield commented 2 months ago

I can confirm I see a difference, but only when reading from a pipe. Reading direct from a file it's the same speed. I'm not sure yet what would have affected that. Maybe some additional error checking somewhere. I'd need to do some bisecting to figure it out.

jkbonfield commented 2 months ago

So it's down to e4957188734d8d977f3b494983e28156af9a258e, which added autodetection of text data and ensured bgzip blocks end on newlines for text-based formats. This improves random access capabilities for certain file types.

However, even with --binary to enable the old behaviour, it still doesn't regain the speed. This is perhaps in part due to switching from UNIX I/O to an hfile with additional layers of indirection. That said, hfile could be faster on some filesystems, particularly if we're using small block sizes (which it will be, defaulting to 0xff00 I think). It doesn't help that hfile's default block size is 32768, i.e. less than a single BGZF block.

Adding an explicit hfile_set_blksize(f_src, 1e6) call in there regains the lost speed for binary mode. Oddly this is only exported in hfile_internal.h; I'm not sure why we didn't make it public. Otherwise we can only change it with hts_set_opt, which takes an htsFile rather than an hFILE, which is yet another level of indirection. (That said, bgzip is internal to htslib, so I don't mind using the internal function.)
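For illustration only, here is a minimal standalone sketch of that kind of change, assuming the hfile_set_blksize(hFILE *, size_t) prototype from hfile_internal.h (an internal symbol, so it may not be exported from the shared library) and with most error handling omitted:

/* Sketch: enlarge the hFILE read buffer on a bgzip-style input stream. */
#include <stdio.h>
#include <htslib/hfile.h>
#include <htslib/bgzf.h>

/* Internal API; inside htslib this comes from hfile_internal.h. */
extern int hfile_set_blksize(hFILE *fp, size_t bufsiz);

int main(void) {
    hFILE *f_src = hopen("-", "r");            /* read from stdin, e.g. a pipe */
    if (!f_src) { perror("hopen"); return 1; }

    /* Override the stat()-derived buffer size (4KB for a pipe) with 1MB. */
    if (hfile_set_blksize(f_src, 1 << 20) < 0)
        fprintf(stderr, "warning: could not resize the hFILE buffer\n");

    BGZF *f_dst = bgzf_open("-", "w5");        /* bgzf output to stdout, level 5 */
    if (!f_dst) { perror("bgzf_open"); return 1; }
    bgzf_mt(f_dst, 8, 256);                    /* enable the bgzf thread pool */

    char buf[0x10000];
    ssize_t n;
    while ((n = hread(f_src, buf, sizeof buf)) > 0)
        if (bgzf_write(f_dst, buf, n) < 0) { fprintf(stderr, "bgzf_write failed\n"); break; }

    bgzf_close(f_dst);
    if (hclose(f_src) < 0) perror("hclose");
    return 0;
}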

I am a bit surprised to see such a performance difference with hfile though.

Commit 2ff03d34

@ ; for i in `seq 1 5`;do cat < ~/scratch/data/enwik9 | ~/work/samtools_master/htslib/bgzip -l5 -@64 |pv > /dev/null;done
 323MiB 0:00:00 [ 466MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 502MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 500MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 423MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 517MiB/s] [ <=>                                                                        ]

Commit e4957188

@ ; for i in `seq 1 5`;do cat < ~/scratch/data/enwik9 | ~/work/samtools_master/htslib/bgzip -l5 -@64 |pv > /dev/null;done
 323MiB 0:00:00 [ 373MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 397MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 410MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 398MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 402MiB/s] [ <=>                                                                        ]

Commit e4957188 with hfile_set_blksize(f_src, 1e6)

@ ; for i in `seq 1 5`;do cat < ~/scratch/data/enwik9 | ~/work/samtools_master/htslib/bgzip -l5 -@64 |pv > /dev/null;done
 323MiB 0:00:00 [ 482MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 431MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 482MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 504MiB/s] [ <=>                                                                        ]
 323MiB 0:00:00 [ 422MiB/s] [ <=>                                                                        ]

Obviously there's some variability, but it basically recovers the lost speed. (Adding --binary didn't make much difference.)

koriege commented 2 months ago

Thank you for figuring this out! Does your pull request, which increases the block size to 1MB, affect the binary/legacy mode only?

jkbonfield commented 2 months ago

No, it modifies the hfile buffer regardless of binary/text mode. The speed difference between the two modes isn't that significant compared to the overhead from the buffering.

Further diagnosis shows the problem is down to pipes. Hfile does a stat call on the fd to determine the filesystem's recommended buffer size; however, for a pipe this comes back as 4KB. Using strace I can see it mixing big reads (likely direct reads into the destination buffer) with small reads (<= 4KB), which are presumably refilling the hfile buffer. Previously bgzip always made direct read() calls, but those were consistently around 64KB and so gave good enough performance. In short, the change to use hread led to maybe 50% more underlying read system calls when reading from a pipe. This was an unintended consequence.
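As a rough standalone illustration of that detection (not the actual hfile.c code), the preferred I/O size comes from fstat()'s st_blksize, and for a pipe that is typically 4096 bytes:

/* Illustration: what st_blksize reports for different kinds of fd. */
#include <stdio.h>
#include <sys/stat.h>

static size_t preferred_bufsize(int fd, size_t fallback) {
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_blksize > 0)
        return (size_t) st.st_blksize;   /* typically 4096 for a pipe on Linux */
    return fallback;
}

int main(void) {
    /* Compare e.g. `cat file | ./a.out` against `./a.out < file`. */
    printf("preferred buffer size for stdin: %zu bytes\n", preferred_bufsize(0, 32768));
    return 0;
}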

Htslib should maybe just have a larger minimum buffer size so it doesn't reduce the buffer to something so tiny. 4KB block sizes are a real anachronism, harking back to the bygone era of early unixen.

The other side of this is that even when the filesystem recommends a 4MB buffer size (as Lustre does), there is a hard cutoff of 32KB in htslib. This limit was added over concerns about high memory usage when doing many-way file merges with lots of file handles open. I have some sympathy for that, however having the maximum default detected block size be smaller than a typical single BGZF block seems inefficient, as we just end up doing multiple reads anyway in order to uncompress an entire BGZF block. Our in-flight I/O memory is therefore probably around 400KB for BAM/VCF.gz, making the hfile layer largely an irrelevance. I'm tempted to say minimum 64KB, maximum 1MB, with the option of selectively increasing it as desired (e.g. in bgzip).
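A sketch of that suggested clamp (just the policy being proposed, not an htslib patch; the macro names are invented for the sketch):

/* Clamp the stat()-detected size: never below 64KB (pipes report 4KB),
 * never above 1MB (Lustre can recommend 4MB), while still allowing callers
 * such as bgzip to raise it explicitly afterwards. */
#include <stddef.h>

#define HFILE_MIN_BLKSIZE (64u * 1024)
#define HFILE_MAX_BLKSIZE (1024u * 1024)

static size_t clamp_blksize(size_t detected) {
    if (detected < HFILE_MIN_BLKSIZE) return HFILE_MIN_BLKSIZE;
    if (detected > HFILE_MAX_BLKSIZE) return HFILE_MAX_BLKSIZE;
    return detected;
}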

Maybe @jmarshall has views on this, as most of this code originated in his era and was perhaps mostly written by him.

Edit: oops, I got my maths wrong. 64KB is the uncompressed size of a BGZF block, not the compressed size, so it's not as stark as it looks for current memory usage. Nevertheless, I think honouring the 4KB that the system stat returns for a pipe is way too small.