jkbonfield closed 2 years ago
Also curious is how BCF is a much poorer format to compress for single sample data than VCF (but better on multi-sample).
Compressed BCF is 15% and 12% larger than compressed VCF at bgzf level 7 for the two single-sample data sets, but VCF.gz is 19% larger than BCF for the 4000-sample data. It's not clear why this is so. (Speed-wise, BCF was 30-90% faster to encode, and is likely better still on decode.)
Libdeflate goes up to compression level 12, with the last 3 levels using a much slower optimal parsing technique. We reserve bgzip levels 8 and 9 for two of these slow modes, and spread the remaining bgzip levels out across libdeflate levels 1-9.
We map bgzip levels 1-9 to libdeflate levels 1, 2, 3, 5, 6, 7, 8, 10, 12 respectively. This was designed so that the files are generally smaller than their zlib counterparts while still being faster to produce (except for zlib levels 8 and 9 as noted above).
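The mapping can be sketched as a small lookup table. This is only an illustration of the numbers quoted above, not the code in this PR: the function name `bgzf_to_libdeflate_level` is hypothetical, and returning -1 for levels outside 1-9 is an assumption of the sketch (level 0 and the default level are not covered by the mapping described here).

```c
/*
 * Hypothetical helper (not the htslib implementation): translate a
 * bgzip level in the range 1-9 into the libdeflate level quoted in the text.
 * Returns -1 for anything outside 1-9; that choice is an assumption
 * made for this sketch only.
 */
static int bgzf_to_libdeflate_level(int level) {
    /* Index 0 is unused; indices 1-9 hold 1,2,3, 5,6,7,8, 10,12. */
    static const int level_map[10] = { -1, 1, 2, 3, 5, 6, 7, 8, 10, 12 };

    if (level < 1 || level > 9)
        return -1;

    return level_map[level];
}
```

With this table, levels 1-7 stay within libdeflate's faster modes (1-8), and only levels 8 and 9 reach the slow optimal-parsing modes (10 and 12).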
This is based on benchmarks (see below) for various data sets.
Hence users will find bgzip -l8 and -l9 considerably slower than before. Ideally we'd support bgzip -l10 to -l12, but this complicates several tools and the htslib format string, which assumes the level is encoded as the single character level+'0' in various places (not just the library, but also the command line tools). Remapping within 1-9 was the simpler and safer option. Realistically no one uses level 9 unless they want maximum compression, and now they're getting it once again.
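To illustrate why levels above 9 are awkward, here is a minimal sketch of the level+'0' encoding. The helper names are hypothetical and this is not htslib's code; it only shows that the arithmetic yields a digit character for 0-9 and a non-digit for 10-12 (in ASCII).

```c
/*
 * Hypothetical helpers (not htslib code): encode a compression level
 * as one character the way a "level + '0'" scheme would.
 */
static char level_to_mode_char(int level) {
    return (char)(level + '0');
}

/* Returns 1 if the encoded character is still a decimal digit, else 0. */
static int level_fits_one_digit(int level) {
    char c = level_to_mode_char(level);
    return c >= '0' && c <= '9';
}
```

Level 9 encodes as '9', but level 10 encodes as ':' and level 12 as '<', so any parser that expects a single digit would need changing everywhere the level character is read or written.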
Fixes #1477
CPU time (threaded, but total user CPU via time -f "%U") and file size in bytes.
As an experiment I added zstd here too, with various block sizes or none at all, at levels 9, 12 and 19:
1GB of ~4000 sample VCF ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
1GB of 1 sample VCF; many verbose INFO fields HG002_GRCh38_1_22_v4.2_benchmark.vcf.gz
There is some oddity here with libdeflate level 10 being poorer than level 8 while still being much slower! This is probably some quirk of excessive data redundancy.
This data really shows the benefit of zstd instead. Such highly redundant data is hugely faster with zstd: zstd -9 takes approx 8s to encode (when using 64KB blocks) at a size of 49792035 bytes, so roughly libdeflate level 5 speed at a better-than-libdeflate size. With block sizes of 1MB the size drops from ~50MB to ~39MB too. (Zstd doesn't help nearly as much on the other data sets, so this is likely an excessive redundancy thing.)
Single sample GIAB chr1 bcftools output (85MB worth); more succinct
1GB of R10 ONT fastq