samtools / htslib

C library for high-throughput sequencing data formats

Add a mapping of Zlib to Libdeflate compression levels for BGZF. #1488

Closed · jkbonfield closed 2 years ago

jkbonfield commented 2 years ago

Libdeflate goes up to compression level 12, with the last three levels using a much slower optimal-parsing technique. We reserve bgzip levels 8 and 9 for two of these slow modes (libdeflate 10 and 12), and spread the remaining bgzip levels across libdeflate levels 1-8.

We map bgzip levels 1-9 to libdeflate levels 1, 2, 3, 5, 6, 7, 8, 10, 12. This was chosen so that the files are generally smaller than their zlib counterparts while still being faster to produce (except at levels 8 and 9, which use the slow modes noted above).
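A minimal sketch of what the mapping amounts to, in C (the function name and the treatment of a negative/default level are illustrative assumptions, not necessarily what the patch itself does):

    static int bgzf2libdeflate_level(int level)
    {
        /* bgzip/zlib-style levels 0-9 -> libdeflate levels 0-12 */
        static const int map[10] = { 0, 1, 2, 3, 5, 6, 7, 8, 10, 12 };
        if (level < 0) level = 6;   /* assumption: treat "default" (-1) as 6 */
        if (level > 9) level = 9;
        return map[level];
    }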

This is based on benchmarks (see below) for various data sets.

Hence users will find bgzip -l8 and -l9 considerably slower than before. Ideally we'd support bgzip -l10 to -l12, but that complicates several tools and the htslib mode/format string, which assumes the level is encoded as a single character (level + '0') in various places (not just in the library, but in the command-line tools too). This was the simpler and safer option. Realistically no one uses level 9 unless they want maximum compression, and now they are getting it once again.
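To illustrate the single-character constraint (the helper below is hypothetical, not htslib code): an hts_open()-style mode string carries the compression level as one digit appended to the mode, e.g. "wb5", so levels 10-12 cannot be expressed without changing that convention everywhere it is assumed.

    #include <stdio.h>

    /* Hypothetical helper: build a BAM write-mode string such as "wb5".
     * The level occupies a single character ('0' + level), so this is
     * only valid for levels 0-9; 10-12 would not fit the scheme. */
    static void make_bam_write_mode(char *mode, size_t n, int level)
    {
        snprintf(mode, n, "wb%c", '0' + level);
    }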

Fixes #1477

CPU time (threaded, but total user CPU via time -f "%U") and file size in bytes. In each table the columns are: level, libdeflate time and size, zlib time and size, and finally the lowest libdeflate level that already matches (~) or beats (>) the zlib file size at that level.

1GB of Illumina NovaSeq BAM (NovaSeq.10m.bam)
             Libdeflate              Zlib
      0      0.99    1000474917      0.68    1000474917
      1      9.80    183521324       20.29   213827245       >1
      2      14.17   179046201       21.87   205485380       >1
      3      15.20   175877610       26.67   195469541       >1
      4      16.28   172991407       29.80   176019215       ~3
      5      19.36   169087724       38.52   169202888       ~4
      6      23.27   165900144       56.30   164719424       >7
      7      32.50   163766923       72.45   163327258       ~7
      8      57.16   161643808       148.61  160866537       ~9
      9      74.91   160953697       295.37  159689582       >10
      10     303.28  157126803
      11     477.66  155323612
      12     659.36  153756096

As an experiment I added zstd here too, with various block sizes (or none at all), at levels 9, 12 and 19:

    -9 (unblocked)  146938203
    -B1048576 -b9   149801386
    -B65536 -b9     160527533 (in ~29s)
    -B65536 -b12    157251800 (best 64k blocked zstd, in ~95s)
    -B65536 -b19    144923698 (best 64k blocked zstd, in ~617s)
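For reference, the "64k blocked" zstd runs above amount to compressing each 64KiB chunk independently. A minimal sketch using libzstd's one-shot API (purely illustrative, not part of htslib/BGZF; the helper name is mine):

    #include <stddef.h>
    #include <zstd.h>

    /* Compress one chunk (e.g. 64KiB) independently at the given level.
     * dst_cap should be at least ZSTD_compressBound(src_len).
     * Returns the compressed size, or 0 on error. */
    static size_t compress_block(const void *src, size_t src_len,
                                 void *dst, size_t dst_cap, int level)
    {
        size_t n = ZSTD_compress(dst, dst_cap, src, src_len, level);
        return ZSTD_isError(n) ? 0 : n;
    }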

1GB of ~4000-sample VCF (ALL.chr20.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz)

       libdeflate               zlib
0      0.66     1000474917      0.70    1000474917
1      4.29     23702390        6.50    31624075        >1
2      6.15     22689795        6.96    30023229        >1
3      6.60     22086968        8.37    28216644        >1
4      6.89     21741126        13.88   21861705        ~4
5      7.26     21312709        15.91   20967100        >6
6      8.29     20575904        21.01   19966632        ~7
7      11.67    19817402        27.06   19529414        >8
8      20.15    19082333        59.78   18349376        >10
9      26.82    18536813        104.85  17957219        >10
10     104.32   17748987
11     168.52   17297625
12     273.45   16916786

1GB of single-sample VCF with many verbose INFO fields (HG002_GRCh38_1_22_v4.2_benchmark.vcf.gz)

       libdeflate               zlib
0      0.66     1000474917      0.73    1000474917
1      5.17     89779592        9.36    88208451        >2
2      7.87     78071190        9.09    76125060        >3
3      9.16     70155649        8.91    71063897        >3
4      8.00     61555631        17.29   67974286        >4
5      9.26     59090268        18.28   59819372        >5
6      11.24    56259793        20.25   55524441        >7
7      13.33    53421543        21.65   54944394        >7
8      19.12    51953892        26.07   53725388        >7
9      21.17    51870998        28.17   53714621        >7
10     140.56   52822252
11     240.26   50724685
12     452.39   50135214

There is some oddity here: libdeflate level 10 produces a larger file than level 8 while still being much slower! This is probably some quirk of the excessive data redundancy.

This data really shows the benefit of zstd instead. Such highly redundant data compresses hugely faster with zstd: zstd -9 takes approx 8s to encode (when using 64KB blocks) for a size of 49792035 bytes, i.e. roughly libdeflate level 5 speed at a size smaller than any libdeflate level achieves. With block sizes of 1MB that drops from ~50MB to ~39MB too. (Zstd doesn't help nearly as much on the other data sets, so this is likely an excessive-redundancy thing.)


Single-sample GIAB chr1 bcftools output (85MB worth); more succinct

       libdeflate               zlib
0      0.04     85086248        0.08    85086248
1      0.84     18635065        1.61    21071717        >1
2      1.38     17914684        1.84    20227900        >1
3      1.44     17581996        2.05    19477263        >1
4      1.57     17231445        2.58    18139642        >2
5      1.87     16065221        3.06    16855325        >5
6      2.24     15550661        4.05    16433775        >5
7      2.90     15265882        4.63    16084926        >5
8      4.99     14650624        7.45    15784360        >6
9      5.86     14615736        7.49    15778425        >6
10     21.15    14238504
11     29.47    14188808
12     35.88    14180049

1GB of R10 ONT fastq

       libdeflate               zlib
0      0.60     1000474917      0.68    1000474917
1      17.48    507151512       39.93   522788722       >1
2      27.29    490892251       42.81   513743184       >1
3      32.00    486330031       52.95   506652197       ~1
4      38.66    483198216       54.37   501296019       >2
5      43.60    479970547       78.50   498174825       >2
6      60.20    478146566       137.41  494118811       >2
7      87.72    476998683       196.25  493161028       >2
8      109.78   476648496       249.70  493162311       >2
9      110.21   476658031       249.02  493162210       >2
10     213.13   459363967
11     250.79   457698243
12     287.86   457132946
jkbonfield commented 2 years ago

Also curious is how BCF compresses much more poorly than VCF for single-sample data (but better on multi-sample).

Compressed BCF is 15% and 12% larger than compressed VCF at bgzf level 7 for the single-sample data, but VCF.gz is 19% larger than BCF for the 4000-sample data. It's not clear why this is so. (Speed-wise, BCF was 30-90% faster to encode, and likely does better still on decode.)