samtools / htslib

C library for high-throughput sequencing data formats

Fix a bug in the codec learning algorithm for TOKA #1559

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago

The name tokeniser takes a rANS vs arithmetic coder choice as a parameter (the "strat" variable). The code that learns which method works best did not track this distinction, so when comparing toka (tok3+arith) against bzip2, gzip, etc., a win for toka was recorded simply as tok3 and we switched back to strat 0, disabling the arithmetic coder.

This only affects archive mode, or cases where a user explicitly requested the arithmetic coder, e.g. "samtools view -O cram,version=3.1,use_arith".
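
The selection bug described above can be illustrated with a small sketch. This is not htslib's code: `Candidate`, `learn_buggy`, `learn_fixed`, the trial sizes, and the assumption that strat 1 means "arithmetic coder" are all hypothetical, chosen only to show how dropping the strat from the learned result silently reverts to rANS.

```python
from dataclasses import dataclass

# Hypothetical labels for illustration only; htslib's real identifiers differ.
@dataclass(frozen=True)
class Candidate:
    method: str  # e.g. "gzip", "bzip2", "tok3"
    strat: int   # 0 = rANS back end; 1 = arithmetic coder (assumed encoding)

def learn_buggy(trial_sizes):
    """Pick the smallest trial but remember only the method name."""
    best = min(trial_sizes, key=trial_sizes.get)
    # The strat is dropped here, so later blocks are encoded with strat 0.
    return Candidate(best.method, 0)

def learn_fixed(trial_sizes):
    """Pick the smallest trial and keep its strat alongside the method."""
    return min(trial_sizes, key=trial_sizes.get)

# Made-up compressed sizes for one trial block.
trials = {
    Candidate("gzip", 0): 900,
    Candidate("bzip2", 0): 700,
    Candidate("tok3", 1): 500,  # toka = tok3 + arithmetic coder
}

buggy = learn_buggy(trials)
fixed = learn_fixed(trials)
print("buggy:", buggy)
print("fixed:", fixed)
```

In the buggy variant the winner is still tok3, which is why the problem only shows up as a codec change in the block statistics and not as a failure.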

jkbonfield commented 1 year ago

Local test file = ~/scratch/data/novaseq.10m.bam

Dev samtools view -O CRAM,version=3.1,archive

  real  0m31.351s
  user  4m1.710s
  sys   0m4.455s
  BLOCK       11    331576651     25808364   7.78% tok3-rans   RN
  BLOCK       11     63157368      4589064   7.27% tok3-arith  RN
  -rw-r--r-- 1 jkb team117 162450092 Feb  3 15:16 /tmp/_.cram

The samtools cram-size output above shows the problem: most of the name-tokenised data has reverted to rANS.
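
To put a number on "most", the two RN lines from the dev run can be summed per codec. This is a rough sketch; the column order (tag, content id, uncompressed bytes, compressed bytes, ratio, codec, data series) is inferred from the figures shown here, not taken from the cram-size documentation.

```python
from collections import defaultdict

# The two RN lines from the dev-branch cram-size output above.
report = """\
BLOCK       11    331576651     25808364   7.78% tok3-rans   RN
BLOCK       11     63157368      4589064   7.27% tok3-arith  RN
"""

raw = defaultdict(int)     # uncompressed bytes per codec
packed = defaultdict(int)  # compressed bytes per codec
for line in report.splitlines():
    # Assumed columns: tag, content id, uncompressed, compressed, ratio, codec, series
    _, _, usize, csize, _, codec, series = line.split()
    if series == "RN":
        raw[codec] += int(usize)
        packed[codec] += int(csize)

total_raw = sum(raw.values())
for codec in raw:
    share = 100 * raw[codec] / total_raw
    ratio = 100 * packed[codec] / raw[codec]
    print(f"{codec}: {share:.1f}% of RN input, compressed to {ratio:.2f}%")
```

On these figures roughly 84% of the read-name bytes went through tok3-rans on the dev branch.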

New samtools view -O CRAM,version=3.1,archive

  real  0m34.685s
  user  4m27.206s
  sys   0m5.371s
  BLOCK       11    394734019     28611420   7.25% tok3-arith  RN
  -rw-r--r-- 1 jkb team117 160661810 Feb  3 15:20 /tmp/_.cram

This fixes the accidental switch from tok3-arith to tok3-rans. It costs more CPU, but RN takes about 6% less space and the total CRAM is 1.1% smaller.

New samtools view -O CRAM,version=3.1,archive; + htscodecs PR#73

  real  0m34.508s
  user  4m25.810s
  sys   0m4.880s
  BLOCK       11    394734019     26432333   6.70% tok3-arith  RN
  -rw-r--r-- 1 jkb team117 158475714 Feb  3 15:19 /tmp/_.cramr

With htscodecs PR#73, RN blocks are a further 7.6% smaller than with this fix alone, making the archive CRAM 2.45% smaller than on the dev branch. The extra CPU is now actually useful. (CPU isn't the major factor for archive mode anyway.)
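
As a sanity check, the quoted percentages can be recomputed from the byte counts in the three runs above. The `saving` helper and the dictionary keys are names made up for this sketch.

```python
def saving(before, after):
    """Percentage reduction going from `before` bytes to `after` bytes."""
    return 100 * (before - after) / before

# Compressed RN bytes and total file sizes copied from the three runs above.
rn = {"dev": 25808364 + 4589064, "fix": 28611420, "fix+pr73": 26432333}
cram = {"dev": 162450092, "fix": 160661810, "fix+pr73": 158475714}

print(f"RN,   dev -> fix:       {saving(rn['dev'], rn['fix']):.2f}%")
print(f"CRAM, dev -> fix:       {saving(cram['dev'], cram['fix']):.2f}%")
print(f"RN,   fix -> fix+pr73:  {saving(rn['fix'], rn['fix+pr73']):.2f}%")
print(f"CRAM, dev -> fix+pr73:  {saving(cram['dev'], cram['fix+pr73']):.2f}%")
```

Note that the 7.6% figure compares against this fix alone, while the 2.45% figure compares against the dev branch.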