Closed jkbonfield closed 1 year ago
Local test file = ~/scratch/data/novaseq.10m.bam
Dev samtools view -O CRAM,version=3.1,archive
real 0m31.351s
user 4m1.710s
sys 0m4.455s
BLOCK 11 331576651 25808364 7.78% tok3-rans RN
BLOCK 11 63157368 4589064 7.27% tok3-arith RN
-rw-r--r-- 1 jkb team117 162450092 Feb 3 15:16 /tmp/_.cram
The samtools cram-size output above shows the problem:
most of the name-tokenised (RN) data has reverted to rANS.
New samtools view -O CRAM,version=3.1,archive
real 0m34.685s
user 4m27.206s
sys 0m5.371s
BLOCK 11 394734019 28611420 7.25% tok3-arith RN
-rw-r--r-- 1 jkb team117 160661810 Feb 3 15:20 /tmp/_.cram
This fixes the accidental switch from tok3-arith to tok3-rans. It costs more CPU, but RN takes about 6% less space and the total CRAM is 1.1% smaller than with dev.
New samtools view -O CRAM,version=3.1,archive; + htscodecs PR#73
real 0m34.508s
user 4m25.810s
sys 0m4.880s
BLOCK 11 394734019 26432333 6.70% tok3-arith RN
-rw-r--r-- 1 jkb team117 158475714 Feb 3 15:19 /tmp/_.cram
With htscodecs PR#73 the RN blocks are 7.6% smaller than without it, which makes the archive CRAM 2.45% smaller than on the dev branch. The extra CPU is now actually useful. (CPU isn't the major factor for archive mode anyway.)
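For anyone wanting to check the percentages quoted here, the arithmetic can be reproduced from the byte counts in the cram-size and ls output above. This is only a small sketch; the variable names and the pct_saving helper are made up for illustration, and the numbers are copied from the logs.

```python
# Byte counts copied from the cram-size (RN blocks) and ls (file size) output.
dev_rn = 25808364 + 4589064    # dev: tok3-rans block + tok3-arith block
new_rn = 28611420              # new: all tok3-arith
pr73_rn = 26432333             # new + htscodecs PR#73

dev_total = 162450092
new_total = 160661810
pr73_total = 158475714


def pct_saving(before: int, after: int) -> float:
    """Percentage reduction going from 'before' bytes to 'after' bytes."""
    return 100.0 * (before - after) / before


savings = {
    "RN, new vs dev": pct_saving(dev_rn, new_rn),
    "total, new vs dev": pct_saving(dev_total, new_total),
    "RN, PR#73 vs new": pct_saving(new_rn, pr73_rn),
    "total, PR#73 vs dev": pct_saving(dev_total, pr73_total),
}

for label, value in savings.items():
    print(f"{label}: {value:.2f}%")
```

Note that the two RN figures use different baselines: the roughly 6% saving is measured against dev, while the 7.6% saving is measured against the fixed build without PR#73.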
The name tokeniser takes the choice of rANS vs arithmetic coder as a parameter (in the "strat" variable). We did not keep this distinction when learning which method works best, so when choosing between toka (tok3+arith), bzip2, gzip, etc. we recorded the winner as plain tok3 and switched back to strat 0, which disabled the arithmetic coder.
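The failure mode can be illustrated with a toy model. The sketch below is not the htslib code: the CANDIDATES table, learn_buggy and learn_fixed are hypothetical names, and zlib levels 1 and 9 merely stand in for tok3 with rANS (strat 0) and tok3 with the arithmetic coder (strat 1). It only shows how remembering the method name without its parameter silently falls back to the default.

```python
import bz2
import zlib

# Hypothetical candidate table keyed by (method name, strat).
# The zlib levels are stand-ins for the two tok3 entropy coders,
# NOT the real codecs.
CANDIDATES = {
    ("tok3", 0): lambda data: zlib.compress(data, 1),   # stand-in: tok3 + rANS
    ("tok3", 1): lambda data: zlib.compress(data, 9),   # stand-in: tok3 + arith
    ("bzip2", 0): lambda data: bz2.compress(data),
}


def trial_sizes(sample: bytes) -> dict:
    """Compress the sample with every candidate and return the sizes."""
    return {key: len(fn(sample)) for key, fn in CANDIDATES.items()}


def learn_buggy(sample: bytes) -> tuple:
    """Remember only the winning method name; strat reverts to 0."""
    sizes = trial_sizes(sample)
    method, _strat = min(sizes, key=sizes.get)
    return (method, 0)


def learn_fixed(sample: bytes) -> tuple:
    """Remember the winning method together with its strat."""
    sizes = trial_sizes(sample)
    return min(sizes, key=sizes.get)


# Made-up read names, loosely shaped like Illumina names.
sample = b"".join(
    b"A00123:45:HXXXXXXX:1:1101:%d:%d\n" % (1000 + i, 2000 + 7 * i)
    for i in range(5000)
)

buggy = learn_buggy(sample)
fixed = learn_fixed(sample)
print("buggy choice:", buggy, len(CANDIDATES[buggy](sample)), "bytes")
print("fixed choice:", fixed, len(CANDIDATES[fixed](sample)), "bytes")
```

In this model the fixed learner can never do worse than the buggy one, since it keeps whichever (method, strat) pair gave the smallest trial output.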
This only affects archive mode, or cases where a user explicitly requested the arithmetic coder, e.g. "samtools view -O cram,version=3.1,use_arith".