Closed jkbonfield closed 1 year ago
And now we can do e.g. this change to the constant (not even optimised per data set) table, which prefers to keep AT and GC mutations with the same code and is symmetric.
diff --git a/cram/cram_structs.h b/cram/cram_structs.h
index cbb226b..7df52f3 100644
--- a/cram/cram_structs.h
+++ b/cram/cram_structs.h
@@ -88,7 +88,7 @@ struct hFILE;
#define BASES_PER_SLICE (SEQS_PER_SLICE*500)
#define SLICE_PER_CNT 1
-#define CRAM_SUBST_MATRIX "CGTNAGTNACTNACGNACGT"
+#define CRAM_SUBST_MATRIX "TCGNGATNCTANAGCNACGT"
#define MAX_STAT_VAL 1024
Tested with HiSeqX, NovaSeq, ONT (R10.4), Revio, Ultima and Element this gives improvements to BS compression ratio across the board. It varied between 2.4% and 10.7%. Not a PR yet as I don't know what the general best strategy is and maybe auto-tuning is correct instead, but it's basically zero cost. Overall file size reduction is much smaller though as we're not usually dominanted by BS data series.
The change looks good, although I see that the back-stop loop termination in sub_idx()
no longer works because h->substitution_matrix
is not NUL-terminated. Of course it should never be hit anyway because the content of the substitution matrix and the values searched for are both fully controlled by HTSlib, but might it be good to change *key
to i < 4
anyway?
The change looks good, although I see that the back-stop loop termination in
sub_idx()
no longer works becauseh->substitution_matrix
is not NUL-terminated. Of course it should never be hit anyway because the content of the substitution matrix and the values searched for are both fully controlled by HTSlib, but might it be good to change*key
toi < 4
anyway?
Can make that trivial change.
Should we also consider just changing the #define to a better one? I've been doing some randomised+hill-climb trials this morning and it seems to consistently pick another matrix for working better. I was just about to try it out on a larger range of instrument types.
Something like optimising it per container is quite a bit of work, but simply changing the default matrix has absolutely zero CPU cost and if it's on-average a win then it's easy to do.
I expect that depends on how universal the result turns out to be.
Some data for human GRCh38 alignments, from a mixture of technologies and a mixture of regions of chr1. So not a full analysis, but a broad spectrum at least (as far as human goes). Almost certainly we could do better on extreme things like malaria, but the previous matrix was totally arbitrary too.
CRAM 3.0, normal mode
CGTNAGTNACTNACGNACGT (default)
BLOCK 31 1696098 340253 20.06% rR BS hiseqx pcr-
BLOCK 31 1456710 293218 20.13% rR BS hiseqx pcr+
BLOCK 31 1182106 221575 18.74% rR BS novoseq pcr-
BLOCK 31 1351779 248701 18.40% rR BS novaseq pcr+
BLOCK 31 1852956 370204 19.98% r. BS aviti
BLOCK 31 1118769 218360 19.52% rR BS ont
BLOCK 31 238916 33304 13.94% gr BS revio
BLOCK 31 138251 27440 19.85% rR BS ultima
TCGNGATNCTANAGCNACGT (symmetric; A<>T and C<>G)
BLOCK 31 1696098 322227 19.00% rR BS hiseqx pcr-
BLOCK 31 1456710 285670 19.61% rR BS hiseqx pcr+
BLOCK 31 1182106 208140 17.61% Rr BS novoseq pcr-
BLOCK 31 1351779 236441 17.49% Rr BS novaseq pcr+
BLOCK 31 1852956 361812 19.53% r. BS aviti
BLOCK 31 1118769 201984 18.05% r BS ont
BLOCK 31 238916 32828 13.74% gr BS revio
BLOCK 31 138251 26924 19.47% rR BS ultima
CGTNGTANCATNGCANACGT
BLOCK 31 1696098 325672 19.20% rR BS hiseqx pcr-
BLOCK 31 1456710 285539 19.60% rR BS hiseqx pcr+
BLOCK 31 1182106 197163 16.68% R BS novoseq pcr-
BLOCK 31 1351779 216459 16.01% R BS novaseq pcr+
BLOCK 31 1852956 372520 20.10% r. BS aviti
BLOCK 31 1118769 200739 17.94% r BS ont
BLOCK 31 238916 32728 13.70% gr BS revio
BLOCK 31 138251 26203 18.95% R BS ultima
A more aggressive compression via CRAM 3.1,small,use_arith,level=7
CGTNAGTNACTNACGNACGT version=3.1,small,use_arith,level=7
BLOCK 31 1696098 328442 19.36% 1A BS hiseqx pcr-
BLOCK 31 1456710 283135 19.44% 1A BS hiseqx pcr+
BLOCK 31 1182106 209511 17.72% 1Aa BS novoseq pcr-
BLOCK 31 1351779 236854 17.52% 1Aa BS novaseq pcr+
BLOCK 31 1852956 365697 19.74% 0Aa BS aviti
BLOCK 31 1118769 216933 19.39% A0 BS ont
BLOCK 31 238916 33269 13.92% G BS revio
BLOCK 31 138251 26864 19.43% 0A BS ultima
TCGNGATNCTANAGCNACGT version=3.1,small,use_arith,level=7
BLOCK 31 1696098 315985 18.63% 01A BS hiseqx pcr-
BLOCK 31 1456710 278366 19.11% 1A BS hiseqx pcr+
BLOCK 31 1182106 197448 16.70% 1A BS novoseq pcr-
BLOCK 31 1351779 225921 16.71% 1A BS novaseq pcr+
BLOCK 31 1852956 357493 19.29% 0Aa BS aviti
BLOCK 31 1118769 201114 17.98% a0 BS ont
BLOCK 31 238916 32439 13.58% G BS revio
BLOCK 31 138251 26087 18.87% A BS ultima
CGTNGTANCATNGCANACGT version=3.1,small,use_arith,level=7
BLOCK 31 1696098 318739 18.79% 01A BS hiseqx pcr-
BLOCK 31 1456710 279371 19.18% 01A BS hiseqx pcr+
BLOCK 31 1182106 190371 16.10% 1A BS novoseq pcr-
BLOCK 31 1351779 209635 15.51% 1A BS novaseq pcr+
BLOCK 31 1852956 367808 19.85% 0Aa BS aviti
BLOCK 31 1118769 199859 17.86% a0 BS ont
BLOCK 31 238916 32282 13.51% G BS revio
BLOCK 31 138251 25423 18.39% A BS ultima
The 3 matrices look like:
CGTNAGTNACTNACGNACGT
A : C G T N
C : A G T N
G : A C T N
T : A C G N
N : A C G T
TCGNGATNCTANAGCNACGT
A : T C G N
C : G A T N
G : C T A N
T : A G C N
N : A C G T
TCGNAGTNTCANAGCNACGT
A : T C G N
C : A G T N
G : T C A N
T : A G C N
N : A C G T
So the default is just an ACGT with the ref base absent. It generally clusters by nucleotide so BS=0 is mostly A, BS=1 is mostly C.
The second one I tried, which I quoted above, was just a manual attempt at getting A<>T and C<>G together and having complementarity for the other columns too. It ends up with ACGT in every column (ie BS=0, or in BS=1).
The third matrix was determined by repeated random trials and hill climbing matrix element swapping on a mixed data set. It appears to have chosen columns that are all A/T or C/G for 2 of the 3 columns (the best it can do as you cannot get this for BS 0, 1 and 2). So it seems likely this is matching regions together that are AT-rich or GC-rich.
It's noticable that the CRAM 3.0 compression switches from order-0 across the board to sometimes preferring order-1 codecs (particularly novaseq). I saw that at compress level 9 (not shown) too, but the difference was small so likely the auto-tuning didn't consider it worth the extra CPU. Letting it use the arithmetic coder generally didn't use it universally, often preferring r4x16 rans.
I'm thinking of the 3rd option, although AVITI is marginally ahead with the 2nd. It's not a huge difference there though, and novaseq is probably still dominant in quantity and it has a big preference for the 3rd. We're also less likely to make any meaningful difference to file size on the platforms that don't aggressively quantise their quality values, as the BS data series becomes a tiny percentage of overall file size.
It's still small fry and icing on the cake frankly. For CRAM v3.1 on Novaseq PCR-free test set, the matrix change saved 9.2% for BS, but that's just 0.06% of the CRAM file size. For CRAM 3.0 it was 11% and 0.07%. Not exactly stellar!
I've checked that picard and old samtools understand the revised matrix, so it should be OK.
The matrix is meant to turn ref + code into seq. Eg ref C and BS code 1 may mean seq is T. Instead of writing the codes for the non-ref bases in order ACGTN, we wrote the Nth base number in numerical order of the codes.
For ref C + BS code we have 4 alternatives A,G,T and N (C->C is absent as it's not a substitution). So e.g. we may have C: 0=G 1=T 2=A 3=N. We were writing GTAN as 01 10 00 11, from A(c)GTN. We should have been writing the code numbers in A(c)GTN order hence 10 00 01 11.
However, we don't actually change or optimise this in htslib, so it's hard coded in cram_structs.h.
Reformatting it's:
That basically boils down to 0123 (00 01 10 11 or 0x3b) for all rows. The incorrect order of writing the table made no difference as every row is sorted by both code 0,1,2,3 and nucleotide A,C,G,T,N.