samtools / htslib

C library for high-throughput sequencing data formats
Other
789 stars 447 forks source link

Fix Cram compression container substitution matrix generation. #1562

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago

The matrix is meant to turn ref + code into seq. Eg ref C and BS code 1 may mean seq is T. Instead of writing the codes for the non-ref bases in order ACGTN, we wrote the Nth base number in numerical order of the codes.

For ref C + BS code we have 4 alternatives A,G,T and N (C->C is absent as it's not a substitution). So e.g. we may have C: 0=G 1=T 2=A 3=N. We were writing GTAN as 01 10 00 11, from A(c)GTN. We should have been writing the code numbers in A(c)GTN order hence 10 00 01 11.

However, we don't actually change or optimise this in htslib, so it's hard coded in cram_structs.h.

#define CRAM_SUBST_MATRIX "CGTNAGTNACTNACGNACGT"

Reformatting it's:

A: CGTN
C:A GTN
G:AC TN
T:ACG N
N:ACGT

That basically boils down to 0123 (00 01 10 11 or 0x3b) for all rows. The incorrect order of writing the table made no difference as every row is sorted by both code 0,1,2,3 and nucleotide A,C,G,T,N.

jkbonfield commented 1 year ago

And now we can do e.g. this change to the constant (not even optimised per data set) table, which prefers to keep AT and GC mutations with the same code and is symmetric.

diff --git a/cram/cram_structs.h b/cram/cram_structs.h
index cbb226b..7df52f3 100644
--- a/cram/cram_structs.h
+++ b/cram/cram_structs.h
@@ -88,7 +88,7 @@ struct hFILE;
 #define BASES_PER_SLICE (SEQS_PER_SLICE*500)
 #define SLICE_PER_CNT  1

-#define CRAM_SUBST_MATRIX "CGTNAGTNACTNACGNACGT"
+#define CRAM_SUBST_MATRIX "TCGNGATNCTANAGCNACGT"

 #define MAX_STAT_VAL 1024

Tested with HiSeqX, NovaSeq, ONT (R10.4), Revio, Ultima and Element this gives improvements to BS compression ratio across the board. It varied between 2.4% and 10.7%. Not a PR yet as I don't know what the general best strategy is and maybe auto-tuning is correct instead, but it's basically zero cost. Overall file size reduction is much smaller though as we're not usually dominanted by BS data series.

daviesrob commented 1 year ago

The change looks good, although I see that the back-stop loop termination in sub_idx() no longer works because h->substitution_matrix is not NUL-terminated. Of course it should never be hit anyway because the content of the substitution matrix and the values searched for are both fully controlled by HTSlib, but might it be good to change *key to i < 4 anyway?

jkbonfield commented 1 year ago

The change looks good, although I see that the back-stop loop termination in sub_idx() no longer works because h->substitution_matrix is not NUL-terminated. Of course it should never be hit anyway because the content of the substitution matrix and the values searched for are both fully controlled by HTSlib, but might it be good to change *key to i < 4 anyway?

Can make that trivial change.

Should we also consider just changing the #define to a better one? I've been doing some randomised+hill-climb trials this morning and it seems to consistently pick another matrix for working better. I was just about to try it out on a larger range of instrument types.

Something like optimising it per container is quite a bit of work, but simply changing the default matrix has absolutely zero CPU cost and if it's on-average a win then it's easy to do.

daviesrob commented 1 year ago

I expect that depends on how universal the result turns out to be.

jkbonfield commented 1 year ago

Some data for human GRCh38 alignments, from a mixture of technologies and a mixture of regions of chr1. So not a full analysis, but a broad spectrum at least (as far as human goes). Almost certainly we could do better on extreme things like malaria, but the previous matrix was totally arbitrary too.

CRAM 3.0, normal mode

CGTNAGTNACTNACGNACGT (default)
BLOCK       31      1696098       340253  20.06% rR      BS hiseqx pcr- 
BLOCK       31      1456710       293218  20.13% rR      BS hiseqx pcr+ 
BLOCK       31      1182106       221575  18.74% rR      BS novoseq pcr-
BLOCK       31      1351779       248701  18.40% rR      BS novaseq pcr+
BLOCK       31      1852956       370204  19.98% r.      BS aviti           
BLOCK       31      1118769       218360  19.52% rR      BS ont     
BLOCK       31       238916        33304  13.94% gr      BS revio           
BLOCK       31       138251        27440  19.85% rR      BS ultima          

TCGNGATNCTANAGCNACGT (symmetric; A<>T and C<>G)
BLOCK       31      1696098       322227  19.00% rR      BS hiseqx pcr- 
BLOCK       31      1456710       285670  19.61% rR      BS hiseqx pcr+ 
BLOCK       31      1182106       208140  17.61% Rr      BS novoseq pcr-
BLOCK       31      1351779       236441  17.49% Rr      BS novaseq pcr+
BLOCK       31      1852956       361812  19.53% r.      BS aviti           
BLOCK       31      1118769       201984  18.05% r       BS ont     
BLOCK       31       238916        32828  13.74% gr      BS revio           
BLOCK       31       138251        26924  19.47% rR      BS ultima          

CGTNGTANCATNGCANACGT
BLOCK       31      1696098       325672  19.20% rR      BS hiseqx pcr- 
BLOCK       31      1456710       285539  19.60% rR      BS hiseqx pcr+ 
BLOCK       31      1182106       197163  16.68% R       BS novoseq pcr-
BLOCK       31      1351779       216459  16.01% R       BS novaseq pcr+
BLOCK       31      1852956       372520  20.10% r.      BS aviti           
BLOCK       31      1118769       200739  17.94% r       BS ont     
BLOCK       31       238916        32728  13.70% gr      BS revio           
BLOCK       31       138251        26203  18.95% R       BS ultima          

A more aggressive compression via CRAM 3.1,small,use_arith,level=7

CGTNAGTNACTNACGNACGT version=3.1,small,use_arith,level=7
BLOCK       31      1696098       328442  19.36% 1A      BS  hiseqx pcr- 
BLOCK       31      1456710       283135  19.44% 1A      BS  hiseqx pcr+ 
BLOCK       31      1182106       209511  17.72% 1Aa     BS  novoseq pcr-
BLOCK       31      1351779       236854  17.52% 1Aa     BS  novaseq pcr+
BLOCK       31      1852956       365697  19.74% 0Aa     BS  aviti       
BLOCK       31      1118769       216933  19.39% A0      BS  ont        
BLOCK       31       238916        33269  13.92% G       BS  revio       
BLOCK       31       138251        26864  19.43% 0A      BS  ultima      

TCGNGATNCTANAGCNACGT version=3.1,small,use_arith,level=7
BLOCK       31      1696098       315985  18.63% 01A     BS  hiseqx pcr- 
BLOCK       31      1456710       278366  19.11% 1A      BS  hiseqx pcr+ 
BLOCK       31      1182106       197448  16.70% 1A      BS  novoseq pcr-
BLOCK       31      1351779       225921  16.71% 1A      BS  novaseq pcr+
BLOCK       31      1852956       357493  19.29% 0Aa     BS  aviti       
BLOCK       31      1118769       201114  17.98% a0      BS  ont        
BLOCK       31       238916        32439  13.58% G       BS  revio       
BLOCK       31       138251        26087  18.87% A       BS  ultima      

CGTNGTANCATNGCANACGT version=3.1,small,use_arith,level=7
BLOCK       31      1696098       318739  18.79% 01A     BS  hiseqx pcr- 
BLOCK       31      1456710       279371  19.18% 01A     BS  hiseqx pcr+ 
BLOCK       31      1182106       190371  16.10% 1A      BS  novoseq pcr-
BLOCK       31      1351779       209635  15.51% 1A      BS  novaseq pcr+
BLOCK       31      1852956       367808  19.85% 0Aa     BS  aviti       
BLOCK       31      1118769       199859  17.86% a0      BS  ont        
BLOCK       31       238916        32282  13.51% G       BS  revio       
BLOCK       31       138251        25423  18.39% A       BS  ultima      

The 3 matrices look like:

CGTNAGTNACTNACGNACGT
A : C G T N
C : A G T N
G : A C T N
T : A C G N
N : A C G T

TCGNGATNCTANAGCNACGT
A : T C G N
C : G A T N
G : C T A N
T : A G C N
N : A C G T

TCGNAGTNTCANAGCNACGT
A : T C G N
C : A G T N
G : T C A N
T : A G C N
N : A C G T

So the default is just an ACGT with the ref base absent. It generally clusters by nucleotide so BS=0 is mostly A, BS=1 is mostly C.

The second one I tried, which I quoted above, was just a manual attempt at getting A<>T and C<>G together and having complementarity for the other columns too. It ends up with ACGT in every column (ie BS=0, or in BS=1).

The third matrix was determined by repeated random trials and hill climbing matrix element swapping on a mixed data set. It appears to have chosen columns that are all A/T or C/G for 2 of the 3 columns (the best it can do as you cannot get this for BS 0, 1 and 2). So it seems likely this is matching regions together that are AT-rich or GC-rich.

It's noticable that the CRAM 3.0 compression switches from order-0 across the board to sometimes preferring order-1 codecs (particularly novaseq). I saw that at compress level 9 (not shown) too, but the difference was small so likely the auto-tuning didn't consider it worth the extra CPU. Letting it use the arithmetic coder generally didn't use it universally, often preferring r4x16 rans.

I'm thinking of the 3rd option, although AVITI is marginally ahead with the 2nd. It's not a huge difference there though, and novaseq is probably still dominant in quantity and it has a big preference for the 3rd. We're also less likely to make any meaningful difference to file size on the platforms that don't aggressively quantise their quality values, as the BS data series becomes a tiny percentage of overall file size.

It's still small fry and icing on the cake frankly. For CRAM v3.1 on Novaseq PCR-free test set, the matrix change saved 9.2% for BS, but that's just 0.06% of the CRAM file size. For CRAM 3.0 it was 11% and 0.07%. Not exactly stellar!

daviesrob commented 1 year ago

I've checked that picard and old samtools understand the revised matrix, so it should be OK.