samtools / htsjdk

A Java API for high-throughput sequencing data (HTS) formats.
http://samtools.github.io/htsjdk/
281 stars 242 forks source link

CRAM bug with ambiguous bases #778

Open jkbonfield opened 7 years ago

jkbonfield commented 7 years ago

Verify

I'm using cramtools to drive the htsjdk cram code. I see no warnings present from cramtools.

Subject of the issue

Ambiguous bases in sequence and/or reference decode incorrectly when going via CRAM round-trips. Oddly it does this in a baffling way.

My basic example is this as the fasta file:

>x
ACGTRYMWSKDHVBN
acgtrymwskdhvbn
AAA

My SAM file is a sequence of ACGTRYMWSKDHVBN aligned to coordinates 1 to 30 inclusive. Note that this means some alignments are invalid, going beyond the end of the reference, but this itself is an interesting test case which ought to pass or to emit a warning if it cannot be stored. Hence the SAM file is:

@HD     VN:1.4  SO:coordinate
@SQ     SN:x    LN:33
match1  0       x       1       0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss2   0       x       2       0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss3   0       x       3       0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss4   0       x       4       0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss5   0       x       5       0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss6   0       x       6       0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss7   0       x       7       0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss8   0       x       8       0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss9   0       x       9       0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss10  0       x       10      0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss11  0       x       11      0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss12  0       x       12      0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss13  0       x       13      0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss14  0       x       14      0       15M     *       0       0       ACGTRYMWSKDHVBN *
miss15  0       x       15      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_match1 0       x       16      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss2  0       x       17      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss3  0       x       18      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss4  0       x       19      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss5  0       x       20      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss6  0       x       21      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss7  0       x       22      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss8  0       x       23      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss9  0       x       24      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss10 0       x       25      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss11 0       x       26      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss12 0       x       27      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss13 0       x       28      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss14 0       x       29      0       15M     *       0       0       ACGTRYMWSKDHVBN *
_miss15 0       x       30      0       15M     *       0       0       ACGTRYMWSKDHVBN *

Your environment

Steps to reproduce

The complete round-trip commands are:

samtools faidx ambig.fa;java -jar cramtools-3.0.jar cram -I ambig.sam -O ambig.cram -R ambig.fa;java -jar cramtools-3.0.jar bam -I ambig.cram -R ambig.fa

Expected behaviour

Expected behaviour is obviously to reproduce the SAM input listed above.

Actual behaviour

Ok so this is where it gets weird...

With ref seq x of ACGTRYMWSKDHVBNacgtrymwskdhvbnAAA as listed above, my SAM->CRAM->SAM round trip ends up as:

match1  0   x   1   0   15M *   0   0   ACGTAAAAAAAAAAN *
miss2   0   x   2   0   15M *   0   0   ACGTAAAAAAAAAAN *
miss3   0   x   3   0   15M *   0   0   ACGTAAAAAAAAATN *
miss4   0   x   4   0   15M *   0   0   ACGTAAAAAAAATAN *
miss5   0   x   5   0   15M *   0   0   ACGTAAAAAAATAAN *
miss6   0   x   6   0   15M *   0   0   ACGTAAAAAATAAAN *
miss7   0   x   7   0   15M *   0   0   ACGTAAAAATAAAAN *
miss8   0   x   8   0   15M *   0   0   ACGTAAAATAAAAAN *
miss9   0   x   9   0   15M *   0   0   ACGTAAATAAAAAAN *
miss10  0   x   10  0   15M *   0   0   ACGTAATAAAAAAAN *
miss11  0   x   11  0   15M *   0   0   ACGTATAAAAAAAAN *
miss12  0   x   12  0   15M *   0   0   ACGTTAAAAAAAAAN *
miss13  0   x   13  0   15M *   0   0   ACGTAAAAAAAAAAN *
miss14  0   x   14  0   15M *   0   0   ACGTAAAAAAAAAAN *
miss15  0   x   15  0   15M *   0   0   ACGTAAAAAAAAAAN *
_match1 0   x   16  0   15M *   0   0   ACGTAAAAAAAAAAN *
_miss2  0   x   17  0   15M *   0   0   ACGTAAAAAAAAAAN *
_miss3  0   x   18  0   15M *   0   0   ACGTAAAAAAAAATN *
_miss4  0   x   19  0   15M *   0   0   ACGTAAAAAAAATTN *
_miss5  0   x   20  0   15M *   0   0   ACGTAAAAAAATTTN *
_miss6  0   x   21  0   15M *   0   0   ACGTAAAAAATTTAN *
_miss7  0   x   22  0   15M *   0   0   ACGTAAAAATTTAAN *
_miss8  0   x   23  0   15M *   0   0   ACGTAAAATTTAAAN *
_miss9  0   x   24  0   15M *   0   0   ACGTAAATTTAAAAN *
_miss10 0   x   25  0   15M *   0   0   ACGTAATTTAAAAAN *
_miss11 0   x   26  0   15M *   0   0   ACGTATTTAAAAAAN *
_miss12 0   x   27  0   15M *   0   0   ACGTTTTAAAAAAAN *
_miss13 0   x   28  0   15M *   0   0   ACGTTTAAAAAAAAN *
_miss14 0   x   29  0   15M *   0   0   ACGTTAAAAAAAAAN *
_miss15 0   x   30  0   15M *   0   0   ACGTAAAAAAAAAAN *

Most of the ambiguous bases turned to A apart from a couple which became T. In the latter half, where we know we run off the end of the reference, we see a stripe of "TTT" occuring.

If I edit the last 3 reference bases from AAA to TTT instead the round-trip generates this:

match1  0   x   1   0   15M *   0   0   ACGTAAAAAAAAAAN *
miss2   0   x   2   0   15M *   0   0   ACGTAAAAAAAAAAN *
miss3   0   x   3   0   15M *   0   0   ACGTAAAAAAAAACN *
miss4   0   x   4   0   15M *   0   0   ACGTAAAAAAAACAN *
miss5   0   x   5   0   15M *   0   0   ACGTAAAAAAACAAN *
miss6   0   x   6   0   15M *   0   0   ACGTAAAAAACAAGN *
miss7   0   x   7   0   15M *   0   0   ACGTAAAAACAAGAN *
miss8   0   x   8   0   15M *   0   0   ACGTAAAACAAGAAN *
miss9   0   x   9   0   15M *   0   0   ACGTAAACAAGAAAN *
miss10  0   x   10  0   15M *   0   0   ACGTAACAAGAAAAN *
miss11  0   x   11  0   15M *   0   0   ACGTACAAGAAAAAN *
miss12  0   x   12  0   15M *   0   0   ACGTCAAGAAAAAAN *
miss13  0   x   13  0   15M *   0   0   ACGTAAGAAAAAAAN *
miss14  0   x   14  0   15M *   0   0   ACGTAGAAAAAAAAN *
miss15  0   x   15  0   15M *   0   0   ACGTGAAAAAAAAAN *
_match1 0   x   16  0   15M *   0   0   ACGTAAAAAAAAAAN *
_miss2  0   x   17  0   15M *   0   0   ACGTAAAAAAAAAAN *
_miss3  0   x   18  0   15M *   0   0   ACGTAAAAAAAAAGN *
_miss4  0   x   19  0   15M *   0   0   ACGTAAAAAAAAGGN *
_miss5  0   x   20  0   15M *   0   0   ACGTAAAAAAAGGGN *
_miss6  0   x   21  0   15M *   0   0   ACGTAAAAAAGGGAN *
_miss7  0   x   22  0   15M *   0   0   ACGTAAAAAGGGAAN *
_miss8  0   x   23  0   15M *   0   0   ACGTAAAAGGGAAAN *
_miss9  0   x   24  0   15M *   0   0   ACGTAAAGGGAAAAN *
_miss10 0   x   25  0   15M *   0   0   ACGTAAGGGAAAAAN *
_miss11 0   x   26  0   15M *   0   0   ACGTAGGGAAAAAAN *
_miss12 0   x   27  0   15M *   0   0   ACGTGGGAAAAAAAN *
_miss13 0   x   28  0   15M *   0   0   ACGTGGAAAAAAAAN *
_miss14 0   x   29  0   15M *   0   0   ACGTGAAAAAAAAAN *
_miss15 0   x   30  0   15M *   0   0   ACGTAAAAAAAAAAN *

It's obvious that the TTTs in the output became GGGs, but also those isolated Ts earlier became either Gs or Cs too. Think about that for a moment. By changing ref seq bases 30 to 32 the round-trip changed the bases for sequences aligned to base 3..17 (the last T turned into a C).

More "exciting" still is to turn the last 3 reference bases to NNN instead:

miss2   0   x   2   0   15M *   0   0   ACGTTTTTTTTTTTN *
miss3   0   x   3   0   15M *   0   0   ACGTTTTTTTTTTCN *
miss4   0   x   4   0   15M *   0   0   ACGTTTTTTTTTCAN *
miss5   0   x   5   0   15M *   0   0   ACGTTTTTTTTCAAN *
miss6   0   x   6   0   15M *   0   0   ACGTTTTTTTCAAAN *
miss7   0   x   7   0   15M *   0   0   ACGTTTTTTCAAATN *
miss8   0   x   8   0   15M *   0   0   ACGTTTTTCAAATTN *
miss9   0   x   9   0   15M *   0   0   ACGTTTTCAAATTTN *
miss10  0   x   10  0   15M *   0   0   ACGTTTCAAATTTTN *
miss11  0   x   11  0   15M *   0   0   ACGTTCAAATTTTTN *
miss12  0   x   12  0   15M *   0   0   ACGTCAAATTTTTTN *
miss13  0   x   13  0   15M *   0   0   ACGTAAATTTTTTTN *
miss14  0   x   14  0   15M *   0   0   ACGTAATTTTTTTTN *
miss15  0   x   15  0   15M *   0   0   ACGTATTTTTTTTTN *
_match1 0   x   16  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss2  0   x   17  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss3  0   x   18  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss4  0   x   19  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss5  0   x   20  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss6  0   x   21  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss7  0   x   22  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss8  0   x   23  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss9  0   x   24  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss10 0   x   25  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss11 0   x   26  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss12 0   x   27  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss13 0   x   28  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss14 0   x   29  0   15M *   0   0   ACGTTTTTTTTTTTN *
_miss15 0   x   30  0   15M *   0   0   ACGTTTTTTTTTTTN *

We're now really singing in the baffling changes! This is repeatable behaviour, so it's not due to ininitialised data. Samtools decoding the CRAM file shows the same affect (but not when samtools encodeds), which implies this is an issue in the Java CRAM encoder. Indeed Java can mostly decode the samtools generated files, but demonstrates a different bug:

@ seq3a[/tmp]; samtools view -t ambig.fa -C -o ambig.cram ambig.sam
@ seq3a[/tmp]; samtools view --input-fmt-option decode_md=0 ambig.cram
match1  0   x   1   0   15M *   0   0   ACGTRYMWSKDHVBN *
miss2   0   x   2   0   15M *   0   0   ACGTRYMWSKDHVBN *
miss3   0   x   3   0   15M *   0   0   ACGTRYMWSKDHVBN *
miss4   0   x   4   0   15M *   0   0   ACGTRYMWSKDHVBN *
miss5   0   x   5   0   15M *   0   0   ACGTRYMWSKDHVBN *
miss6   0   x   6   0   15M *   0   0   ACGTRYMWSKDHVBN *
miss7   0   x   7   0   15M *   0   0   ACGTRYMWSKDHVBN *
miss8   0   x   8   0   15M *   0   0   ACGTRYMWSKDHVBN *
miss9   0   x   9   0   15M *   0   0   ACGTRYMWSKDHVBN *
miss10  0   x   10  0   15M *   0   0   ACGTRYMWSKDHVBN *
miss11  0   x   11  0   15M *   0   0   ACGTRYMWSKDHVBN *
miss12  0   x   12  0   15M *   0   0   ACGTRYMWSKDHVBN *
miss13  0   x   13  0   15M *   0   0   ACGTRYMWSKDHVBN *
miss14  0   x   14  0   15M *   0   0   ACGTRYMWSKDHVBN *
miss15  0   x   15  0   15M *   0   0   ACGTRYMWSKDHVBN *
_match1 0   x   16  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss2  0   x   17  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss3  0   x   18  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss4  0   x   19  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss5  0   x   20  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss6  0   x   21  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss7  0   x   22  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss8  0   x   23  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss9  0   x   24  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss10 0   x   25  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss11 0   x   26  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss12 0   x   27  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss13 0   x   28  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss14 0   x   29  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss15 0   x   30  0   15M *   0   0   ACGTRYMWSKDHVBN *
@ seq3a[/tmp]; java -jar ~/work/cram/cramtools/cramtools-3.0.jar bam -I ambig.cram -R ambig.fa
match1  0   x   1   0   15M *   0   0   ACGTRYMWSKDHVBN *
miss2   0   x   2   0   15M *   0   0   ACGTNNNNNNNNNNN *
miss3   0   x   3   0   15M *   0   0   ACGTNNNNNNNNNNN *
miss4   0   x   4   0   15M *   0   0   ACGTNNNNNNNNNNN *
miss5   0   x   5   0   15M *   0   0   ACGTNNNNNNNNNNN *
miss6   0   x   6   0   15M *   0   0   ACGTNNNNNNNNNNN *
miss7   0   x   7   0   15M *   0   0   ACGTNNNNNNNNNNN *
miss8   0   x   8   0   15M *   0   0   ACGTNNNNNNNNNNN *
miss9   0   x   9   0   15M *   0   0   ACGTNNNNNNNNNNN *
miss10  0   x   10  0   15M *   0   0   ACGTNNNNNNNNNNN *
miss11  0   x   11  0   15M *   0   0   ACGTNNNNNNNNNNN *
miss12  0   x   12  0   15M *   0   0   ACGTNNNNNNNNNNN *
miss13  0   x   13  0   15M *   0   0   ACGTNNNNNNNNNNN *
miss14  0   x   14  0   15M *   0   0   ACGTNNNNNNNNNNN *
miss15  0   x   15  0   15M *   0   0   ACGTNNNNNNNNNNN *
_match1 0   x   16  0   15M *   0   0   ACGTRYMWSKDHVBN *
_miss2  0   x   17  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss3  0   x   18  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss4  0   x   19  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss5  0   x   20  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss6  0   x   21  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss7  0   x   22  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss8  0   x   23  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss9  0   x   24  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss10 0   x   25  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss11 0   x   26  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss12 0   x   27  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss13 0   x   28  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss14 0   x   29  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss15 0   x   30  0   15M *   0   0   ACGTNNNNNNNNNNN *

Here Cramtools decode turned all the ambiguous bases to Ns. The bases are explicitly recorded in the CRAM file. Eg read 2 has (from io_lib's "cram_dump -v"):

Rec 2/30 at 0,7
BF = 0 => SAM 0x0 (ret 0, out_sz 1)
CF = 3 (ret 0, out_sz 1)
RL = 15 (ret 0, out_sz 1)
AP = 1 (ret 0, out_sz 1)
RG = -1 (ret 0, out_sz 1)
RN = miss2 (ret 0, out_sz 5)
Detached
MF = 0 (ret 0, out_sz 1)
NS = -1 (ret 0, out_sz 1)
NP = 0 (ret 0, out_sz 1)
TS = 0 (ret 0, out_sz 1)
TL = 0 (ret 0, out_sz 1)
FN = 15 (ret 0, out_sz 1)
  0: FC = X (ret 0, out_sz 1)
  0: FP = 1+0 = 1 (ret 0, out_sz 1)
  0: BS = 0 (ret 0)
  1: FC = X (ret 0, out_sz 1)
  1: FP = 1+1 = 2 (ret 0, out_sz 1)
  1: BS = 1 (ret 0)
  2: FC = X (ret 0, out_sz 1)
  2: FP = 1+2 = 3 (ret 0, out_sz 1)
  2: BS = 2 (ret 0)
  3: FC = X (ret 0, out_sz 1)
  3: FP = 1+3 = 4 (ret 0, out_sz 1)
  3: BS = 3 (ret 0)
  4: FC = B (ret 0, out_sz 1)
  4: FP = 1+4 = 5 (ret 0, out_sz 1)
  4: BA/QS(B) = R/-1 (ret 0)
  5: FC = B (ret 0, out_sz 1)
  5: FP = 1+5 = 6 (ret 0, out_sz 1)
  5: BA/QS(B) = Y/-1 (ret 0)
  6: FC = B (ret 0, out_sz 1)
  6: FP = 1+6 = 7 (ret 0, out_sz 1)
  6: BA/QS(B) = M/-1 (ret 0)
  7: FC = B (ret 0, out_sz 1)
  7: FP = 1+7 = 8 (ret 0, out_sz 1)
  7: BA/QS(B) = W/-1 (ret 0)
  8: FC = B (ret 0, out_sz 1)
  8: FP = 1+8 = 9 (ret 0, out_sz 1)
  8: BA/QS(B) = S/-1 (ret 0)
  9: FC = B (ret 0, out_sz 1)
  9: FP = 1+9 = 10 (ret 0, out_sz 1)
  9: BA/QS(B) = K/-1 (ret 0)
  10: FC = B (ret 0, out_sz 1)
  10: FP = 1+10 = 11 (ret 0, out_sz 1)
  10: BA/QS(B) = D/-1 (ret 0)
  11: FC = B (ret 0, out_sz 1)
  11: FP = 1+11 = 12 (ret 0, out_sz 1)
  11: BA/QS(B) = H/-1 (ret 0)
  12: FC = B (ret 0, out_sz 1)
  12: FP = 1+12 = 13 (ret 0, out_sz 1)
  12: BA/QS(B) = V/-1 (ret 0)
  13: FC = B (ret 0, out_sz 1)
  13: FP = 1+13 = 14 (ret 0, out_sz 1)
  13: BA/QS(B) = B/-1 (ret 0)
  14: FC = X (ret 0, out_sz 1)
  14: FP = 1+14 = 15 (ret 0, out_sz 1)
  14: BS = 3 (ret 0)
MQ = 0 (ret 0, out_sz 1)

So the data is there, but is being modified by the decoder somewhere. I think this is perhaps more than one bug.

jkbonfield commented 7 years ago

I just had an idea - to use cram_dump -v to dump out the differences in Java produced CRAM files for ref ending AAA vs NNN. This explains why so many bases, aligned in different locations, change. The base substitution statistics change which in turn generates different substitution matrices. The containers in the two cram files show:

      Substitution map:
        A: CGTN
        C: AGTN
        G: ACTN
        T: ACGN
        N: TGCA

vs

      Substitution map:
        A: TNGC
        C: AGTN
        G: ACTN
        T: ACGN
        N: ACGT

The main problem though is it is generating base substitutions that are impossible. CRAM BS data series is used for mapping ACGTN to ACGTN subst, as a number to lookup via the table in the header. (This is because we know we cannot ever generate and A->A substitution, thus the numeric solution has lower entropy and the table permits it to be optimised.) However this cannot work with ambiguous bases in either reference or sequence.

Oddly it does this for matches too. So rec 1/30 (aligned correctly, all matches ref, albeit M vs M, R vs R, etc):

FN = 10 (ret 0, out_sz 1)
  0: FC = X (ret 0, out_sz 1)
  0: FP = 5+0 = 5 (ret 0, out_sz 1)
  0: BS = 0 (ret 0)
  1: FC = X (ret 0, out_sz 1)
  1: FP = 1+5 = 6 (ret 0, out_sz 1)
  1: BS = 0 (ret 0)
  2: FC = X (ret 0, out_sz 1)
  2: FP = 1+6 = 7 (ret 0, out_sz 1)
  2: BS = 0 (ret 0)
  3: FC = X (ret 0, out_sz 1)
  3: FP = 1+7 = 8 (ret 0, out_sz 1)
  3: BS = 0 (ret 0)
  4: FC = X (ret 0, out_sz 1)
  4: FP = 1+8 = 9 (ret 0, out_sz 1)
  4: BS = 0 (ret 0)
  5: FC = X (ret 0, out_sz 1)
  5: FP = 1+9 = 10 (ret 0, out_sz 1)
  5: BS = 0 (ret 0)
  6: FC = X (ret 0, out_sz 1)
  6: FP = 1+10 = 11 (ret 0, out_sz 1)
  6: BS = 0 (ret 0)
  7: FC = X (ret 0, out_sz 1)
  7: FP = 1+11 = 12 (ret 0, out_sz 1)
  7: BS = 0 (ret 0)
  8: FC = X (ret 0, out_sz 1)
  8: FP = 1+12 = 13 (ret 0, out_sz 1)
  8: BS = 0 (ret 0)
  9: FC = X (ret 0, out_sz 1)
  9: FP = 1+13 = 14 (ret 0, out_sz 1)
  9: BS = 0 (ret 0)

So the first 4 bases (ACGT) matched and didn't generate a feature code, but all subsequent ones produce FC 'X' (substitution) and BS 0 (nonsensical).

Record 2/30, which is slipped by one and hence 100% mismatch, also uses purely FC X and BS 0 (except for the early ACG vs CGT alignment). Samtools/Scramble produced this output instead for rec 1 and 2:

Rec 1/30 at 0,7
FN = 0 (ret 0, out_sz 1)
[Ie entirely matches]

Rec 2/30 at 0,7
FN = 15 (ret 0, out_sz 1)
  0: FC = X (ret 0, out_sz 1)
  0: FP = 1+0 = 1 (ret 0, out_sz 1)
  0: BS = 0 (ret 0)
  1: FC = X (ret 0, out_sz 1)
  1: FP = 1+1 = 2 (ret 0, out_sz 1)
  1: BS = 1 (ret 0)
  2: FC = X (ret 0, out_sz 1)
  2: FP = 1+2 = 3 (ret 0, out_sz 1)
  2: BS = 2 (ret 0)
  3: FC = X (ret 0, out_sz 1)
  3: FP = 1+3 = 4 (ret 0, out_sz 1)
  3: BS = 3 (ret 0)
  4: FC = B (ret 0, out_sz 1)
  4: FP = 1+4 = 5 (ret 0, out_sz 1)
  4: BA/QS(B) = R/-1 (ret 0)
  5: FC = B (ret 0, out_sz 1)
  5: FP = 1+5 = 6 (ret 0, out_sz 1)
  5: BA/QS(B) = Y/-1 (ret 0)
...

Ie it uses feature 'X' for the initial ACGT base mismatches and feature 'B' for the rest to store verbatim the corrected sequence. I'm wondering why I chose a series of B instead of a single 'b' string, but maybe that's just historic as 'b' didn't arrive until much later.

This explains why Cramtools decode of the samtools generated CRAM gets the correct sequence for the matching reads (they have no features at all), and why the ACGT bases are correct for the mismatching ones (via valid 'X' features). However it doesn't really explain why all the verbatim 'B' features were replaced with Ns instead.

Admittedly ambiguous bases in the records is pretty rare.

cmnbroad commented 7 years ago

@jkbonfield Thanks for the detailed report. Although cramtools is built with htsjdk, I think it has (or at least the repo contains) overrides/replacements for a fair amount of the htsjdk cram implementation (??). I think this should ticket should probably live there until we determine the bug is in htsjdk. I'm assigning to @vadimzalunin, and he can opine/decide where this should live.

jkbonfield commented 7 years ago

@cmnbroad, I just tried Picard (2.6.0 because 2.7.2 brings other issues related to this CRAM file) and with SamFormatConverter and ViewSam it demonstrates the same issue. (I tweaked the .fa file to be longer to prevent reads aligning off the end, but the ambiguous bases still change.)

Round trip then gives:

match1  0   x   1   0   15M *   0   0   ACGTAAAAAAAAAAN *
miss2   0   x   2   0   15M *   0   0   ACGTAAAAAAAAAAN *
miss3   0   x   3   0   15M *   0   0   ACGTAAAAAAAAANN *
miss4   0   x   4   0   15M *   0   0   ACGTAAAAAAAANAN *
miss5   0   x   5   0   15M *   0   0   ACGTAAAAAAANAAN *
miss6   0   x   6   0   15M *   0   0   ACGTAAAAAANAAAN *
miss7   0   x   7   0   15M *   0   0   ACGTAAAAANAAAAN *
miss8   0   x   8   0   15M *   0   0   ACGTAAAANAAAAAN *
miss9   0   x   9   0   15M *   0   0   ACGTAAANAAAAAAN *
miss10  0   x   10  0   15M *   0   0   ACGTAANAAAAAAAN *
miss11  0   x   11  0   15M *   0   0   ACGTANAAAAAAAAN *
miss12  0   x   12  0   15M *   0   0   ACGTNAAAAAAAAAN *
miss13  0   x   13  0   15M *   0   0   ACGTAAAAAAAAAAN *
miss14  0   x   14  0   15M *   0   0   ACGTAAAAAAAAAAN *
miss15  0   x   15  0   15M *   0   0   ACGTAAAAAAAAAAN *
_match1 0   x   16  0   15M *   0   0   ACGTAAAAAAAAAAN *
_miss2  0   x   17  0   15M *   0   0   ACGTAAAAAAAAAAN *
_miss3  0   x   18  0   15M *   0   0   ACGTAAAAAAAAANN *
_miss4  0   x   19  0   15M *   0   0   ACGTAAAAAAAANNN *
_miss5  0   x   20  0   15M *   0   0   ACGTAAAAAAANNNN *
_miss6  0   x   21  0   15M *   0   0   ACGTAAAAAANNNNN *
_miss7  0   x   22  0   15M *   0   0   ACGTAAAAANNNNNN *
_miss8  0   x   23  0   15M *   0   0   ACGTAAAANNNNNNN *
_miss9  0   x   24  0   15M *   0   0   ACGTAAANNNNNNNN *
_miss10 0   x   25  0   15M *   0   0   ACGTAANNNNNNNNN *
_miss11 0   x   26  0   15M *   0   0   ACGTANNNNNNNNNN *
_miss12 0   x   27  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss13 0   x   28  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss14 0   x   29  0   15M *   0   0   ACGTNNNNNNNNNNN *
_miss15 0   x   30  0   15M *   0   0   ACGTNNNNNNNNNNN *

PS. With 2.7.2 we run into issues the new checking of Slice MD5 values, which is infact how come I ended up investigating this in the first place:

ERROR   2017-01-05 14:12:57 Slice   Reference MD5 mismatch for slice 0:1-44, ACGTNNNNNN...AAAAAAAAAA

Note it's changed the reference to replace all ambiguous bases by N, but in the process means the MD5 no longer matches. My belief is that the CRAM MD5 values in slices should match the same rules used in the @SQ lines, in the absence of any other clear statement.

See http://gatkforums.broadinstitute.org/gatk/discussion/8786/cram-support-in-gatk-3-7-is-broken/p1?new=1 for discussion on the MD5 issue, but it's unrelated to this bug so belongs in its own ticket.

vadimzalunin commented 7 years ago

Confirmed. Looks like subs matrix does not expect amb bases in reads.

jkbonfield commented 7 years ago

Correct - by design we only supported the obvious base types in the substitution matrix to keep it small and simple (and probably efficient). In Scramble I switch to generating base features instead of substitution features when the matrix isn't applicable.

vadimzalunin commented 7 years ago

Ie it uses feature 'X' for the initial ACGT base mismatches and feature 'B' for the rest to store verbatim the corrected sequence.

@jkbonfield what quality score do you use then?

jkbonfield commented 7 years ago

I use the actual quality score for that base, or 255 (-1) if it has none (as my example above did). This isn't ideal as the qualities have to be coded twice in a rather ugly fashion. I guess this is one reason why we added the "b" operator instead.

cmnbroad commented 7 years ago

@vadimzalunin Can you confirm that this is resolved by https://github.com/samtools/htsjdk/pull/889 (and if so close it as resolved).