Open jkbonfield opened 7 years ago
I just had an idea - to use cram_dump -v to dump out the differences in Java produced CRAM files for ref ending AAA vs NNN. This explains why so many bases, aligned in different locations, change. The base substitution statistics change which in turn generates different substitution matrices. The containers in the two cram files show:
Substitution map:
A: CGTN
C: AGTN
G: ACTN
T: ACGN
N: TGCA
vs
Substitution map:
A: TNGC
C: AGTN
G: ACTN
T: ACGN
N: ACGT
The main problem though is it is generating base substitutions that are impossible. CRAM BS data series is used for mapping ACGTN to ACGTN subst, as a number to lookup via the table in the header. (This is because we know we cannot ever generate and A->A substitution, thus the numeric solution has lower entropy and the table permits it to be optimised.) However this cannot work with ambiguous bases in either reference or sequence.
Oddly it does this for matches too. So rec 1/30 (aligned correctly, all matches ref, albeit M vs M, R vs R, etc):
FN = 10 (ret 0, out_sz 1)
0: FC = X (ret 0, out_sz 1)
0: FP = 5+0 = 5 (ret 0, out_sz 1)
0: BS = 0 (ret 0)
1: FC = X (ret 0, out_sz 1)
1: FP = 1+5 = 6 (ret 0, out_sz 1)
1: BS = 0 (ret 0)
2: FC = X (ret 0, out_sz 1)
2: FP = 1+6 = 7 (ret 0, out_sz 1)
2: BS = 0 (ret 0)
3: FC = X (ret 0, out_sz 1)
3: FP = 1+7 = 8 (ret 0, out_sz 1)
3: BS = 0 (ret 0)
4: FC = X (ret 0, out_sz 1)
4: FP = 1+8 = 9 (ret 0, out_sz 1)
4: BS = 0 (ret 0)
5: FC = X (ret 0, out_sz 1)
5: FP = 1+9 = 10 (ret 0, out_sz 1)
5: BS = 0 (ret 0)
6: FC = X (ret 0, out_sz 1)
6: FP = 1+10 = 11 (ret 0, out_sz 1)
6: BS = 0 (ret 0)
7: FC = X (ret 0, out_sz 1)
7: FP = 1+11 = 12 (ret 0, out_sz 1)
7: BS = 0 (ret 0)
8: FC = X (ret 0, out_sz 1)
8: FP = 1+12 = 13 (ret 0, out_sz 1)
8: BS = 0 (ret 0)
9: FC = X (ret 0, out_sz 1)
9: FP = 1+13 = 14 (ret 0, out_sz 1)
9: BS = 0 (ret 0)
So the first 4 bases (ACGT) matched and didn't generate a feature code, but all subsequent ones produce FC 'X' (substitution) and BS 0 (nonsensical).
Record 2/30, which is slipped by one and hence 100% mismatch, also uses purely FC X and BS 0 (except for the early ACG vs CGT alignment). Samtools/Scramble produced this output instead for rec 1 and 2:
Rec 1/30 at 0,7
FN = 0 (ret 0, out_sz 1)
[Ie entirely matches]
Rec 2/30 at 0,7
FN = 15 (ret 0, out_sz 1)
0: FC = X (ret 0, out_sz 1)
0: FP = 1+0 = 1 (ret 0, out_sz 1)
0: BS = 0 (ret 0)
1: FC = X (ret 0, out_sz 1)
1: FP = 1+1 = 2 (ret 0, out_sz 1)
1: BS = 1 (ret 0)
2: FC = X (ret 0, out_sz 1)
2: FP = 1+2 = 3 (ret 0, out_sz 1)
2: BS = 2 (ret 0)
3: FC = X (ret 0, out_sz 1)
3: FP = 1+3 = 4 (ret 0, out_sz 1)
3: BS = 3 (ret 0)
4: FC = B (ret 0, out_sz 1)
4: FP = 1+4 = 5 (ret 0, out_sz 1)
4: BA/QS(B) = R/-1 (ret 0)
5: FC = B (ret 0, out_sz 1)
5: FP = 1+5 = 6 (ret 0, out_sz 1)
5: BA/QS(B) = Y/-1 (ret 0)
...
Ie it uses feature 'X' for the initial ACGT base mismatches and feature 'B' for the rest to store verbatim the corrected sequence. I'm wondering why I chose a series of B instead of a single 'b' string, but maybe that's just historic as 'b' didn't arrive until much later.
This explains why Cramtools decode of the samtools generated CRAM gets the correct sequence for the matching reads (they have no features at all), and why the ACGT bases are correct for the mismatching ones (via valid 'X' features). However it doesn't really explain why all the verbatim 'B' features were replaced with Ns instead.
Admittedly ambiguous bases in the records is pretty rare.
@jkbonfield Thanks for the detailed report. Although cramtools is built with htsjdk, I think it has (or at least the repo contains) overrides/replacements for a fair amount of the htsjdk cram implementation (??). I think this should ticket should probably live there until we determine the bug is in htsjdk. I'm assigning to @vadimzalunin, and he can opine/decide where this should live.
@cmnbroad, I just tried Picard (2.6.0 because 2.7.2 brings other issues related to this CRAM file) and with SamFormatConverter and ViewSam it demonstrates the same issue. (I tweaked the .fa file to be longer to prevent reads aligning off the end, but the ambiguous bases still change.)
Round trip then gives:
match1 0 x 1 0 15M * 0 0 ACGTAAAAAAAAAAN *
miss2 0 x 2 0 15M * 0 0 ACGTAAAAAAAAAAN *
miss3 0 x 3 0 15M * 0 0 ACGTAAAAAAAAANN *
miss4 0 x 4 0 15M * 0 0 ACGTAAAAAAAANAN *
miss5 0 x 5 0 15M * 0 0 ACGTAAAAAAANAAN *
miss6 0 x 6 0 15M * 0 0 ACGTAAAAAANAAAN *
miss7 0 x 7 0 15M * 0 0 ACGTAAAAANAAAAN *
miss8 0 x 8 0 15M * 0 0 ACGTAAAANAAAAAN *
miss9 0 x 9 0 15M * 0 0 ACGTAAANAAAAAAN *
miss10 0 x 10 0 15M * 0 0 ACGTAANAAAAAAAN *
miss11 0 x 11 0 15M * 0 0 ACGTANAAAAAAAAN *
miss12 0 x 12 0 15M * 0 0 ACGTNAAAAAAAAAN *
miss13 0 x 13 0 15M * 0 0 ACGTAAAAAAAAAAN *
miss14 0 x 14 0 15M * 0 0 ACGTAAAAAAAAAAN *
miss15 0 x 15 0 15M * 0 0 ACGTAAAAAAAAAAN *
_match1 0 x 16 0 15M * 0 0 ACGTAAAAAAAAAAN *
_miss2 0 x 17 0 15M * 0 0 ACGTAAAAAAAAAAN *
_miss3 0 x 18 0 15M * 0 0 ACGTAAAAAAAAANN *
_miss4 0 x 19 0 15M * 0 0 ACGTAAAAAAAANNN *
_miss5 0 x 20 0 15M * 0 0 ACGTAAAAAAANNNN *
_miss6 0 x 21 0 15M * 0 0 ACGTAAAAAANNNNN *
_miss7 0 x 22 0 15M * 0 0 ACGTAAAAANNNNNN *
_miss8 0 x 23 0 15M * 0 0 ACGTAAAANNNNNNN *
_miss9 0 x 24 0 15M * 0 0 ACGTAAANNNNNNNN *
_miss10 0 x 25 0 15M * 0 0 ACGTAANNNNNNNNN *
_miss11 0 x 26 0 15M * 0 0 ACGTANNNNNNNNNN *
_miss12 0 x 27 0 15M * 0 0 ACGTNNNNNNNNNNN *
_miss13 0 x 28 0 15M * 0 0 ACGTNNNNNNNNNNN *
_miss14 0 x 29 0 15M * 0 0 ACGTNNNNNNNNNNN *
_miss15 0 x 30 0 15M * 0 0 ACGTNNNNNNNNNNN *
PS. With 2.7.2 we run into issues the new checking of Slice MD5 values, which is infact how come I ended up investigating this in the first place:
ERROR 2017-01-05 14:12:57 Slice Reference MD5 mismatch for slice 0:1-44, ACGTNNNNNN...AAAAAAAAAA
Note it's changed the reference to replace all ambiguous bases by N, but in the process means the MD5 no longer matches. My belief is that the CRAM MD5 values in slices should match the same rules used in the @SQ lines, in the absence of any other clear statement.
See http://gatkforums.broadinstitute.org/gatk/discussion/8786/cram-support-in-gatk-3-7-is-broken/p1?new=1 for discussion on the MD5 issue, but it's unrelated to this bug so belongs in its own ticket.
Confirmed. Looks like subs matrix does not expect amb bases in reads.
Correct - by design we only supported the obvious base types in the substitution matrix to keep it small and simple (and probably efficient). In Scramble I switch to generating base features instead of substitution features when the matrix isn't applicable.
Ie it uses feature 'X' for the initial ACGT base mismatches and feature 'B' for the rest to store verbatim the corrected sequence.
@jkbonfield what quality score do you use then?
I use the actual quality score for that base, or 255 (-1) if it has none (as my example above did). This isn't ideal as the qualities have to be coded twice in a rather ugly fashion. I guess this is one reason why we added the "b" operator instead.
@vadimzalunin Can you confirm that this is resolved by https://github.com/samtools/htsjdk/pull/889 (and if so close it as resolved).
Verify
I'm using cramtools to drive the htsjdk cram code. I see no warnings present from cramtools.
Subject of the issue
Ambiguous bases in sequence and/or reference decode incorrectly when going via CRAM round-trips. Oddly it does this in a baffling way.
My basic example is this as the fasta file:
My SAM file is a sequence of ACGTRYMWSKDHVBN aligned to coordinates 1 to 30 inclusive. Note that this means some alignments are invalid, going beyond the end of the reference, but this itself is an interesting test case which ought to pass or to emit a warning if it cannot be stored. Hence the SAM file is:
Your environment
version of htsjdk The latest checkout that works with cramtools (which is itself the git master). For htsjdk this means 88b6719 as the following commit removes functions used by cramtools. (I assume a fix for cramtools is incoming.)
version of java jdk1.8.0_74
which OS Ubuntu 12.04
Steps to reproduce
The complete round-trip commands are:
Expected behaviour
Expected behaviour is obviously to reproduce the SAM input listed above.
Actual behaviour
Ok so this is where it gets weird...
With ref seq x of ACGTRYMWSKDHVBNacgtrymwskdhvbnAAA as listed above, my SAM->CRAM->SAM round trip ends up as:
Most of the ambiguous bases turned to A apart from a couple which became T. In the latter half, where we know we run off the end of the reference, we see a stripe of "TTT" occuring.
If I edit the last 3 reference bases from AAA to TTT instead the round-trip generates this:
It's obvious that the TTTs in the output became GGGs, but also those isolated Ts earlier became either Gs or Cs too. Think about that for a moment. By changing ref seq bases 30 to 32 the round-trip changed the bases for sequences aligned to base 3..17 (the last T turned into a C).
More "exciting" still is to turn the last 3 reference bases to NNN instead:
We're now really singing in the baffling changes! This is repeatable behaviour, so it's not due to ininitialised data. Samtools decoding the CRAM file shows the same affect (but not when samtools encodeds), which implies this is an issue in the Java CRAM encoder. Indeed Java can mostly decode the samtools generated files, but demonstrates a different bug:
Here Cramtools decode turned all the ambiguous bases to Ns. The bases are explicitly recorded in the CRAM file. Eg read 2 has (from io_lib's "cram_dump -v"):
So the data is there, but is being modified by the decoder somewhere. I think this is perhaps more than one bug.