Closed tdlong closed 3 months ago
I've looked into this issue, and I believe all of this is in fact expected behavior. I'll try to clarify.
This is the relevant subgraph:
S 2035438 TCGTATATATTTCAAACGGTTGTCTTCATACAAAACTGTATTACTTATT
S 2035446 GTACATGCTGAATTCGCATATTTAGTATTAGGGTGTATCTAAAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTACAATT
S 2035447 ATATGCAATCATTATTTTGTTTATATTGTAGAAG
L 2035438 + 2035447 + 0M
L 2035446 + 2035447 + 0M
As you note, node 2035438
is not found in the B genome, but it is found the the dm6
reference. In the B genome, node 2035446
comes immediately before 2035447
, and 2035446
is not found on dm6
. Accordingly, for any alignment that spans 2035446
and 2035447
, the left end of the alignment (on 2035446
) has to be realigned onto the dm6
reference, where it will try to align to 2035438
. You'll notice that 2035446
and 2035438
have 3 bases of identical sequence on the 3' end. Thus, the realignment will find at least 3 matching bases on 2035438
before soft-clipping the rest of the read.
Here's an example of one of these read pairs' surjected graph alignment before converting to BAM:
c1_17184_17655_0:0:0_0:0:0_3b97/1 150 0 150 + <2035475<2035473<2035472<2035470<2035469<2035467<2035466<2035464<2035463 199 8 158 148 150 60 AS:i:150 bq:Z:IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII cs:Z::129*GA:2*GA:17 fn:Z:c1_17184_17655_0:0:0_0:0:0_3b97/2 pd:b:1
c1_17184_17655_0:0:0_0:0:0_3b97/2 150 0 150 + >2035438>2035447 83 46 50 4 150 60 AS:i:9 bq:Z:IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII cs:Z:+AAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTACA:4 fp:Z:c1_17184_17655_0:0:0_0:0:0_3b97/1 pd:b:1
The cause of the unexpected flags is related but somewhat different. The unmapped reads correspond to alignments that have 0 bases of overlap with the dm6
reference. vg surject
reports these reads as unmapped, because the graph alignment can't be approximated on the corresponding reference. The reason you only see the unexpected flags on the B genome is because the B genome has long sequence in that range that are not shared with dm6
, whereas in the A genome there are none in that range.
I am not sure I understand your response. Segment 2035438 is monomorphic, meaning it exists in all 3 strains (reference, A & B). So given the way I carry out the simulation, I think a correct counting should result in a ratio of counts of fragment 2035438 in sample A versus B that is roughly constant (compared to other fragments present in all 3 strains, or my controls). This is exactly what I see if a query the gaf file directly -- where the count of 2035438 forward reads is 1518. Shouldn't the count be the same in the bam file if I query those same coordinates (But, I observe that it isn't)? It still seems to me that the correct behavior should be that directly querying a monomorphic fragment in the gaf and the bam should give the same count. I think that either the count in the gaf or the count in the bam is incorrect, how can they be different?
For monomorphic fragments the two approaches (gaf or bam) are measuring the same thing - the number of time a read overlaps that fragment. I think giraffe is getting this perfect.
The bigger phenomena, I did not present, is that if you make a coverage bed from the surject bams for sample A and B you see a drop in coverage in the 50-100 bp flanking the insertion point. This drop in coverage is very similar to the one you see when looking at the bams from a bwa-mem aligned to referect. But you can easily show you expect bwa-mem to get it wrong, because it doesn't know about the insert! I believe giraffe is getting the count right, but that information is not translating back to the bam in surject (you are 100% correct that I do not see why this is the case).
Is it OK for the gaf and the bam to disagree for a fragment this is present in ref, A, & B?
vg surject
is implemented to preserve everything possible of the original graph alignment. When that is not possible, it realigns off-reference portions of the graph alignment to the reference sequence. This realignment step can cause the alignment to overlap regions on the reference where it didn't previously touch.
Here's an illustration to show how it worked in the example I gave above
(read alignment before surject)
AAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTACAATT------A
GTACATGCTGAATTCGCATATTTAGTATTAGGGTGTATCTAAAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTACAATT -> ATATGCAATCATTATTTTGTTTATATTGTAGAAG
2035446 (b6) / 2035447 (dm6 and b6)
/
/
/
TCGTATATATTTCAAACGGTTGTCTTCATACAAAACTGTATTACTTATT
2035438 (dm6)
(read alignment after surject)
A
GTACATGCTGAATTCGCATATTTAGTATTAGGGTGTATCTAAAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTACAATT ->/ ATATGCAATCATTATTTTGTTTATATTGTAGAAG
2035446 (b6) // 2035447 (dm6 and b6)
(softclipped sequence) //
(AAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTAC) //
ATT//
TCGTATATATTTCAAACGGTTGTCTTCATACAAAACTGTATTACTTATT
2035438 (dm6)
Let me try to put together a better example. I will do it next week, as I am at the big TACG conference right now. I am clearly not articulating what I see as the problem very effectively. If I can get you to see my high level concern, I think everything else will fall into place. But I have to do a better job.
1. What were you trying to do?
2. What actually happened?
3. What did you want to happen?
I would think these two approach would give the same number of counts. And indeed they do for the monomorphic fragments upstream of the large insertion. And for the segment immediately upstream of the insertion in the insertion free strain. But the counts disagree for the segment immediately upstream of the insertion for the insertion containing strain (i.e., 1518 vs. 2610 = Strain B for "2035438"). I am inclined to believe the gaf counts, as the ratio of A/B counts is roughly constant for the 3 segments. That is 1814/1518 ~= 2073/1672 ~= 1833/1601 reflecting differences in raw read coverage.
What is going on with surject/bam for strain B for the hunk of the genome immediately upstream of the TE? Looking at samflags seems to suggest an answer.
In the B bam file the total number of weird flags (flags = 137 or 73) = 1507, which is strangely close to 1518 target. The 137/73 flags are “mate not mapped” … but of course the mate is mapped in the gaf file!! The normal flags (= 163 or 99) are mate mapped, which is the correct situation but the wrong count. It feels like there is double counting or something - despite nothing being called a secondary alignment. It is not as simple as throwing out the “weird” flags … as they seem to be the correct result, and the normal flag incorrect. If I use the bam to obtain coverage the results seem incorrect.
I am concerned the surject bam files are likely generally incorrect for reads mapping near the locations of insertions relative to the reference. It is conceivable the insertions need to be repetitive to reproduce this phenomena. In fact, I suspect if the insertions are unique sequences private to a non-reference strain then surject gets the answer correct.
5. What data and command can the vg dev team use to make the problem happen?
I can provide much more. But thought it might be useful to start with less
I can of course supply all code and data, perhaps very little is needed to reproduce the result. Maybe just the two gaf.gz files and the gbz file?
tar.gz here
6. What does running
vg version
say?