vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.07k stars 191 forks source link

vg surject: incorrect behavior near polymorphic {repetitive} insertions #4235

Closed tdlong closed 3 months ago

tdlong commented 4 months ago

1. What were you trying to do?

  1. I build a 3 genome pangenome using the singularity container and the --chop switch. The 3 genomes are reference + A + B I generate a large number of simulated reads from A or B for a 10 kb regions for which B contains a large 7.3kb transposable element (TE) insertion. The thing about TE insertions is that they are like SINEs of LINEs and occur multiple times in the genome at many different locations (and some of those locations are private to a strain).
  2. I align the reads back to the pangenome using giraffe. And surject the resulting gaf files.
  3. I choose 3 segments monomorphic in the 3 genomes, one immediately upstream of the TE, one somewhat up stream, and one further upstream.
  4. I can simply count the number of times I see each fragment in the gaf file in the “forward” orientation, and do the same thing in the surject bams

2. What actually happened?

# 3 interesting fragments
name -> segment identity = sequence = chr3R coordinates of sequence
at TE -> 2035438 = TCGTATATATTTCAAACGGTTGTCTTCATACAAAACTGTATTACTTATT = 18402675-18402723
a little upstream -> 2035390 = TTTTTTTTCTTACTCCTCTCCGCAGCTTGTGACGTGTCTCTGTCTTTGTCTGGGACTGCCTCAAATCGGTATGTCTTACA = 18401215-18401294
further upstream -> 2035414 = CTATTAGTTCCAACCGTATAATAACAATTTAGTGCAGTAGTTTCAGCTTTCTCTCGTTC = 18401587-18401645

# count them in gaf
zcat A.gaf.gz | grep ">2035438" | wc -l # 1814
zcat B.gaf.gz | grep ">2035438" | wc -l # 1518
zcat A.gaf.gz | grep ">2035390" | wc -l # 2073
zcat B.gaf.gz | grep ">2035390" | wc -l # 1672
zcat A.gaf.gz | grep ">2035414" | wc -l # 1833
zcat B.gaf.gz | grep ">2035414" | wc -l # 1601

# count them in surject

samtools view A.surject.sort.bam -F 20 dm6#0#chr3L:18402675-18402723 | wc -l #1814
samtools view B.surject.sort.bam -F 20 dm6#0#chr3L:18402675-18402723 | wc -l #2610
samtools view A.surject.sort.bam -F 20 dm6#0#chr3L:18401215-18401294 | wc -l #2073
samtools view B.surject.sort.bam -F 20 dm6#0#chr3L:18401215-18401294 | wc -l #1672
samtools view A.surject.sort.bam -F 20 dm6#0#chr3L:18401587-18401645 | wc -l #1833
samtools view B.surject.sort.bam -F 20 dm6#0#chr3L:18401587-18401645 | wc -l #1601

3. What did you want to happen?

I would think these two approach would give the same number of counts. And indeed they do for the monomorphic fragments upstream of the large insertion. And for the segment immediately upstream of the insertion in the insertion free strain. But the counts disagree for the segment immediately upstream of the insertion for the insertion containing strain (i.e., 1518 vs. 2610 = Strain B for "2035438"). I am inclined to believe the gaf counts, as the ratio of A/B counts is roughly constant for the 3 segments. That is 1814/1518 ~= 2073/1672 ~= 1833/1601 reflecting differences in raw read coverage.

What is going on with surject/bam for strain B for the hunk of the genome immediately upstream of the TE? Looking at samflags seems to suggest an answer.

samtools view A.surject.sort.bam -F 20 dm6#0#chr3L:18402675-18402723 | cut -f 2 | sort | uniq -c
#count flag 
    926 163
    888 99

samtools view B.surject.sort.bam -F 20 dm6#0#chr3L:18402675-18402723 | cut -f 2 | sort | uniq -c
    766 137
    545 163
    741 73
    558 99

In the B bam file the total number of weird flags (flags = 137 or 73) = 1507, which is strangely close to 1518 target. The 137/73 flags are “mate not mapped” … but of course the mate is mapped in the gaf file!! The normal flags (= 163 or 99) are mate mapped, which is the correct situation but the wrong count. It feels like there is double counting or something - despite nothing being called a secondary alignment. It is not as simple as throwing out the “weird” flags … as they seem to be the correct result, and the normal flag incorrect. If I use the bam to obtain coverage the results seem incorrect.

I am concerned the surject bam files are likely generally incorrect for reads mapping near the locations of insertions relative to the reference. It is conceivable the insertions need to be repetitive to reproduce this phenomena. In fact, I suspect if the insertions are unique sequences private to a non-reference strain then surject gets the answer correct.

5. What data and command can the vg dev team use to make the problem happen?

I can provide much more. But thought it might be useful to start with less

# wgsim is part of samtools, A.small.fa is the 20kb region for the strain w/o the TE
wgsim -1 150 -2 150 -r 0 -R 0 -X 0 -e 0 -d 700 -s 70 -N 200000 A.small.fa A_1.fq A_2.fq
vg giraffe -m pan.min -d pan.dist -Z pan.gbz -f A_1.fq -f A_2.fq -N A -o gaf | gzip -c >A.gaf.gz
zcat A.gaf.gz | grep ">2035438" | wc -l 
vg surject -x pan.gbz -i -b -G A.gaf.gz | samtools view -q 30 -b > A.surject.bam
samtools view A.surject.sort.bam -F 20 dm6#0#chr3L:18402675-18402723

I can of course supply all code and data, perhaps very little is needed to reproduce the result. Maybe just the two gaf.gz files and the gbz file?

tar.gz here

6. What does running vg version say?

singularity pull cactus2/cactus.sif docker://quay.io/comparative-genomics-toolkit/cactus:v2.7.2
singularity pull cactus2/vg.sif docker://quay.io/vgteam/vg:v1.55.0
jeizenga commented 3 months ago

I've looked into this issue, and I believe all of this is in fact expected behavior. I'll try to clarify.

This is the relevant subgraph:

S   2035438 TCGTATATATTTCAAACGGTTGTCTTCATACAAAACTGTATTACTTATT
S   2035446 GTACATGCTGAATTCGCATATTTAGTATTAGGGTGTATCTAAAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTACAATT
S   2035447 ATATGCAATCATTATTTTGTTTATATTGTAGAAG
L   2035438 +   2035447 +   0M
L   2035446 +   2035447 +   0M

As you note, node 2035438 is not found in the B genome, but it is found the the dm6 reference. In the B genome, node 2035446 comes immediately before 2035447, and 2035446 is not found on dm6. Accordingly, for any alignment that spans 2035446 and 2035447, the left end of the alignment (on 2035446) has to be realigned onto the dm6 reference, where it will try to align to 2035438. You'll notice that 2035446 and 2035438 have 3 bases of identical sequence on the 3' end. Thus, the realignment will find at least 3 matching bases on 2035438 before soft-clipping the rest of the read.

Here's an example of one of these read pairs' surjected graph alignment before converting to BAM:

c1_17184_17655_0:0:0_0:0:0_3b97/1   150 0   150 +   <2035475<2035473<2035472<2035470<2035469<2035467<2035466<2035464<2035463    199 8   158 148 150 60  AS:i:150    bq:Z:IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII cs:Z::129*GA:2*GA:17    fn:Z:c1_17184_17655_0:0:0_0:0:0_3b97/2  pd:b:1
c1_17184_17655_0:0:0_0:0:0_3b97/2   150 0   150 +   >2035438>2035447    83  46  50  4   150 60  AS:i:9  bq:Z:IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII cs:Z:+AAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTACA:4  fp:Z:c1_17184_17655_0:0:0_0:0:0_3b97/1  pd:b:1

The cause of the unexpected flags is related but somewhat different. The unmapped reads correspond to alignments that have 0 bases of overlap with the dm6 reference. vg surject reports these reads as unmapped, because the graph alignment can't be approximated on the corresponding reference. The reason you only see the unexpected flags on the B genome is because the B genome has long sequence in that range that are not shared with dm6, whereas in the A genome there are none in that range.

tdlong commented 3 months ago

I am not sure I understand your response. Segment 2035438 is monomorphic, meaning it exists in all 3 strains (reference, A & B). So given the way I carry out the simulation, I think a correct counting should result in a ratio of counts of fragment 2035438 in sample A versus B that is roughly constant (compared to other fragments present in all 3 strains, or my controls). This is exactly what I see if a query the gaf file directly -- where the count of 2035438 forward reads is 1518. Shouldn't the count be the same in the bam file if I query those same coordinates (But, I observe that it isn't)? It still seems to me that the correct behavior should be that directly querying a monomorphic fragment in the gaf and the bam should give the same count. I think that either the count in the gaf or the count in the bam is incorrect, how can they be different?

For monomorphic fragments the two approaches (gaf or bam) are measuring the same thing - the number of time a read overlaps that fragment. I think giraffe is getting this perfect.

The bigger phenomena, I did not present, is that if you make a coverage bed from the surject bams for sample A and B you see a drop in coverage in the 50-100 bp flanking the insertion point. This drop in coverage is very similar to the one you see when looking at the bams from a bwa-mem aligned to referect. But you can easily show you expect bwa-mem to get it wrong, because it doesn't know about the insert! I believe giraffe is getting the count right, but that information is not translating back to the bam in surject (you are 100% correct that I do not see why this is the case).

Is it OK for the gaf and the bam to disagree for a fragment this is present in ref, A, & B?

jeizenga commented 3 months ago

vg surject is implemented to preserve everything possible of the original graph alignment. When that is not possible, it realigns off-reference portions of the graph alignment to the reference sequence. This realignment step can cause the alignment to overlap regions on the reference where it didn't previously touch.

Here's an illustration to show how it worked in the example I gave above


                                                                                                                                              (read alignment before surject)
                                         AAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTACAATT------A
GTACATGCTGAATTCGCATATTTAGTATTAGGGTGTATCTAAAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTACAATT  ->  ATATGCAATCATTATTTTGTTTATATTGTAGAAG
                                                                                                                                                                           2035446 (b6)           /           2035447 (dm6 and b6)
                                                                                                                                                                                                 /      
                                                                                                                                                                                                /
                                                                                                                                                                                               /
                                                                                                                                             TCGTATATATTTCAAACGGTTGTCTTCATACAAAACTGTATTACTTATT 
                                                                                                                                                                           2035438 (dm6)

                                                                                                                                              (read alignment after surject)
                                                                                                                                                                                                    A
GTACATGCTGAATTCGCATATTTAGTATTAGGGTGTATCTAAAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTACAATT  ->/ ATATGCAATCATTATTTTGTTTATATTGTAGAAG
                                                                                                                                                                           2035446 (b6)          //           2035447 (dm6 and b6)
                                                                                                (softclipped sequence)                                                                          //      
                                         (AAGATCTACTAGGGTGACCCTAAGGAATTAGGGTGGTCCTAAGTTTACTTATTAGTCGACTTATTATTATGATTCATATTATAATTATTATTAATATTATTATTATTATTGTTATTATTATTGTTATTATGTTCGTATATACTAC)   //
                                                                                                                                                                                           ATT//
                                                                                                                                             TCGTATATATTTCAAACGGTTGTCTTCATACAAAACTGTATTACTTATT 
                                                                                                                                                                           2035438 (dm6)
tdlong commented 3 months ago

Let me try to put together a better example. I will do it next week, as I am at the big TACG conference right now. I am clearly not articulating what I see as the problem very effectively. If I can get you to see my high level concern, I think everything else will fall into place. But I have to do a better job.