vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.11k stars 195 forks source link

Lower number of mapped reads in augmented graph? #2267

Open cgroza opened 5 years ago

cgroza commented 5 years ago

Hi,

Please describe:

  1. What you were trying to do I aligned a whole genome sequencing data set to a human genome augmented with SNPs and indels. Then I aligned the same dataset to the same graph, but with additional large insertions added to it.
  2. What you wanted to happen Normally, since the smaller graph is a subset of the larger graph, I exprected that the mapping rate to stay the same or increase.
  3. What actually happened However, running vg stats on the GAM shows that the larger graph aligns about 3000 fewer reads than the smaller graph.

This is very counterintuitive to me. Do you know how adding extra sequence nodes to a graph could decrease the mapping rate?

Are there any factors that I am not aware of here?

Thank you in advance.

glennhickey commented 5 years ago

Adding a variant into a graph can increase ambiguity thereby reducing the MAPQ of a given read. But stats -a uses the score to determine if the read's aligned, and it is also counterintuitive to me that adding a variant would reduce that.

ekg commented 5 years ago

How long are your reads, and how large is the graph?

One thing you can do that might help is to fix the minimum MEM length of the mapper, such as by adding -k 18 to both invocations of your mapping. See if this changes things.

This parameter is estimated based on the size of the graph that we're mapping against.

On Thu, May 16, 2019 at 6:33 PM Glenn Hickey notifications@github.com wrote:

Adding a variant into a graph can increase ambiguity thereby reducing the MAPQ of a given read. But stats -a uses the score to determine if the read's aligned, and it is also counterintuitive to me that adding a variant would reduce that.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2267?email_source=notifications&email_token=AABDQEK4HJIDTF572YZXM2DPVWLGVA5CNFSM4HNDXWLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVSQYQI#issuecomment-493161537, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQEK4CLPW2FHEHSQN6JDPVWLGVANCNFSM4HNDXWLA .

cgroza commented 5 years ago

They are 100bp paired-end reads. Both genomes are full human genomes with SNV, indels, and insertions (for the larger graph) added from a single individual (NA12878).

The larger graph has a size of: 3093174283bp The smaller graph has a size of: 3092723528bp

It's not a very large difference overall. I will still try your suggestion and will update this post.

cgroza commented 5 years ago

So I have tried what was suggested. I passed -k 18 when aligning my reads, and it did fix the issue. The number of mapped reads increased slightly as I added large insertions, deletions. However, when I added inversions, it decreased again.

Here is a table of my results:

Graph Aligned reads Gain from previous
+ INS + DEL + INV 176214386 -2077
+ INS + DEL 176216463 1392
+ INS 176215071 1586
SNPs + indels 176213485 71079
hg19 (graph) 176142406 0

As you can see, I get a large increase in mapped reads when I add SNPs and indels, followed by smaller increases when I add large insertions and deletions, then a drop when I add the inversions.

Could this be related to indexing? I am using GCSA2 after prunning the graph.