Open cgroza opened 5 years ago
Adding a variant into a graph can increase ambiguity thereby reducing the MAPQ of a given read. But stats -a
uses the score to determine if the read's aligned, and it is also counterintuitive to me that adding a variant would reduce that.
How long are your reads, and how large is the graph?
One thing you can do that might help is to fix the minimum MEM length of the mapper, such as by adding -k 18 to both invocations of your mapping. See if this changes things.
This parameter is estimated based on the size of the graph that we're mapping against.
On Thu, May 16, 2019 at 6:33 PM Glenn Hickey notifications@github.com wrote:
Adding a variant into a graph can increase ambiguity thereby reducing the MAPQ of a given read. But stats -a uses the score to determine if the read's aligned, and it is also counterintuitive to me that adding a variant would reduce that.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2267?email_source=notifications&email_token=AABDQEK4HJIDTF572YZXM2DPVWLGVA5CNFSM4HNDXWLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODVSQYQI#issuecomment-493161537, or mute the thread https://github.com/notifications/unsubscribe-auth/AABDQEK4CLPW2FHEHSQN6JDPVWLGVANCNFSM4HNDXWLA .
They are 100bp paired-end reads. Both genomes are full human genomes with SNV, indels, and insertions (for the larger graph) added from a single individual (NA12878).
The larger graph has a size of: 3093174283bp The smaller graph has a size of: 3092723528bp
It's not a very large difference overall. I will still try your suggestion and will update this post.
So I have tried what was suggested. I passed -k 18 when aligning my reads, and it did fix the issue. The number of mapped reads increased slightly as I added large insertions, deletions. However, when I added inversions, it decreased again.
Here is a table of my results:
Graph | Aligned reads | Gain from previous |
---|---|---|
+ INS + DEL + INV | 176214386 | -2077 |
+ INS + DEL | 176216463 | 1392 |
+ INS | 176215071 | 1586 |
SNPs + indels | 176213485 | 71079 |
hg19 (graph) | 176142406 | 0 |
As you can see, I get a large increase in mapped reads when I add SNPs and indels, followed by smaller increases when I add large insertions and deletions, then a drop when I add the inversions.
Could this be related to indexing? I am using GCSA2 after prunning the graph.
Hi,
Please describe:
This is very counterintuitive to me. Do you know how adding extra sequence nodes to a graph could decrease the mapping rate?
Are there any factors that I am not aware of here?
Thank you in advance.