vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

mpmap takes a long time and lots of memory to map one 8kbp sequence to a small but complex graph #2728

Open jmonlong opened 4 years ago

jmonlong commented 4 years ago

What were you trying to do?

Map one 8 kbp sequence to a small but complex graph. The sequence contains 3-4 kbp flanks from the reference genome that should match exactly one of the path in the graph. There is a short tandem repeat in the graph though and some variation already so it's a bit complex. The variation in the graph comes from vg construct and doing mpmap+augment a few times.

What did you want to happen?

Get a mapping without using most of the shared cluster's memory and in less than 1h.

What actually happened?

It runs for a long time and uses a lot of memory. I'm not sure why because it's just one read and a small graph. Playing with prune or mpmap parameters didn't seem to help.

What data and command can the vg dev team use to make the problem happen?

I placed data and commands in courtyard at: /public/groups/vg/jmonlong/vg_issues/mpmap_long_highmem_complexgraph

It's a tiny graph so also in this zip: mpmap_long_highmem_complexgraph.zip

This is what I used to make the graph and call mpmap:

vg index -x graph.xg graph.vg

## small graph but complex
vg stats -zl graph.vg
# nodes   986
# edges   1366
# length  12661

vg prune -M 32 graph.vg > graph.pruned.vg
vg index -g graph.gcsa graph.pruned.vg
vg mpmap -E -B -S -f read.fq -x graph.xg -g graph.gcsa > read.gam

What does running vg version say?

vg version v1.23.0 "Lavello"
Compiled with g++ (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0 on Linux
Linked against libstd++ 20191114
Built by anovak@octagon
jeizenga commented 4 years ago

Yeah, there are several reasons why this might have happened based on your description of the graph. I'm working right now on getting a better set of default parameters for mapping long sequences, which might resolve this. How large are the largest structural variants you have in this catalog?

jmonlong commented 4 years ago

Thanks, let me know if you want me to try parameters out. For now, I ended up using an aggressive pruning (-M 4). It doesn't get stuck anymore but doesn't seem to work that well. Looks like some SVs are not added, not sure why yet.

In my "augmentation" experiment, I can control the size of the flanks (that match the reference exactly) in the SV-containing sequence to align. I made sure flanks are at least the size of the largest SV, to avoid soft-clips. Do you think even longer "reference flanks" would help? Or it's really the complexity of the graph that is the problem?

In term of SV size, the largest deletion in this example is 255bp and I was mapping+augmenting a few dozens insertions, the largest being 960bp. In the full catalog the longest SVs reach ~100kbp in size.

At this point I'm only mapping/augmenting SVs <1kbp though. I first add the large ones using vg construct, then map/augment the ones <1kbp. So the large SVs are in the graph but I'm not trying to map sequences with SVs >1kbp at this point.

jmonlong commented 4 years ago

Just a comment that these small graphs that take a long time to map to shouldn't contain any cycles. I just checked the one attached to this issue and I don't see any, just many big bubbles.

graph.pdf

jmonlong commented 4 years ago

For info, I tested with the most recent vg (commit 9f33af6) that includes the recent mpmap PR #2758 and see the same behavior. I aborted when mapping the read was taking more than 30% of plaza's memory (~300 Gb I believe).

jeizenga commented 4 years ago

Okay, thanks. I'll try to look at this soon

jeizenga commented 4 years ago

Looked at this a little bit. On some level the culprit is this 1bp self-loop in the middle of a TATA repeat. Is it supposed to be there? It would probably be best if mpmap didn't blow up on cases like these, but in the meantime, the alignment would probably go fine if this edge were removed.

Screen Shot 2020-05-06 at 6 29 03 PM
jmonlong commented 4 years ago

Thanks for investigating. To be honest, I didn't expect there would be self-loops. This graph was made using vg construct from a VCF and then a couple of rounds of mpmap-augment. I thought neither construct nor mpmap would create self-loops. I'll try to find out at what point this self-loop appears. If I can't find a problem, I agree that removing it would be the way to go.

Can you imagine situations when augmenting a mpmap alignment create a self-loop?

jmonlong commented 4 years ago

@glennhickey maybe you have some ideas about this? Do you think self-loops could arise during augmentation? Maybe to simplify the graph, adding a self-loop instead of adding a 1 bp insertion for example? (just spit balling)