vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

"vg prune" takes too long #3690

Open Sunhh opened 2 years ago

Sunhh commented 2 years ago

1. What were you trying to do? I want to build a GCSA index for a VG file generated by the Minigraph-Cactus Pangenome Pipeline. Because "vg index" failed with a "Size limit exceeded" error, I tried to prune the graph before indexing.

2. What did you want to happen? I wanted to simplify the variation graph for GCSA indexing.

3. What actually happened? "vg prune" has now been running for over 11 hours after printing the message "Complement graph: 2367489 nodes, 2325013 edges in 408339 components", and it has still not finished.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here: None.

5. What data and command can the vg dev team use to make the problem happen? I ran into this problem after running "cactus-graphmap-join" and then "vg prune" with the commands below.

cactus-graphmap-join ./jobstore \
  --batchSystem single_machine \
  --vg w3-pg/clip-W97-clip/align-W97.vg \
  --hal w3-pg/W97-clip.hal \
  --outDir ./w3-pg/ \
  --outName W97-minaf.0.1 \
  --reference W97 \
  --wlineSep "." \
  --vgClipOpts "-d 2 -m 1000 -P U531 -P Cord" \
  --preserveIDs \
  --giraffe \
  --nodeStorage 1000 \
  --indexCores 64 \
  --realTimeLogging \
  --logFile w3-pg/W97-minaf.0.1.join.log

vg prune -t 50 -u -g w3-pg/W97-minaf.0.1.gbwt -m w3-pg/node_mapping-W97-minaf.0.1 w3-pg/clip-W97-minaf.0.1/align-W97.vg -p -M 32 > w3-pg/W97-minaf.0.1.pruned.vg

6. What does running vg version say?

vg version v1.40.0 "Suardi"
Compiled with g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0 on Linux
Linked against libstd++ 20210601
Built by stephen@lubuntu

Thank you!

adamnovak commented 2 years ago

It sounds like you may be asking vg prune to do a lot of work. I think it has finished pruning away the pathologically complex regions, which split your graph into 408,339 different pieces, and is now trying to string them back together by filling in the gaps with material from named paths.

I wouldn't be surprised if it took a day or two and a lot of memory to do this, at whole genome scale.

@glennhickey When you run vg prune on Cactus/Minigraph graphs, how long do you usually have to wait?

408,339 different pieces really is a lot, though. How confident are you that the alignments going into this are good and reflective of evolutionary history at a consistent age, and not pathologically complicated and collapsing paralogs together?
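
To put the 408,339 figure in perspective, here is a rough sketch that parses the progress line quoted in the report and works out the average size of a component. The regular expression and the `summarize` helper are hypothetical and assume the message always has exactly this shape. The thread does not define what "nodes" and "components" mean inside vg's complement graph, so treat the averages as a sanity check only.

```python
import re

# Progress line copied from the report above.
LOG_LINE = "Complement graph: 2367489 nodes, 2325013 edges in 408339 components"

# Hypothetical pattern; it assumes the message always has this exact shape.
PATTERN = re.compile(
    r"Complement graph: (\d+) nodes, (\d+) edges in (\d+) components"
)


def summarize(line):
    """Return the counts and per-component averages from one progress line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a complement-graph progress line")
    nodes, edges, components = (int(group) for group in match.groups())
    return {
        "nodes": nodes,
        "edges": edges,
        "components": components,
        "nodes_per_component": nodes / components,
        "edges_per_component": edges / components,
    }


stats = summarize(LOG_LINE)
print(f"{stats['components']:,} components")
print(f"{stats['nodes_per_component']:.2f} nodes per component on average")
print(f"{stats['edges_per_component']:.2f} edges per component on average")
```

The averages come out at just under six nodes and six edges per component, so the pruned graph looks like a very large number of small fragments rather than a few large ones.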