vg mpmap mapping reads to pan-transcriptom too slow

jkfo002 commented 2 months ago

1. What were you trying to do?

I used PGGB pipeline to construct a plant graph pangenome for within 7 individual from a plant specie.

I deconstructed the graph and converted into pg graph with vg and used following pipeline to construct a spliced graph

$vg rna -p -t 52 -s Parent \
-n S1_chr03.gff3 \
-n S2_chr03.gff3 -n S3_chr03.gff3 \
-n S4_chr03.gff3 -n S5.gff3 \
-b ${output}.spliced.gwbt -f ${output}.spliced_pantrans.fa -i ${output}.spliced.txt \
${output}.mod.pg > ${output}.mod.spliced.pg

# 2 index
spliced_pg=${output}.mod.spliced.pg
prefix=$(basename $spliced_pg .pg)
## discance .xg
$vg index -t 52 -x ${prefix}.xg ${spliced_pg}
$vg snarls -t 52 -T ${prefix}.xg > ${prefix}.snarls
$vg index -t 52 -x ${prefix}.xg -j ${prefix}.dist
## gbwt
$vg gbwt -p --num-threads 52 -r ${output}.spliced.gwbt.ri ${output}.spliced.gwbt

I tried to mapping reads to the spliced graph like,

# mapping reads
f1=$1
f2=$2
prefix=../test/S.chr03_PanSN.softmasked.mod
xg=${prefix}.spliced.xg
gcsa=${prefix}.pruned.gcsa
dist=${prefix}.spliced.dist
output=$(basename ${f1} _1.fq.gz)
$vg mpmap -t 52 \
-n rna -f $f1 -f $f2 \
-x ${xg} -g ${gcsa} -d ${dist} \
> ${output}.aligned.gamp

2. What did you want to happen?

I want to get the gmap for downstream analysis in rpvg

3. What actually happened?

Mapping step seem crazy slow without any output

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

Place stacktrace here.

5. What data and command can the vg dev team use to make the problem happen?

How should I deal with it or any suggestion about how to simply the graph？

6. What does running vg version say?

version v1.55.0 "Bernolda"

adamnovak commented 2 months ago

Can you quantify how fast you need mpmap to be, @jkfo002? How long did you wait for output before concluding it was stuck? How many reads do you have, how long are they in bases, and how many hours can you wait to map them?

We have previously had trouble with plant genomes, especially from PGGB. We would usually hit a problem at distance index construction, when we'd have enormous and intractable snarls that would call for enormous distance matrices that we couldn't afford to fill in. I think we solved that problem by giving up on those snarls and resorting to graph traversals at runtime to find distances within them, but that can make aligning reads that go through them very slow.

Our distance indexing strategy just isn't applicable to some PGGB graphs that put thousands of nodes into what we want to think of as a single variable site, without any recognizable substructure. You might need to simplify the PGGB graph to reduce the maximum size of snarls, by making it align fewer paralogs together. While vg mpmap can support cycles in graphs, it needs a graph that looks a lot like a series-parallel graph, maybe with some loops that let you go through a region multiple times, in order to be fast enough to be useful. We don't have an approach to make it fast on graphs with big tangles, which PGGB will happily produce if not severely constrained.

jkfo002 commented 2 months ago

Sorry for the vague question.

In my situation, I built a graph with in one chromosome for about 65 Mb and my RNA-seq were total 23,131,329 paired reads. I am trying to mapping reads to this graph, but it has run the vg mpmap for about 10 days up to now and seems to struggled.

Should I used vcfbub to simple my graph or do you have any suggestion with it ?

By the way, I also used vg stats to give a statistic about my graph

nodes   3358542
edges   4509907
length  85437503
node-id-range   1:3358542

jeizenga commented 2 months ago

Did you also subset the reads to this chromosome somehow? If not, then part of the problem is that you are using a single chromosome graph. When you do that, there are many apparently high-quality reads that don't actually have anywhere to map in the reference. In that situation, vg mpmap puts a lot of computational effort into trying to find a correct mapping location that doesn't actually exist. If you work with genomewide reads, you need a genomewide graph.

adamnovak commented 2 months ago

I'm worried that it mapped 25k pairs and got 0 pairs mapped unambiguously to learn the fragment length distribution. That says bad things about the graph structure.

If you pull out a small region around some arbitrary node (say node 1234), and render it with GraphViz:

vg find -n1234 -c5 -x ${prefix}.xg | vg view -dp - | dot -Tpng -o region.png

Does it look like a stick with variation or does it look like a tangle?

It could also be the case that maybe some samples didn't merge in and so there are duplicate copies of a lot of the graph, and that's why it is not getting any well-mapped reads.

If you do vg stats --subgraphs ${prefix}.xg it will tell you about the connected components of your graph. If you have more than one line in the output your graph is broken up into multiple disconnected pieces, and it's possible one or more of your chromosome copies didn't align in. (But if the chromosome is about 65 megabases, and the graph is about 85 megabases, there isn't room for a completely disconnected chromosome copy.)

jkfo002 commented 2 months ago

Sorry for late reply, I have tried to extract the RNA-Seq reads which could map to the reference chromosome I used to construct the graph and mapped to my graph. Now it works very well.

It seems like @jeizenga said I didn't construct the whole-genome graph.

I also tried the @adamnovak suggestion and test for several nodes, I wander if there were any ways to sort the complex of the nodes and I could see the tangle? It's just for curiosity. region_12332 region_23451

I also did vg stats --subgraphs ${prefix}.xg and the result didn't appear multiple lines. 1,2091,3083,3567 85437503

Thanks for the kind helps @adamnovak @jeizenga

jeizenga commented 2 months ago

Depending on the size of your graph, you might be able to use Bandage-NG to visualize it. If you do, I would recommend looking at a single chromosome in the un-spliced graph. Splicing motifs can be visually misleading in Bandage.

adamnovak commented 2 months ago

Those graph images look OK to me; the basic shape is a stick with variants along it. We're not looking at an enormous tangle at these locations. (By "tangle" I mean the informal concept of a graph where lots of things connect to lots of other things in a difficult to understand way.)

The first image does have those two long nodes at the top, with nothing connected to them on the left. Either the graph extraction stopped there and int he full graph it continues on, or else there really isn't anything connected on the left and this is an internal "tip" or "stub". Graphs with lots of internal tips can also be unusually hard to work with, and we often trim them off when doing mapping. The vg clip command has --stubs and --stubify-paths options that I think @glennhickey wrote that can be used to remove these. Maybe that would help?

If it isn't really a tip, then we're looking at a place where two sticks are joining up, and we might expect the Bandage visualization to have some big sweeping loops in it. Maybe we're not looking at a tangle at a small scale, but there's still enough complexity in the connections that the graph is not amenable to the snarl decomposition?

jkfo002 commented 2 months ago

Thank you very much for the detailed explanation @adamnovak and @jeizenga ! I will try the advice. And I will close this tissue.

vgteam / vg

vg mpmap mapping reads to pan-transcriptom too slow #4267