GSCA index creation fails on PGGB graph

vgteam / vg

tools for working with genome variation graphs

Other

1.09k stars 193 forks source link

1. What were you trying to do?

I was trying to create a gsca index of a graph created with pggb.

2. What did you want to happen?

The index to be created successfully (as happened with the xg index)

3. What actually happened?

The process was killed. I knew that this process can take quite a lot of memory so I kept iteratively increasing the memory allocation up to 2TB. Starting at a 128GB memory and all the way up to the allocation of 2TB the process always died in the same place:

GCSA::GCSA(): Step 4 (path length 128 -> 256)
PathGraph::prune(): 13479701948 -> 12619578206 paths (9874731106 ranges)
PathGraph::prune(): 8674004370 unique, 121405195 redundant, 3352079029 unsorted, 472089612 nondeterministic paths
PathGraph::prune(): 547.652 GB in 103 file(s)
PathGraph::read(): File 0: Read 705385109 order-128 paths
PathGraph::extend(): File 0: Created 2435258710 order-256 paths
PathGraph::read(): File 0: Read 2435258710 order-256 paths
PathGraphBuilder::sort(): File 0: Sorted 2435258710 paths
PathGraph::read(): File 1: Read 875387596 order-128 paths
PathGraph::extend(): File 1: Created 2634508175 order-256 paths
PathGraph::read(): File 1: Read 2634508175 order-256 paths
PathGraphBuilder::sort(): File 1: Sorted 2634508175 paths
PathGraph::read(): File 2: Read 717201532 order-128 paths
PathGraph::extend(): File 2: Created 53166860410 order-256 paths
Killed

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

There was no stack trace path referred to in the error.

5. What data and command can the vg dev team use to make the problem happen?

The data here are eight chromosome scale assemblies just a bit over 2Gb in length. The command was as follows:

vg index -g pggb.gsca -Z 50000 -t 48 -p -b ./tmp $(for i in $(seq 0 102); do echo group.${i}.mod.prune.vg; done)

6. What does running vg version say?

This process was attempted with both v1.47.0 and v1.48.0, with the same results.

You would want to heavily prune this DBG that GCSA2 constructs. See vg prune. You can also index only up to a relatively short k. You can put in a DBG of order 11 and double twice and you get k=44. This too can go wrong...

Honestly, it will be difficult to do this and I'm afraid of leading you astray. A major blocker with graph aligners is that, if they consider all possible paths through the graph (all possible recombinant haplotypes) then the search space that they need to consider can be exponentially larger than the actual set of sequences in the graph.

Another major blocker is how the aligner handles complex sequence motifs, dense graph structures, and big bubbles. I guess this is generally the graph "topology". I have not even tested vg map on PGGB graphs. And, we are having major difficulty getting vg giraffe to work on it. Even running vg giraffe on the minigraph-cactus output requires extensive clipping (via vg clip). We can do a similar thing to PGGB but there are still problems with the downstream pipelines like DeepVariant calling.

Our current approach is to align reads to the pangenome directly, and then inject them into the graph using gfainject. This converts BAM into GAF, so long as the alignments are to the paths in the graph, it's a lossless conversion.

We need a final step to harmonize the mapping with the graph so as to capture recombinants---mapping to the linear pangenome and injecting doesn't give us those. There may be some exact match solutions to this, and it could be very cool to apply the match extension functions in giraffe to do this to an arbitrary input GAF. Let's just say that this alignment harmonizer remains an open problem, but it's similar to some of the local realignment methods in the mappers we have and I don't think insoluble. It could also be dealt with during variant calling or via surjection.

Note that this process is something we're still learning about. Right now, we're exploring the best configuration for bwa mem that makes the process efficient. It is not as fast as working with a single reference, but most of the problems seem to have to do with assumptions about how many homologous copies of the genome there are. The additional information in the pangenome can let us be more stringent about matching criteria, which I expect will ultimately give this process the same kind of performance boost as what we see with vg giraffe. I'll be happy if we can be a small multiple of bwa mem runtime and still map against the complex graphs that PGGB is making.

The fact that biology appears to be driving graph complexity is reason that I'm trying to avoid the clipping and pruning techniques that were important to vg over the years. I want to have a mapping pipeline that has predictable and linear performance even if the graph is topologically complex.

vgteam / vg

GSCA index creation fails on PGGB graph #3965