pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License

Producing subgraphs of large pangenomes #309

Closed ASLeonard closed 1 year ago

ASLeonard commented 1 year ago

Hi,

I was thinking about the inverse problem of #306. If we start with a large graph with many samples, we may be interested in the subgraph covered by a subset of those samples (e.g., we build a graph for an entire genus, but now want a graph of only one species). This assumes there is a large benefit to working with a less complex graph (less CPU/RAM for alignment, plotting, etc.).

Do you think something like odgi extract --paths-to-extract <all samples for that species> ... would be sufficient to get a sensible topology (maybe we could then re-run gfaffix to simplify any stray nodes)? The alternative would be to subset the wfmash PAF, keeping only rows where both target and query are in the subset. That is likely a substantially longer process, though, since it means continuing from seqwish and only saves the alignment stage. It also carries some risk: the divergence parameter originally passed to wfmash was chosen for genus-level, not species-level, comparisons, so it may be suboptimal for some within-species alignments.
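For illustration, the PAF-subsetting alternative could look roughly like the sketch below. It assumes sequence names follow PanSN-style naming (`sample#haplotype#contig`), so the sample is the text before the first `#`; `sample_of` and `subset_paf` are hypothetical helper names, not part of wfmash or pggb. It only relies on the standard PAF layout, where column 1 is the query name and column 6 is the target name.

```python
def sample_of(seq_name, delimiter="#"):
    """Return the sample part of a PanSN-style name (assumed naming scheme)."""
    return seq_name.split(delimiter, 1)[0]


def subset_paf(lines, keep_samples):
    """Yield PAF rows whose query AND target both belong to keep_samples."""
    keep = set(keep_samples)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            # Not a complete PAF record (12 mandatory columns); skip it.
            continue
        query_name, target_name = fields[0], fields[5]
        if sample_of(query_name) in keep and sample_of(target_name) in keep:
            yield line


if __name__ == "__main__":
    # Toy records with made-up coordinates, for demonstration only.
    example = [
        "spA1#1#chr1\t100\t0\t100\t+\tspA2#1#chr1\t100\t0\t100\t98\t100\t60",
        "spA1#1#chr1\t100\t0\t100\t+\tspB1#1#chr1\t100\t0\t100\t90\t100\t60",
    ]
    for row in subset_paf(example, {"spA1", "spA2"}):
        print(row)
```

The filtered PAF would still need to go through the rest of the pipeline from seqwish onward, and it does nothing about the alignment parameters having been chosen for the full genus-level set.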

Thanks, Alex

ekg commented 1 year ago

It should be possible to do this following these steps:

Remove all but the desired paths from the graph. I don't remember whether this is available in odgi paths or whether it's cleaner to do it by removing the paths from the GFA and rebuilding.

Use odgi prune to remove 0-coverage nodes.

Use odgi unchop to merge redundant runs of nodes where there is no longer variation.

Reapply the sorting pipelines used in pggb (odgi sort) to get the graph in an intelligible shape for 1D and 2D visualization.

The resulting graph just has the genomes of interest and it represents their relationship in the original graph.

You can then process the graph as you see fit, for instance by extracting regions of interest for fine-scale inspection.
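As a rough sketch of the first two steps done directly on the GFA (dropping unwanted paths, then dropping segments no remaining path visits), something like the following could work. It assumes a GFA v1 file with `P` lines, which is an assumption about the input; `subset_gfa` is a hypothetical helper, not an odgi command. It keeps a link only when both of its endpoints survive, which is a simplification: a link between two retained segments is kept even if no retained path actually traverses it.

```python
def subset_gfa(lines, keep_paths):
    """Keep only the named paths, the segments they visit, and links between them."""
    keep = set(keep_paths)
    records = [line.rstrip("\n") for line in lines if line.strip()]

    # Pass 1: collect every segment visited by a path we want to keep.
    used = set()
    for rec in records:
        fields = rec.split("\t")
        if fields[0] == "P" and fields[1] in keep:
            # Each step looks like "12+" or "12-"; strip the orientation sign.
            used.update(step[:-1] for step in fields[2].split(","))

    # Pass 2: emit only the records that survive the subsetting.
    out = []
    for rec in records:
        fields = rec.split("\t")
        tag = fields[0]
        if tag == "S":
            if fields[1] in used:          # drop 0-coverage segments
                out.append(rec)
        elif tag == "L":
            if fields[1] in used and fields[3] in used:
                out.append(rec)
        elif tag == "P":
            if fields[1] in keep:
                out.append(rec)
        else:
            out.append(rec)                # headers and other record types pass through
    return out


if __name__ == "__main__":
    # Toy bubble graph: two paths differ at segment 2 vs segment 3.
    example = [
        "H\tVN:Z:1.0",
        "S\t1\tACGT", "S\t2\tT", "S\t3\tG", "S\t4\tCCA",
        "L\t1\t+\t2\t+\t0M", "L\t1\t+\t3\t+\t0M",
        "L\t2\t+\t4\t+\t0M", "L\t3\t+\t4\t+\t0M",
        "P\ta#1#chr1\t1+,2+,4+\t*",
        "P\tb#1#chr1\t1+,3+,4+\t*",
    ]
    print("\n".join(subset_gfa(example, {"a#1#chr1"})))
```

In the toy example, keeping only `a#1#chr1` leaves segments 1, 2 and 4 as a linear chain with no remaining variation, which is the kind of run the unchop step is meant to merge. Unchopping and re-sorting are not covered by this sketch and would still be done with odgi.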

ASLeonard commented 1 year ago

Thanks Erik, I'll give this a go and see how it compares.