pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License

Extending existing graph with long reads #518

Open BaxW opened 1 year ago

BaxW commented 1 year ago

Hello!

I've been using odgi to explore graphs I've built from long-read assemblies with both Minigraph/Cactus and pggb, and it's been incredibly useful (thank you!). I have some HiFi data for some additional samples (individuals not already represented in the graph) that I'd like to use to extend an existing graph, but the HiFi coverage (10-15x) is not deep enough for de novo assembly.

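I noticed that there is a section about this in the FAQ for odgi (https://odgi.readthedocs.io/en/latest/rst/faqs.html#graph-constructed-from-long-read-or-sequence-data-extension-with-long-reads-or-sequences) which says: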

"Here, our recommendation is to actually rebuild the graph with PGGB. One could use Graphaligner to align the long sequences against the graph and then use vg augment to extend the already existing graph, but that would be comparatively inexact and the resolutions of complex regions might drop dramatically. A reference-biased method would be Minigraph followed by Cactus."

So, if I want to extend an existing graph with lowish coverage long reads for additional samples I should:

  1. assemble the reads as best I can (despite low coverage) and re-build the graph with pggb or Minigraph/Cactus ...or...
  2. align the reads to the existing graph and use vg augment (less ideal)

Am I understanding this correctly?

ekg commented 1 year ago

Apologies that the documentation on this isn't very good!

In principle you can simply include nanopore reads in your input sequences, alongside the other reference genomes you want to include. That could be a subset or a whole pangenome.

You may want to do this in a single region or chromosome at a time using reference alignment or mapping to collect reads and contigs by locus. In principle you can use many scaffolded references for this, but the most we tested was 2 (chm13+grch38).
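A minimal sketch of the "collect reads by locus" step, assuming the reads have already been mapped to a reference and the alignments are in PAF format (the tab-separated format written by minimap2, where column 1 is the query name, column 6 the target name, column 10 the number of matching bases, and column 12 the mapping quality). The function name `assign_reads_to_loci` and the `min_mapq` cutoff are hypothetical choices for illustration, not part of pggb or odgi:

```python
from collections import defaultdict


def assign_reads_to_loci(paf_lines, min_mapq=20):
    """Assign each read to one target sequence (e.g. a chromosome).

    For every read, keep the alignment with the most matching bases
    among those passing the (assumed) mapping-quality cutoff.
    """
    best = {}  # read name -> (matching bases, target name)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        qname, tname = fields[0], fields[5]
        matches, mapq = int(fields[9]), int(fields[11])
        if mapq < min_mapq:
            continue  # ambiguous placement, leave the read unassigned
        if qname not in best or matches > best[qname][0]:
            best[qname] = (matches, tname)

    # Invert the mapping: target name -> list of read names
    loci = defaultdict(list)
    for qname, (_, tname) in best.items():
        loci[tname].append(qname)
    return dict(loci)
```

The resulting read-name lists can then be used to pull the matching sequences out of the read set, one chromosome or region at a time.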

There is also reference-free partitioning, but it is worth noting that it's hard for large numbers of sequences. We can link docs if you don't find them immediately.

Once you've collected a set of nanopore sequences and pangenome assemblies, it is possible to put them into pggb as inputs. Then there will be some paths that correspond to nanopore reads and some that correspond to assemblies.
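As a rough illustration of preparing such a mixed input, the sketch below renames FASTA records so that reads and assemblies can sit in one file with distinguishable path names. It assumes PanSN-style names (`sample#haplotype#contig`), which the pggb documentation suggests for input sequences; the helper names and the use of haplotype `0` as a placeholder for unphased reads are assumptions made here for illustration:

```python
def pansn_name(sample, haplotype, contig, delim="#"):
    """Build a PanSN-style sequence name: sample#haplotype#contig."""
    return f"{sample}{delim}{haplotype}{delim}{contig}"


def rename_fasta(lines, sample, haplotype=0):
    """Yield FASTA lines with headers rewritten to PanSN-style names.

    Only the first whitespace-delimited token of each header is kept
    as the contig (or read) name; sequence lines pass through unchanged.
    """
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without a sequence name")
            yield ">" + pansn_name(sample, haplotype, tokens[0]) + "\n"
        else:
            yield line if line.endswith("\n") else line + "\n"
```

Writing the renamed reads and the renamed assemblies into a single FASTA gives one input in which every path name records which sample it came from.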

Downstream it does get harder to work with this. There isn't a strong pipeline to use the nanopore sequences this way and then make variant calls from the aligned sample. Tools in vg should work, but I'm not sure if they can handle the diversity between the nanopore reads when these get included.

If this isn't making sense, please let me know what needs more clarification.

BaxW commented 1 year ago

Okay yes that makes sense, thanks! Out of curiosity, what advantage would this approach have over using vg augment?