pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License

Consensus graph filled with 5000 bp blocks which might be trivially joined #101

Open ASLeonard opened 3 years ago

ASLeonard commented 3 years ago

Hi, I had a few questions about the consensus graph generation. I generated this with -C 100

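[image: image] https://user-images.githubusercontent.com/29678761/117970483-0131bd00-b329-11eb-9f71-6ac31699bb43.png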

There are many blocks of around 5000 bp which are essentially a linear chain, with only one in and one out edge. It feels like these should be joined into a single node, since they don't contain any variation. This length is suspiciously close to poa-length-target [5000], so I wasn't sure if this was related to that, where it won't produce longer nodes even if there is no bubble in it.
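To make the "one in and one out edge" criterion concrete, here is a minimal sketch that lists the edges in a GFA whose two endpoints each touch exactly one edge, i.e. the candidates for trivial joining. The function names and the example graph are made up for illustration, and the check looks only at `L` lines; real tools may apply extra conditions (for example on embedded paths) before merging.

```python
from collections import Counter

def side(node, orient, leaving):
    """Return which end of `node` an edge touches.

    A forward (+) traversal leaves through the node's end and enters
    through its start; a reverse (-) traversal does the opposite.
    """
    if leaving:
        return (node, "end" if orient == "+" else "start")
    return (node, "start" if orient == "+" else "end")

def mergeable_edges(gfa_text):
    """List edges whose two node sides each touch exactly one edge."""
    edges = set()
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "L":
            continue
        a = side(fields[1], fields[2], leaving=True)
        b = side(fields[3], fields[4], leaving=False)
        # Sorting makes "L 1 + 2 +" and "L 2 - 1 -" the same edge.
        edges.add(tuple(sorted((a, b))))
    degree = Counter(s for edge in edges for s in edge)
    return sorted(
        (a, b) for a, b in edges
        if a[0] != b[0] and degree[a] == 1 and degree[b] == 1
    )

# Hypothetical toy graph: 1 -> 2, then a bubble (3 or 4), then 5.
EXAMPLE_GFA = "\n".join([
    "S\t1\tACGT",
    "S\t2\tGGA",
    "S\t3\tT",
    "S\t4\tC",
    "S\t5\tTTT",
    "L\t1\t+\t2\t+\t0M",
    "L\t2\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t5\t+\t0M",
    "L\t4\t+\t5\t+\t0M",
])

print(mergeable_edges(EXAMPLE_GFA))
```

In the toy graph only the 1 -> 2 edge qualifies, because nodes 2 and 5 each border the bubble and so have two edges on one side.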

On a related note, I ran into a lot of issues initially, because I completed the initial run without -C (the previous default behaviour included the consensus graph). When I reran with --resume, this kept causing assertion errors in odgi build. I believe this is because L305 of pggb only checks whether a smooth.gfa exists (regardless of whether the consensus graphs exist), while L338 later tries to get stats on the consensus graphs (which were never made).
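The resume problem described above comes down to skipping a step when only some of its outputs exist. As a rough illustration of the safer check (not pggb's actual code, which is a shell script), the sketch below reports every expected file that is absent or empty. The function name and the `.cons.gfa` suffix are hypothetical; only `smooth.gfa` is named in this thread.

```python
from pathlib import Path

def missing_outputs(prefix, want_consensus):
    """Return the expected output files that are absent or empty."""
    expected = [Path(f"{prefix}.smooth.gfa")]
    if want_consensus:
        # Hypothetical file name, used only for illustration.
        expected.append(Path(f"{prefix}.cons.gfa"))
    return [p for p in expected if not p.is_file() or p.stat().st_size == 0]
```

A resumed run would then skip the step only when `missing_outputs(...)` returns an empty list, so a run first completed without -C would rebuild the consensus graphs instead of trying to compute stats on files that were never made.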

ekg commented 3 years ago

You are right, these can be trivially merged. The reason they aren't is that the merging process we had used had the side effect of corrupting the graph. We haven't yet had time to fix it, so we disabled the consensus path merging for the time being.

The larger nodes each represent one POA problem in the smoothing step. You will see this if you change pggb's -G parameter, which sets the target POA length.

ASLeonard commented 3 years ago

It looks like there has been some heavy development since this was posted. Is consensus re-enabled?

ekg commented 3 years ago

Hi Alex,

You'll want to add e.g. -C 1000 to get the consensus graph. The major problems with the graph have been fixed, and in general it's probably usable. Consider it experimental and please let me know if you find any issues.

Erik

ASLeonard commented 2 years ago

Hi @ekg, I'm returning to this after a long time, having seen some discussion in an odgi issue. The consensus graphs look much improved, shown here for values of 10, 100, and 1000. Clearly there is some beneficial simplification, but even at a consensus target of 100 there are still nodes of length 1bp.

I've observed that minigraph also creates small (or even single-bp) nodes despite "intending" to represent L>50bp variation, so perhaps this is a more general complication, where preserving graph structure still requires some small nodes? The idea we had for generating these consensus graphs was essentially a smaller but lossy graph that is easier to work with. In the final example of @1000, nearly 10% of nodes are 5bp or below, which still slows tools that scale with #nodes/edges rather than sequence length.
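For anyone wanting to reproduce a figure like "nearly 10% of nodes are 5bp or below" on their own graph, a minimal sketch follows. It assumes a GFA 1 file read into a string, and the function names are illustrative only. Segments with an omitted sequence (`*`) are counted through their `LN:i:` tag when one is present.

```python
def node_lengths(gfa_text):
    """Collect the length of every segment (S line) in GFA 1 text."""
    lengths = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "S":
            continue
        seq = fields[2]
        if seq != "*":
            lengths.append(len(seq))
            continue
        # Sequence omitted: fall back on the optional LN:i: tag.
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                lengths.append(int(tag[5:]))
                break
    return lengths

def short_node_fraction(gfa_text, max_len=5):
    """Fraction of segments whose length is at most `max_len` bp."""
    lengths = node_lengths(gfa_text)
    if not lengths:
        return 0.0
    return sum(1 for n in lengths if n <= max_len) / len(lengths)
```

Running this on the graphs for each consensus target gives a quick way to compare how many tiny nodes remain, independent of total sequence length.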

[images: consensus graphs for values of 10, 100, and 1000]

Back on the original issue topic, what is the best way to join these nodes that have in/out degree=1? I've tried odgi unchop and similar approaches, but it never seems to work.