pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License
355 stars 38 forks source link

pggb graph contains uncovered edges #299

Closed danydoerr closed 1 year ago

danydoerr commented 1 year ago

I believe that during graph construction, pggb produces edges that are not covered by any path. If this is in fact the case, then these additional edges can potentially cause false positives in subsequent analyses, hence they really shouldn't be part of the graph. Here's an example with a graph that I got from @subwaystation: Out of the 2,120,160 edges in graph https://zenodo.org/record/7937947/files/ecoli50.gfa.zst, 9167 are not covered by any path.

The numbers were produced with panacus. It can report no-coverage edges with the following command:

  1. Get the graph

    wget -c https://zenodo.org/record/7937947/files/ecoli50.gfa.zst
    zstd -d ecoli50.gfa.zst
  2. Run panacus:

    RUST_LOG=info panacus table -a -c edge -t4 ecoli50.gfa | grep -e '\s0$' | cut -f1 > ecoli50.no_coverage_edges.txt

ecoli50.no_coverage_edges.txt

ekg commented 1 year ago

That's a bad bug.

Can you tell what stage of the pipeline the problem occurs in. We've had a few instances of this with gfaffix but maybe you can tell it's not happening there? Or have checks in it now that ensure this isn't happening?

You can rerun the process and keep intermediate files -K. Then check each GFA to see when the problem arises.

And you shared data. We can play with that. We also have a check in odgi for this.

(Edit: apologies for my quick, confused, and rambling reply before!!)

subwaystation commented 1 year ago

I will share the relevant files when I can, unfortunately I can't access our cluster at the moment.

danydoerr commented 1 year ago

We've had a few instances of this with gfaffix but maybe you can tell it's not happening there?

It could be because of gfaffix -- the tool does not check whether it outputs additional unused edges. Would be great to get the intermediate files to check this.

glennhickey commented 1 year ago

Yeah, I've seen that too so minigraph-cactus always sends the output of gfaffix through vg clip -d 1 to remove uncovered edges.

See also: https://github.com/marschall-lab/GFAffix/issues/6

danydoerr commented 1 year ago

@glennhickey Yes, in your case, this bug is known--it occurs only when de-collapsing reference paths. Since pggb does not use this option, if the bug is in fact introduced by GFAffix, it must be something else.

ekg commented 1 year ago

Thanks @glennhickey, I was wondering if what you'd seen was related. We can look.

@danydoerr a check for graph integrity in gfaffix would be good. In this case it would show if gfaffix is the culprit or if there is another problem in pggb.

We probably want to verify that the paths are correctly embedded and that there aren't 0-coverage edges in the output if they aren't in the input. Not sure how to deal with the case where there are uncovered edges in the input graph. We can continue discussion on gfaffix.

danydoerr commented 1 year ago

@ekg Yes, I agree, a check in GFAffix would be good. I'll do this in the next weeks. Would it also make sense that pggb pipeline produces coverage-related sanity stats on the final graph so that the user can check this visually?

subwaystation commented 1 year ago

Hi @danydoerr, here are some links to the constructed graphs from various stages:

Just to be clear, the graphs actually contain 50 and not 100 haplotypes, there was some file naming hickup happening. The links will expire after 10 days.

danydoerr commented 1 year ago

@subwaystation Is this the order in which the tools are applied in the pipeline?

subwaystation commented 1 year ago

Yes!

danydoerr commented 1 year ago

@ekg after smoothxg the graph contains 9734 edges that I can't find in any of its paths. Here's the first edge that panacus finds (see attached file for all of them):

$ grep 'L    77      +       79      +' ecoli100.fa.gz.smoothxg.gfa
L       77      +       79      +       0M
$ grep -ce '79-,77-' ecoli100.fa.gz.smoothxg.gfa
0
$ grep -ce '77+,79+' ecoli100.fa.gz.smoothxg.gfa
0

GFAffix removes some of them, but in the end, there are still 9167 left in the final graph. ecoli100.fa.gz.smoothxg.uncovered_edges.txt

ekg commented 1 year ago

Oh fun! This can be debugged. I need to make a pass through pggb to fix a few things, some harder than this, but I can pick this up on the way.

A quick step is putting your test in the pipeline and calling warnings when we see it happen.

danydoerr commented 1 year ago

Sounds good to me. Unfortunately, I discovered a bug in panacus. Apparently it's not happy with the seqwish graph. I'll get back to you once the fixed version is available through conda.

danydoerr commented 1 year ago

Turns out that the seqwish graph contains lots of duplicated edges. for instance:

$ grep 'L    3       +       4       +' ecoli100.fa.gz.seqwish.gfa
L       3       +       4       +       0M
$ grep 'L    4       -       3       -' ecoli100.fa.gz.seqwish.gfa
L       4       -       3       -       0M

panacus assumes that each edge is unique. I'm now pushing a fix that gets rid of this assumption.

subwaystation commented 1 year ago

I think from our understanding these edges are unique. The first one is always in forward orientation, the latter one always in reverse orientation. This is important when we add the actual paths through the nodes.

danydoerr commented 1 year ago

Interesting. I don't think this assumption is made for the subsequent iterations of graphs, so smoothxg and following. To my understanding, the GFA format does not impose orientation on the links. So one should be able to traverse a link in either direction in the graph. Anyhow, the issue in panacus is solved now and the working version is already available on github/main. I will prepare a release for next week or so (after making a few further improvements).

danydoerr commented 1 year ago

@ekg FYI the seqwish graph does not contain any uncovered edges, so I assume that these are introduced in the construction of the smoothxg graph.

AndreaGuarracino commented 1 year ago

I have a fix! I'll push it soon.

danydoerr commented 1 year ago

Meraviglioso! Thanks for taking care of this.

AndreaGuarracino commented 1 year ago

I've pushed the updates (https://github.com/pangenome/smoothxg/pull/194 and https://github.com/pangenome/pggb/pull/304). With a hacked panacus (see https://github.com/marschall-lab/panacus/pull/6) I've noticed that abPOA and SPOA were introducing edges supported by no paths. After fixing this issue, the final graphs are slightly less complex (a bit fewer edges and a bit fewer nodes, but with the same graph length) and that's good news.