pangenome / pggb

the pangenome graph builder
https://doi.org/10.1101/2023.04.05.535718
MIT License
355 stars 38 forks source link

interoperability with vg - error:[vg::SmallSnarlSimplifier] Invalid graph on iteration 0 #270

Closed colindaven closed 1 year ago

colindaven commented 1 year ago

Hi,

thanks for your efforts with PGGB and odgi first up.

I've generated quite a few graphs in PGGB and am trying to work with them in vg and with graphaligner. In general, I can map short and long reads and sometimes call SNPs, yet problems do occur.

I have read multiple issues from the vg team saying they test on minigraph-cactus graphs rather than PGGB graphs. Do you see any reason why the following error might occur, and how I might solve it ? Could vg simplify be too old, or the two ecosystems actually be somewhat incompatible ?

Workflow PGGB -> GFA per chromosome -> ODGI squeeze -> OG to GFA -> vg convert GFA to VG -> vg simplify (fails) -> vg to GFA

My goal is to simplify the graph prior to alignment to increase mapping speed and decrease downstream problems with SNP calling and SV genotyping.

Thanks


total 836132
drwxrwsr-x 2 davenpor bioinformatics        13 Jan  2 11:56 ./
drwxrwsr-x 4 davenpor bioinformatics         2 Jan  2 11:52 ../
-rw-r--r-- 1 davenpor bioinformatics 431552807 Jan  2 11:53 AT_pangenome_test.gfa
lrwxrwxrwx 1 davenpor bioinformatics        87 Jan  2 11:52 AT_pangenome_test.og 
-rw-r--r-- 1 davenpor bioinformatics      1067 Jan  2 11:53 AT_pangenome_test.og.yaml
-rw-r--r-- 1 davenpor bioinformatics         0 Jan  2 11:54 AT_pangenome_test.simp.vg
-rw-r--r-- 1 davenpor bioinformatics 424627155 Jan  2 11:54 AT_pangenome_test.vg
-rw-rw-r-- 1 davenpor bioinformatics         0 Jan  2 11:52 .command.begin
-rw-rw-r-- 1 davenpor bioinformatics       258 Jan  2 11:56 .command.err
-rw-rw-r-- 1 davenpor bioinformatics      3583 Jan  2 11:56 .command.log
-rw-rw-r-- 1 davenpor bioinformatics         0 Jan  2 11:52 .command.out
-rw-rw-r-- 1 davenpor bioinformatics     10357 Jan  2 11:52 .command.run
-rw-rw-r-- 1 davenpor bioinformatics       415 Jan  2 11:52 .command.sh
-rw-r--r-- 1 davenpor bioinformatics         0 Jan  2 11:52 .command.trace
-rw-rw-r-- 1 davenpor bioinformatics         1 Jan  2 11:56 .exitcode

## error

/1ac5a95697e10f859b0b469c956f01$ cat .command.err
graph path 'AT_Col-0_hq-v1.1.chr4' invalid: edge from 2932976 end to 2934155 end does not exist
graph path 'AT_Col-0_hq-v1.1.chr4' invalid: edge from 2934276 start to 3646278 start does not exist
error:[vg::SmallSnarlSimplifier] Invalid graph on iteration 0
glennhickey commented 1 year ago

I think vg simplify being too old is probably at the root of this particular issue. Please raise an issue on the vg github and we can ask its author @adamnovak if you can even expect it to help you here (probably not).

In general, we want vg to be fully compatible with PGGB. The main issue now (as you're aware, I'm sure) is that the topologies coming out of PGGB are often too complex for vg to index and/or map to effectively. Hopefully there'll be some progress on this in the coming months (it's on all our radars). Unfortunately I don't think vg simplify, at least in its current state, is the answer. Apart from this issue with mapping, all the vg tools should work (ie for graph maninpulation and stats) should work with PGGB graphs (related - vg can operate directly on GFA without explicitly needing to convert).

BaxW commented 1 year ago

In cases when the topology of a PGGB graph is too complex for vg to index or map, does it make any sense to use the consensus graphs output by pggb -C for mapping with vg map or vg giraffe or would that violate the assumptions these mappers make about the paths present in the graph?

subwaystation commented 1 year ago

Theoretically you could use such a consensus graph, however the paths as well as the nodes are different from the original graph. You would have to write some lines of code to translate the mapping from the node space in the consensus graph to the nodes space in the full graph.

:warning: We currently disabled the consensus graph generation in pggb because of https://github.com/pangenome/pggb/issues/133.

adamnovak commented 1 year ago

It looks like vg simplify is bailing out because its input graph (i.e. the imported PGGB graph) doesn't satisfy some of the invariants it tries to maintain on every iteration to check its own correctness. (Hence the "iteration 0".)

Specifically, the path AT_Col-0_hq-v1.1.chr4 jumps between the end of node 2932976 and the end of node 2934155, and also between the start of node 2934276 and the start of node 3646278, but there aren't edges in the graph to allow this; the path is not actually restricting itself to the set of paths possible in the graph.

To fix this, you could create those missing edges (maybe by editing new L lines into the GFA), or you would somehow cut the path AT_Col-0_hq-v1.1.chr4 into pieces at those points (I guess with your own script), or you could adjust the pipeline upstream of vg simplify to not generate paths that leave the graph like that.

You can check with vg validate on a vg or GFA graph to make sure that this constraint is followed; it will complain if it is not.

adamnovak commented 1 year ago

Oh sorry, I'm way behind the thread here, this is known already.

Still no lead on why the PGGB and/or ODGI squeeze output might be missing edges? @ekg do you try to enforce the constraint that paths must follow graph edges in those tools?

ekg commented 1 year ago

This is definitely a bug. It should be that all edges are covered by paths and all paths only traverse real edges.

My guess is that this is a problem with odgi squeeze. We are making some stronger checks in smoothxg to confirm that everything is embedded.

There is also a slight possibility that it's another downstream step like gfaffix that's introduced a problem.

On Tue, Feb 21, 2023, 15:14 Adam Novak @.***> wrote:

Oh sorry, I'm way behind the thread here, this is known already.

Still no lead on why the PGGB and/or ODGI squeeze output might be missing edges? @ekg https://github.com/ekg do you try to enforce the constraint that paths must follow graph edges in those tools?

— Reply to this email directly, view it on GitHub https://github.com/pangenome/pggb/issues/270#issuecomment-1439102025, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQENDXRHVGJO4UQXUGQLWYUV3VANCNFSM6AAAAAATOYGECM . You are receiving this because you were mentioned.Message ID: @.***>

subwaystation commented 1 year ago

Pinging @danydoerr. He mentioned to me that in previous gfaffix versions the graph may not have been perfectly valid. Maybe this is the case here?

danydoerr commented 1 year ago

The bug is in GFAffix 1.2.x, but was fixed in March 2022. Did you run GFAffix with -c? If GFAffix runs through with this setting, the bug is probably not caused by it.

colindaven commented 1 year ago

I think, like @ekg , that the bug must be in odgi squeeze.

Going back to the first odgi files from PGGB, I can validate the chr4.vg successfully, but get an error for chr4 after doing odgi squeeze.

PGGB -> OG per chromosome -> ODGI squeeze -> OG to GFA -> vg convert GFA to VG -> vg validate

In code


  # convert og to gfa
  odgi view -i chr5.og -g -t 10  > chr5.gfa
  odgi view -i chr4.og -g -t 10  > chr4.gfa
  odgi view -i AT_pangenome_test.og -g -t 10  > AT_pangenome_test.gfa     (all chr, via odgi squeeze )

  # gfa to vg
  vg convert -t 8 -g -v chr5.gfa > chr5.vg
  vg convert -t 8 -g -v chr4.gfa > chr4.vg
  vg convert -t 8 -g -v AT_pangenome_test.gfa > AT_pangenome_test.vg

  # check file
  vg validate chr5.vg  -ok
  vg validate chr4.vg   -ok
  vg validate AT_pangenome_test.vg    -FAILS (all chr, via odgi squeeze)

errors:

vg convert -t 8 -g -v AT_pangenome_test.gfa > AT_pangenome_test.vg
vg validate AT_pangenome_test.vg

graph invalid: missing edge between 50276th step (2932976:0) and 50277th step (2934155:1) of path AT_Col-0_hq-v1.1.chr4
graph invalid: missing edge between 50277th step (2934155:0) and 50276th step (2932976:0) of path AT_Col-0_hq-v1.1.chr4
graph invalid: missing edge between 102606th step (2934276:1) and 102607th step (3646278:0) of path AT_Col-0_hq-v1.1.chr4
graph invalid: missing edge between 102607th step (3646278:1) and 102606th step (2934276:1) of path AT_Col-0_hq-v1.1.chr4
graph: invalid

# but chr4 appears fine if validated by itself without using odgi squeeze

  odgi view -i chr4.og -g -t 10  > chr4.gfa
  vg convert -t 8 -g -v chr4.gfa > chr4.vg
  vg validate chr4.vg
graph: valid
ekg commented 1 year ago

Odgi squeeze needs to be rewritten to avoid this pattern.

And we should have validation of the graph as a standard step in the pipeline.

On Fri, Feb 24, 2023, 03:45 Colin Davenport @.***> wrote:

I think, like @ekg https://github.com/ekg , that the bug must be in odgi squeeze.

Going back to the first odgi files from PGGB, I can validate the chr4.vg successfully, but get an error for chr4 after doing odgi squeeze.

PGGB -> OG per chromosome -> ODGI squeeze -> OG to GFA -> vg convert GFA to VG -> vg validate

In code

convert og to gfa

odgi view -i chr5.og -g -t 10 > chr5.gfa odgi view -i chr4.og -g -t 10 > chr4.gfa odgi view -i AT_pangenome_test.og -g -t 10 > AT_pangenome_test.gfa (all chr, via odgi squeeze )

gfa to vg

vg convert -t 8 -g -v chr5.gfa > chr5.vg vg convert -t 8 -g -v chr4.gfa > chr4.vg vg convert -t 8 -g -v AT_pangenome_test.gfa > AT_pangenome_test.vg

check file

vg validate chr5.vg -ok vg validate chr4.vg -ok vg validate AT_pangenome_test.vg -FAILS (all chr, via odgi squeeze)

errors:

vg convert -t 8 -g -v AT_pangenome_test.gfa > AT_pangenome_test.vg vg validate AT_pangenome_test.vg

graph invalid: missing edge between 50276th step (2932976:0) and 50277th step (2934155:1) of path AT_Col-0_hq-v1.1.chr4 graph invalid: missing edge between 50277th step (2934155:0) and 50276th step (2932976:0) of path AT_Col-0_hq-v1.1.chr4 graph invalid: missing edge between 102606th step (2934276:1) and 102607th step (3646278:0) of path AT_Col-0_hq-v1.1.chr4 graph invalid: missing edge between 102607th step (3646278:1) and 102606th step (2934276:1) of path AT_Col-0_hq-v1.1.chr4 graph: invalid

but chr4 appears fine if validated by itself without using odgi squeeze

odgi view -i chr4.og -g -t 10 > chr4.gfa vg convert -t 8 -g -v chr4.gfa > chr4.vg vg validate chr4.vg graph: valid

— Reply to this email directly, view it on GitHub https://github.com/pangenome/pggb/issues/270#issuecomment-1443367867, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABDQEIMQHO74QHC73WTKOTWZB7MDANCNFSM6AAAAAATOYGECM . You are receiving this because you were mentioned.Message ID: @.***>

AndreaGuarracino commented 1 year ago

@colindaven, which version of odgi are you using for the squeezing? I had to fix a bug in odgi squeeze last year that was leading to invalid graph (https://github.com/pangenome/odgi/pull/411). It seems there is something more to be fixed.

Moreover, do you have a 'small enough` example to deterministically reproduce the bug?

colindaven commented 1 year ago

@AndreaGuarracino it was the latest odgi release 0.8.2 I think, I repeated it last Monday and got the same problem.

Please try to squeeze these five odgi files from 3 A. thaliana genomes to reproduce the error. I think you'll have to squeeze all 5 chromosomes, just a minimal version using chr3 and chr4 did not cause an error for me.

https://drive.google.com/drive/folders/16zh4UrcMXqpNo6lD3CaZqmE5OCQPZwPI?usp=sharing

Hope you can reproduce this.

AndreaGuarracino commented 1 year ago

@colindaven thank you for the files. I was not able to reproduce the issue. I am using the odgi version that corresponds to the commit bfae0b3c70a0f044d7ce8510b7a45f59d5138626. Please check what I did:

odgi version
  v0.8.2-92-gbfae0b3
vg version
  vg version v1.40.0 "Suardi"

cat to_squeeze.txt
  chr1.og
  chr2.og
  chr3.og
  chr4.og
  chr5.og

odgi squeeze -f to_squeeze.txt -o all.og -t 48 -P
odgi validate -i all.og -t 48 -P # no problems here
odgi view -i all.og -g > all.gfa

vg convert -t 8 -g -v all.gfa > all.vg
vg validate all.vg # no problems here
  graph: valid
colindaven commented 1 year ago

@AndreaGuarracino Thanks for this.

Yes, you're correct, the odgi I originally used was from a pipeline with a container written before you fixed the bug in 2022. I can confirm the current bioconda release of odgi squeeze now runs without problems on this data.

Its good that odgi validate will pick up any incorrect graphs without needing a gfa-> VG conversion. I've upgraded a mapping pipeline to check any problems with inputted odgi files right at the start to avoid these errors in future.

Thanks again.