pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License

HPRC graph conversion issues #551

Closed adadiehl closed 6 months ago

adadiehl commented 7 months ago

When working with both HPRC Minigraph graphs and MC graphs, converted into .og locally (see below), I have encountered a problem where any GRCh38 coordinates processed with odgi pav are reported as not present in the graph. This does not occur with the stock MC graph in og format. See commands below:

Example 1 (working):

wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.full.og
odgi pav -i hprc-v1.1-mc-grch38.full.og -b <(printf "GRCh38#chr9\t123346243\t123346539\n") -t24 -S

Result: image

Example 2 (broken):

wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.full.gfa.gz
gunzip hprc-v1.1-mc-grch38.full.gfa.gz 
odgi build -g hprc-v1.1-mc-grch38.full.gfa -o hprc-v1.1-mc-grch38.full.local.og
odgi pav -i hprc-v1.1-mc-grch38.full.local.og -b <(printf "GRCh38#chr9\t123346243\t123346539\n") -t24 -S

Result: image

Example 3 (broken):

wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph/hprc-v1.0-minigraph-grch38.gfa.gz
zcat hprc-v1.0-minigraph-grch38.gfa.gz | sed 's/s//g' | sed 's/SN:Z:chr/SN:Z:GRCh38#chr/g' > tmp.gfa
vg convert -g tmp.gfa -f -r 89 > tmp1.gfa
odgi build -g tmp.gfa -o hprc-v1.0-minigraph-grch38.og
odgi pav -i hprc-v1.0-minigraph-grch38.og -b <(printf "GRCh38#chr9\t123346243\t123346539\n") -t24 -S

Same result as Example 2.

What's going on here and how do I fix it?

AndreaGuarracino commented 7 months ago

Can you show the name of the GRCh38 paths in the "broken" examples?

As far as I know, MC emits a clipped graph in the GFA file, but an unclipped one in the ODGI file. However, you are using MC graphs based on GRCh38, so this should not be the issue here.

adadiehl commented 7 months ago

Here's an example "W" line from the second example, which is straight from the HPRC git repo:

image

I can verify that the "S" lines in the MC file contain SN tags in the format "SN:Z:GRCh38#chr*". Apparently these are not used, though.
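
One way to see which names a GFA actually exposes as paths, as opposed to names that only appear in S-line SN tags, is to print them side by side. The sketch below assumes the standard GFA 1.1 W-line column order (sample, haplotype index, sequence name); joining those three with '#' is only a display choice here, and gfa_path_names is a hypothetical helper, not a vg or odgi command.

```shell
# Print the distinct names found in a GFA, labelled by where they come from:
#   P   name of a P line (GFA 1.0 path)
#   W   sample#haplotype#sequence from a W line (GFA 1.1 walk)
#   SN  value of an SN:Z: tag on an S line (rGFA stable sequence name)
# Usage: gfa_path_names graph.gfa
gfa_path_names() {
    awk -F'\t' '
        $1 == "P" { print "P\t" $2 }
        $1 == "W" { print "W\t" $2 "#" $3 "#" $4 }
        $1 == "S" {
            # Optional tags start at the fourth column of an S line.
            for (i = 4; i <= NF; i++)
                if ($i ~ /^SN:Z:/) print "SN\t" substr($i, 6)
        }
    ' "$1" | sort -u
}
```

If a name such as GRCh38#chr9 shows up only under SN or W and never under P, a tool that reads P lines alone has no path by that name to query.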

Presumably, this file was used to generate its .og counterpart on the HPRC repo so somebody knows the answer. Would it be better to ask there?

AndreaGuarracino commented 7 months ago

odgi supports GFA 1.0, whereas W lines were introduced in GFA 1.1. You need to convert those W lines into P lines.

vg convert -g -f -W MC.gfa > MC.for-odgi.gfa
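
To confirm whether a given GFA still carries W lines, or already has P lines after conversion, one can tally its record types. This is a minimal sketch using only awk; the function name gfa_record_counts is made up for illustration and is not part of vg or odgi.

```shell
# Tally GFA record types (H, S, L, P, W, ...) using the first
# tab-separated field of each non-empty line.
# Usage: gfa_record_counts graph.gfa
gfa_record_counts() {
    awk -F'\t' 'NF { n[$1]++ } END { for (t in n) print t "\t" n[t] }' "$1" | sort
}
```

A file that reports W records but no P records would need the conversion above before running odgi build.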

glennhickey commented 7 months ago

odgi doesn't support GFA W-lines. If you want to convert MC output to odgi, I'd suggest starting with the gbz, converting it to a GFA with P-lines (vg convert -fW), then importing the result into odgi. You can use the gfa instead of the gbz, but the conversion might be a little slower.

I have had issues in the past with odgi not working well with clipped graphs due to the large number of small paths. This is why MC defaults to just outputting the full graphs in odgi.

adadiehl commented 7 months ago

The problem is, if I convert the W lines to P lines, all paths appear empty in the result. Here is an example:

wget https://zenodo.org/records/6983934/files/GRCh38-90c.r518.gfa.gz
zcat GRCh38-90c.r518.gfa.gz | sed 's/s//g' | sed 's/SN:Z:chr/SN:Z:GRCh38#chr/g' > tmp.gfa
vg convert -g tmp.gfa -W -f -r 89 > tmp1.gfa
odgi build -g tmp1.gfa -o GRCh38-90c.r518.og
odgi pav -i GRCh38-90c.r518.og -b <(printf "GRCh38#chr9\t123346243\t123346539\n") -t24 -S

Result:

image

Clearly something is amiss.

The actual goal here is to use the minigraph graphs, not MC, in odgi pav queries, so I need this to work with gfa as the source. Having verified that odgi pav works on the .og HPRC MC graphs, I chose to look there first for answers; my use of "W" lines was to emulate their use in the gfa versions given there. I am assuming those were used to generate the .og versions, but apparently they were not used directly. Are the steps you gave above the same as those used to generate those files?

glennhickey commented 7 months ago

Don't think there's anything amiss. Minigraph graphs do not contain embedded haplotypes, just an rGFA cover (i.e. no node belongs to more than one path). Any vg or odgi tool that requires path information for different samples won't be able to do much with such graphs.
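
The "cover" property can be checked directly on a GFA that has P lines by counting how many segments are visited by more than one path. This is a minimal sketch in awk, with the hypothetical name gfa_shared_segments. It keeps one entry per (segment, path) pair in memory, so it is meant for small test extracts rather than a full HPRC graph.

```shell
# Count segments that appear on at least one P-line path, and how many of
# those appear on more than one distinct path.
# Usage: gfa_shared_segments graph.gfa
gfa_shared_segments() {
    awk -F'\t' '
        $1 == "P" {
            n = split($3, steps, ",")
            for (i = 1; i <= n; i++) {
                seg = steps[i]
                sub(/[+-]$/, "", seg)          # drop the orientation suffix
                key = seg SUBSEP $2
                # Count each (segment, path) pair once, even if the path loops.
                if (!(key in seen)) { seen[key] = 1; paths[seg]++ }
            }
        }
        END {
            for (seg in paths) { total++; if (paths[seg] > 1) shared++ }
            printf "segments_on_paths\t%d\n", total
            printf "segments_on_multiple_paths\t%d\n", shared
        }
    ' "$1"
}
```

If the second count is 0, every segment lies on at most one path, so no range on one path can be reported as present in another sample's path.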

adadiehl commented 7 months ago

So, if I understand what you're saying, the odgi pav result above is the expected behavior when the above conversions are applied to the minigraph graphs? How could this be correct? You state that no nodes are shared between paths in the minigraph but, to me, this implies that no alignments exist between any samples (as the odgi pav results also imply). However, we know intuitively that this is incorrect, and can verify the correct result using the MC graphs, as I have done in example 1. Are you suggesting that the (converted) minigraph graphs are simply unsuitable for use with odgi pav?

My understanding is that the minigraphs presented in the (zenodo and) github repositories were used as the feedstock for producing the MC graphs. However, I don't need basepair-level alignments for my purposes. Is there a straightforward way to convert from the rGFA graphs to an odgi format equivalent to those given for the MC graphs?

Also, this may be more of a vg question, but it's my understanding that including the -r 89 flag for vg convert will induce inclusion of all 90 paths in the original rGFA. I have verified that "P" lines exist in the converted GFA output, and that they include correctly-formatted sequence names. (Without -r 89, the odgi pav result simply indicates which reference chromosome the bed range is located on, with no output for any non-reference sample; data not shown.) In what way are these paths different from those expected by odgi build?
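
Assuming the rank option refers to the SR:i tag on rGFA S lines, it can help to see how segments are distributed across ranks before converting. In the rGFA format, SR:i is 0 for segments on the linear reference and greater than 0 otherwise. The sketch below only tallies those tags; rgfa_rank_counts is an illustrative name, and it makes no claim about how vg convert turns ranked segments into paths.

```shell
# Tally S lines by their SR:i (rank) tag. S lines without an SR:i tag
# are reported under the key "none".
# Usage: rgfa_rank_counts graph.rgfa
rgfa_rank_counts() {
    awk -F'\t' '
        $1 == "S" {
            rank = "none"
            # Optional tags start at the fourth column of an S line.
            for (i = 4; i <= NF; i++)
                if ($i ~ /^SR:i:/) rank = substr($i, 6)
            n[rank]++
        }
        END { for (r in n) print r "\t" n[r] }
    ' "$1" | sort -n
}
```

Because each segment carries a single rank, this tally also shows that every segment is attributed to one stable sequence only, which is consistent with the cover described earlier in the thread.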