vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.07k stars 191 forks source link

VG convert to gfa and back will drop path data. #4240

Closed Robin-Rounthwaite closed 3 months ago

Robin-Rounthwaite commented 3 months ago

1. What were you trying to do? Convert an xg graph to gfa, make a small edit, and then convert to pg.

2. What did you want to happen? I wanted a pg graph with all the paths included.

3. What actually happened? Any time there was a path with a common root but a different start position, only one of the paths were kept. Example of paths (from vg paths -Lx) with the same common root:

hg002#2#JAHKSD010000042.1#0
hg002#2#JAHKSD010000042.1#29382422
hg002#2#JAHKSD010000042.1#52728260

You can tell that this happened because I got this message during conversion from gfa -> pg:

rrounthw@mustard:/private/groups/patenlab/rrounthw/nygc/chr19/sim-hg002-reads/chr19-hg002-graph$ vg convert -p chr19-hg002.added-pound-sign-to-ref.gfa > [chr19-hg002.added-pound-sign-to-ref.pg](http://chr19-hg002.added-pound-sign-to-ref.pg/)
warning:[GFAParser] Skipping GFA W line: GFA format error: On pass 1: On line 424042: Duplicate path hg002#2#JAHKSD010000042.1#0 exists in graph
warning:[GFAParser] Skipping GFA W line: GFA format error: On pass 1: On line 424043: Duplicate path hg002#2#JAHKSD010000042.1#0 exists in graph

And the output of vg paths -Lx shows only one path of each root:

hg002#2#JAHKSD010000042.1#0

5. What data and command can the vg dev team use to make the problem happen? The original graph I used is at this position in mustard:

/private/groups/patenlab/rrounthw/nygc/chr19/sim-hg002-reads/chr19-hg002-graph/chr19-hg002.xg

The following commands should produce this issue:

vg convert -f chr19-hg002.xg > chr19-hg002.gfa
vg convert -p chr19-hg002.gfa > [chr19-hg002.pg](http://chr19-hg002.pg/)

6. What does running vg version say?

vg version v1.53.0-270-ga84af9ff7 "Valmontone"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by rrounthw@mustard
Robin-Rounthwaite commented 3 months ago

Note: this original graph was created using cactus-pangenome. The command ran was this, or else very similar to this:

(venv-cactus-v2.7.1) rrounthw@mustard:/private/groups/patenlab/rrounthw/nygc/chr19/sim-hg002-reads$ cactus-pangenome ./js ./chr19-hg002-seqfile.txt --outDir ./chr19-hg002-graph/ --outName chr19-hg002 --reference GRCH38 --xg full clip --gbz full clip
Robin-Rounthwaite commented 3 months ago

Another note: this isn't a blocking issue for me, since it turned out I didn't need to edit this particular graph. Still seems worth looking into though!

adamnovak commented 3 months ago

I think the problem is that that last field (the phase block) can't really be represented by a GFA W line. The W line has start and end coordinates to distinguish multiple pieces of a given haplotype of a contig for a sample, but those are meant to hold actual coordinates and not just a phase block number.

On the output side, we get the phase block from the path but never write it to the file: https://github.com/vgteam/vg/blob/2217054be1c2debfb494607ce0445fbdbbfd11e2/src/gfa.cpp#L186

On the input side, we can't read a phase block from a W line: https://github.com/vgteam/vg/blob/2217054be1c2debfb494607ce0445fbdbbfd11e2/src/algorithms/gfa_to_handle.cpp#L228-L229

Right now we're just dropping it on output and making duplicate W lines, which isn't correct. Maybe what we should do is write paths like this as P lines instead, where we can represent them in a way we can round-trip.

I think @jltsiren does something like just put the phase block number in the start position field and ignore that this might create overlapping GFA W lines. But that only workd because the GBWTGraph GFA reader doesn't try to interpret the GFA subrange coordinates as actual subranges; the reader in vg itself does, so if we do that we'd read the result back as coordinates and not round-trip properly.

@Robin-Rounthwaite Why are your paths using what looks like a coordinate in the phase block field, instead of using a real subrange? Instead of:

hg002#2#JAHKSD010000042.1#0
hg002#2#JAHKSD010000042.1#29382422
hg002#2#JAHKSD010000042.1#52728260

We could instead have:

hg002#2#JAHKSD010000042.1[0]
hg002#2#JAHKSD010000042.1[29382422]
hg002#2#JAHKSD010000042.1[52728260]

And those can round trip through GFA just fine.

I think maybe we cram the start coordinate into the phase block field (or use it to invent a disambiguating phase block) when we have to go through GBWT, which can't represent a start coordinate. @jltsiren Could we add separate storage for subranges to the GBWT metadata so we could actually tell when we have coordinates? Right now we can't take a haplotype from a GBZ, cut it in half, store it back to a new GBZ, and always be able to get the right names for the two halves.

Robin-Rounthwaite commented 3 months ago

To answer your question @adamnovak, the paths have the final pound-sign-affix added when cactus creates the graph. The command I ran to generate the graph was as follows:

cactus-pangenome ./js ./chr19-hg002-seqfile.txt --outDir ./chr19-hg002-graph/ --outName chr19-hg002 --reference GRCH38 --xg full clip --gbz full clip

and the full graph has "#0" on the end of all the paths. The clip graph breaks the paths into multiple parts, producing the various non-zero affixes you see here.

adamnovak commented 3 months ago

@glennhickey Should Cactus maybe produce its clip graphs using subrange paths and not phase block paths?

jltsiren commented 3 months ago

@adamnovak I think we should start by standardizing how path metadata is stored in GFA, at least in HPRC context. Once that is done, we can determine how to store the information in our graphs. I don't want to make breaking changes on GBWT file formats until I'm fairly sure the change will be final.

glennhickey commented 3 months ago

Cactus starts by making a W-line VCF

zcat /private/groups/patenlab/rrounthw/nygc/chr19/sim-hg002-reads/chr19-hg002-graph/chr19-hg002.gfa.gz | grep ^W | awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 }' 
W   GRCH38  0   chr19   0   58617616
W   HG002   1   JAHKSE010000050.1   655 19770154
W   HG002   1   JAHKSE010000056.1   99  8807366
W   HG002   1   JAHKSE010000058.1   16  16236076
W   HG002   1   JAHKSE010000061.1   1298003 10444010
W   HG002   1   JAHKSE010000139.1   4636    2908791
W   hg002   2   JAHKSD010000015.1   1108    947532
W   hg002   2   JAHKSD010000042.1   0   25129638
W   hg002   2   JAHKSD010000042.1   29382422    52712231
W   hg002   2   JAHKSD010000042.1   52728260    56714188
W   hg002   2   JAHKSD010000079.1   0   2895175
W   hg002   2   JAHKSD010000133.1   5264    708329
W   hg002   2   JAHKSD010000195.1   6354    59606
W   hg002   2   JAHKSD010000425.1   0   26954

Then it makes a GBZ out of that (with vg gbwt -G --gbz-format -g)

vg paths -Lx /private/groups/patenlab/rrounthw/nygc/chr19/sim-hg002-reads/chr19-hg002-graph/chr19-hg002.gbz
GRCH38#0#chr19
HG002#1#JAHKSE010000050.1#655
HG002#1#JAHKSE010000056.1#99
HG002#1#JAHKSE010000058.1#16
HG002#1#JAHKSE010000061.1#1298003
HG002#1#JAHKSE010000139.1#4636
hg002#2#JAHKSD010000015.1#1108
hg002#2#JAHKSD010000042.1#0
hg002#2#JAHKSD010000042.1#29382422
hg002#2#JAHKSD010000042.1#52728260
hg002#2#JAHKSD010000079.1#0
hg002#2#JAHKSD010000133.1#5264
hg002#2#JAHKSD010000195.1#6354
hg002#2#JAHKSD010000425.1#0

Then an XG out of the gbz with vg convert -x

vg paths -Lx /private/groups/patenlab/rrounthw/nygc/chr19/sim-hg002-reads/chr19-hg002-graph/chr19-hg002.xg
GRCH38#0#chr19
HG002#1#JAHKSE010000050.1#655
HG002#1#JAHKSE010000056.1#99
HG002#1#JAHKSE010000058.1#16
HG002#1#JAHKSE010000061.1#1298003
HG002#1#JAHKSE010000139.1#4636
hg002#2#JAHKSD010000015.1#1108
hg002#2#JAHKSD010000042.1#0
hg002#2#JAHKSD010000042.1#29382422
hg002#2#JAHKSD010000042.1#52728260
hg002#2#JAHKSD010000079.1#0
hg002#2#JAHKSD010000133.1#5264
hg002#2#JAHKSD010000195.1#6354
hg002#2#JAHKSD010000425.1#0

As shown above, the path range start is transformed into a phase block in the GBZ conversion (which as far as I remember is normal since GBZ doesn't have a start field).