Closed Robin-Rounthwaite closed 3 months ago
Note: this original graph was created using cactus-pangenome
. The command ran was this, or else very similar to this:
(venv-cactus-v2.7.1) rrounthw@mustard:/private/groups/patenlab/rrounthw/nygc/chr19/sim-hg002-reads$ cactus-pangenome ./js ./chr19-hg002-seqfile.txt --outDir ./chr19-hg002-graph/ --outName chr19-hg002 --reference GRCH38 --xg full clip --gbz full clip
Another note: this isn't a blocking issue for me, since it turned out I didn't need to edit this particular graph. Still seems worth looking into though!
I think the problem is that that last field (the phase block) can't really be represented by a GFA W line. The W line has start and end coordinates to distinguish multiple pieces of a given haplotype of a contig for a sample, but those are meant to hold actual coordinates and not just a phase block number.
On the output side, we get the phase block from the path but never write it to the file: https://github.com/vgteam/vg/blob/2217054be1c2debfb494607ce0445fbdbbfd11e2/src/gfa.cpp#L186
On the input side, we can't read a phase block from a W line: https://github.com/vgteam/vg/blob/2217054be1c2debfb494607ce0445fbdbbfd11e2/src/algorithms/gfa_to_handle.cpp#L228-L229
Right now we're just dropping it on output and making duplicate W lines, which isn't correct. Maybe what we should do is write paths like this as P lines instead, where we can represent them in a way we can round-trip.
I think @jltsiren does something like just put the phase block number in the start position field and ignore that this might create overlapping GFA W lines. But that only workd because the GBWTGraph GFA reader doesn't try to interpret the GFA subrange coordinates as actual subranges; the reader in vg itself does, so if we do that we'd read the result back as coordinates and not round-trip properly.
@Robin-Rounthwaite Why are your paths using what looks like a coordinate in the phase block field, instead of using a real subrange? Instead of:
hg002#2#JAHKSD010000042.1#0
hg002#2#JAHKSD010000042.1#29382422
hg002#2#JAHKSD010000042.1#52728260
We could instead have:
hg002#2#JAHKSD010000042.1[0]
hg002#2#JAHKSD010000042.1[29382422]
hg002#2#JAHKSD010000042.1[52728260]
And those can round trip through GFA just fine.
I think maybe we cram the start coordinate into the phase block field (or use it to invent a disambiguating phase block) when we have to go through GBWT, which can't represent a start coordinate. @jltsiren Could we add separate storage for subranges to the GBWT metadata so we could actually tell when we have coordinates? Right now we can't take a haplotype from a GBZ, cut it in half, store it back to a new GBZ, and always be able to get the right names for the two halves.
To answer your question @adamnovak, the paths have the final pound-sign-affix added when cactus creates the graph. The command I ran to generate the graph was as follows:
cactus-pangenome ./js ./chr19-hg002-seqfile.txt --outDir ./chr19-hg002-graph/ --outName chr19-hg002 --reference GRCH38 --xg full clip --gbz full clip
and the full
graph has "#0" on the end of all the paths. The clip
graph breaks the paths into multiple parts, producing the various non-zero affixes you see here.
@glennhickey Should Cactus maybe produce its clip graphs using subrange paths and not phase block paths?
@adamnovak I think we should start by standardizing how path metadata is stored in GFA, at least in HPRC context. Once that is done, we can determine how to store the information in our graphs. I don't want to make breaking changes on GBWT file formats until I'm fairly sure the change will be final.
Cactus starts by making a W-line VCF
zcat /private/groups/patenlab/rrounthw/nygc/chr19/sim-hg002-reads/chr19-hg002-graph/chr19-hg002.gfa.gz | grep ^W | awk '{print $1 "\t" $2 "\t" $3 "\t" $4 "\t" $5 "\t" $6 }'
W GRCH38 0 chr19 0 58617616
W HG002 1 JAHKSE010000050.1 655 19770154
W HG002 1 JAHKSE010000056.1 99 8807366
W HG002 1 JAHKSE010000058.1 16 16236076
W HG002 1 JAHKSE010000061.1 1298003 10444010
W HG002 1 JAHKSE010000139.1 4636 2908791
W hg002 2 JAHKSD010000015.1 1108 947532
W hg002 2 JAHKSD010000042.1 0 25129638
W hg002 2 JAHKSD010000042.1 29382422 52712231
W hg002 2 JAHKSD010000042.1 52728260 56714188
W hg002 2 JAHKSD010000079.1 0 2895175
W hg002 2 JAHKSD010000133.1 5264 708329
W hg002 2 JAHKSD010000195.1 6354 59606
W hg002 2 JAHKSD010000425.1 0 26954
Then it makes a GBZ out of that (with vg gbwt -G --gbz-format -g
)
vg paths -Lx /private/groups/patenlab/rrounthw/nygc/chr19/sim-hg002-reads/chr19-hg002-graph/chr19-hg002.gbz
GRCH38#0#chr19
HG002#1#JAHKSE010000050.1#655
HG002#1#JAHKSE010000056.1#99
HG002#1#JAHKSE010000058.1#16
HG002#1#JAHKSE010000061.1#1298003
HG002#1#JAHKSE010000139.1#4636
hg002#2#JAHKSD010000015.1#1108
hg002#2#JAHKSD010000042.1#0
hg002#2#JAHKSD010000042.1#29382422
hg002#2#JAHKSD010000042.1#52728260
hg002#2#JAHKSD010000079.1#0
hg002#2#JAHKSD010000133.1#5264
hg002#2#JAHKSD010000195.1#6354
hg002#2#JAHKSD010000425.1#0
Then an XG out of the gbz with vg convert -x
vg paths -Lx /private/groups/patenlab/rrounthw/nygc/chr19/sim-hg002-reads/chr19-hg002-graph/chr19-hg002.xg
GRCH38#0#chr19
HG002#1#JAHKSE010000050.1#655
HG002#1#JAHKSE010000056.1#99
HG002#1#JAHKSE010000058.1#16
HG002#1#JAHKSE010000061.1#1298003
HG002#1#JAHKSE010000139.1#4636
hg002#2#JAHKSD010000015.1#1108
hg002#2#JAHKSD010000042.1#0
hg002#2#JAHKSD010000042.1#29382422
hg002#2#JAHKSD010000042.1#52728260
hg002#2#JAHKSD010000079.1#0
hg002#2#JAHKSD010000133.1#5264
hg002#2#JAHKSD010000195.1#6354
hg002#2#JAHKSD010000425.1#0
As shown above, the path range start is transformed into a phase block in the GBZ conversion (which as far as I remember is normal since GBZ doesn't have a start field).
1. What were you trying to do? Convert an xg graph to gfa, make a small edit, and then convert to pg.
2. What did you want to happen? I wanted a pg graph with all the paths included.
3. What actually happened? Any time there was a path with a common root but a different start position, only one of the paths were kept. Example of paths (from vg paths -Lx) with the same common root:
You can tell that this happened because I got this message during conversion from gfa -> pg:
And the output of
vg paths -Lx
shows only one path of each root:5. What data and command can the vg dev team use to make the problem happen? The original graph I used is at this position in mustard:
The following commands should produce this issue:
6. What does running
vg version
say?