vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

vg convert unable to write rGFA tags #4326

Closed Han-Cao closed 1 month ago

Han-Cao commented 1 month ago

1. What were you trying to do?

I would like to convert haplotype sampled GBZ graph to GFA, with GRCh38 path written as rGFA tags.

2. What did you want to happen?

Generate an GFA file.

3. What actually happened?

Terminate with ERROR: Signal 6

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

terminate called after throwing an instance of 'std::runtime_error'
  what():  error [gfa]: unable to write rGFA tags for path GRCh38#0#chr10[133748460] because node 3000412 is traversed on its reverse strand.  rGFA only supports the forward strand.

━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.57.0 "Franchini"
Stack trace (most recent call last):
#12   Object "/home/hcaoad/bin/vg", at 0x61c794, in _start
#11   Object "/home/hcaoad/bin/vg", at 0x20e2446, in __libc_start_main
#10   Object "/home/hcaoad/bin/vg", at 0x20e0ba9, in __libc_start_call_main
#9    Object "/home/hcaoad/bin/vg", at 0xe0fa8b, in vg::subcommand::Subcommand::operator()(int, char**) const
#8    Object "/home/hcaoad/bin/vg", at 0xcbd3d9, in main_convert(int, char**)
#7    Object "/home/hcaoad/bin/vg", at 0x54a892, in vg::graph_to_gfa(handlegraph::PathHandleGraph const*, std::ostream&, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, bool, bool) [clone .cold]
#6    Object "/home/hcaoad/bin/vg", at 0x201a968, in __cxa_throw
#5    Object "/home/hcaoad/bin/vg", at 0x201a806, in std::terminate()
#4    Object "/home/hcaoad/bin/vg", at 0x201a79b, in __cxxabiv1::__terminate(void (*)())
#3    Object "/home/hcaoad/bin/vg", at 0x5e8c03, in __gnu_cxx::__verbose_terminate_handler() [clone .cold]
#2    Object "/home/hcaoad/bin/vg", at 0x5eb34b, in abort
#1    Object "/home/hcaoad/bin/vg", at 0x20f9625, in raise
#0    Object "/home/hcaoad/bin/vg", at 0x212617c, in __pthread_kill
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!

5. What data and command can the vg dev team use to make the problem happen?

I used the MC pipeline to build a pangenome graph with --reference CHM13 GRCh38, I then performed haplotype sampling using vg haplotypes --include-reference and tried to convert the sampled graph to GFA by vg convert -Q GRCh38 -t 40 -f sampled.gbz.

This can also be replicated using HPRCv1.1

vg convert -Q GRCh38 -t 40 -f hprc-v1.1-mc-chm13.gbz

what():  error [gfa]: unable to write rGFA tags for path GRCh38#0#chr10[10945] because node 6854711 is traversed on its reverse strand.  rGFA only supports the forward strand.

Everything works well if I use -Q CHM13.

6. What does running vg version say?

vg version v1.57.0 "Franchini"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by anovak@courtyard.gi.ucsc.edu
adamnovak commented 1 month ago

It looks like to do this you would need to alter the sequences of some nodes in the graph. rGFA reference paths can only visit their nodes in the nodes' local forward orientation. In the graph you have, the GRCh38 paths sometimes visit nodes' reverse strands. I think not being able to write hprc-v1.1-mc-chm13.gbz with GRCh38 rGFA tags is expected; it is referenced on CHM13 and should be able to be written with CHM13 rGFA tags, but if there's any node with opposite orientations in CHM13 and GRCh38 then you can't write the same graph out with GRCh38 rGFA tags.

So to make the rGFA you want, you would need to flip all the nodes around so their local forward orientations match the one GRCh38 uses. I feel like we might have a method to do that with vg mod given a graph and a path name, but looking at the help it looks like either it never existed or we removed it. The closest thing is vg mod --orient-forward, which will flip around any unnecessarily reversed nodes that look like they should be flipped just based on their edge connectivity.

I have seen Cactus put nodes in backward for no apparent reason; maybe that's the source of your problem and you can fix it with this and not need to deal with inversions between GRCh38 and CHM13? If that doesn't work, and we don't already have code to make all the nodes on a path be forward, someone would need to write new code to do it (maybe by processing a non-r GFA).

Once you have the node orientations you want, you could then get an rGFA, but you wouldn't be able to safely use it with any read alignments or other data derived from the GBZ, because the node sequences would have changed. If you want to do alignments with Giraffe you'd need to make a new GBZ from the modified graph.

Also, it looks like your graph doesn't actually include the entirety of GRCh38; note that you have a path GRCh38#0#chr10[133748460] with a starting offset of 133748460 on chr10. I think this means that the base before 133748460 on GRCh38's chr10 is not actually in the graph. I think @glennhickey has sorted out rGFA tag generation so you will still get the right rGFA tags for everything that is in the graph, but rGFA is really meant to be used with a complete rank-0 path set, I think. And from your description of your method I think you meant for the full GRCh38 paths to be in there.

@glennhickey If you do --reference CHM13 GRCh38 for Minigraph-Cactus, shouldn't you end up with complete versions of both sets of paths in the graph? Also, how does Minigraph-Cactus decide which reference gets to have a node locally forward? Maybe this would Just Work if the order were swapped?

Han-Cao commented 1 month ago

Thank you so much for the explanation!

According to this answer, any >10 kbp parts of GRCh38 that don't align to CHM13 are missing from the graph. So, I think an CHM13-based graph may not have full sequence of GRCh38.

The reason I would like to generate an GRCh38-based rGFA is the SO tag is required by Pangenie pipeline to preprocess the VCF for genotyping. But if modifying the graph is required to do so, I would try other preprocessing methods or switch to CHM13 for consistency of the graph used in all analysis.

adamnovak commented 1 month ago

@Han-Cao You could also try the GRCh38-based HPRC graphs, which include all of GRCh38 and then parts of CHM13. Instead of hprc-v1.1-mc-chm13.gbz you would want hprc-v1.1-mc-grch38.gbz, which is at https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.1-mc-grch38/hprc-v1.1-mc-grch38.gbz and in the GRCh38 column in the table at https://github.com/human-pangenomics/hpp_pangenome_resources?tab=readme-ov-file#minigraph-cactus

Han-Cao commented 1 month ago

Many thanks for your suggestions. I will also try the GRCh38-based graph.