vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

GAF nodes do not correspond to VCF nodes #4340

Open m-bogaerts opened 1 month ago

m-bogaerts commented 1 month ago

Hello again,

Thank you very much for updating and answering all the issues regarding vg.

I have a problem regarding the GAF and VCF files. I am trying to match the nodes from a specific genotype at the VCF to look for them at the GAF file and know the reads matching with the specific genotype. However, I cannot find nor the nodes from the VCF at the GAF file neither the nodes from the GAF at the VCF. For example:

From the VCF I cannot find ">24579>24578>24575>24574,>24579>24578>24577>24574" or ">24579<24576>24575>24574" at the GAF:

chr1    36503   >24579>24574    TT      TC,CT   60      .       AC=1,1;AF=0.5,0.5;AN=2;A
T=>24579>24578>24575>24574,>24579>24578>24577>24574,>24579<24576>24575>24574;NS=2
GT      .       1       .       2

From the GAF I cannot find ">56241690>56241691>56241693>56241694>56241695>56241696>56241698>56241699>56241701>56241702>56241704>56241705>56241707>56241708>56241710>56241711>56241713" at the VCF:

D00562:114:C94WFANXX:7:1111:1246:2052   125     0       125     +       >56241690>56241691>56241693>56241694>56241695>56241696>56241698>56241699>56241701>56241702>56241704>56241705>56241707>56241708>56241710>56241711>56241713       146     21      145     125     125     60      AS:i:135        bq:Z:ABBABFGGGGGGDFFFCCGGDGDFGEFBFC9FGGGGGGFGGGCGGGGGGG<FGGGG1F<11BFDG1E=1=F19FC0=<FGFCGGGC0BGG@DCDFGEGGGGG>GFG>FGG>G0CFC00<FGGGGG      cs:Z::125       dv:f:0  fn:Z:D00562:114:C94WFANXX:7:1111:1246:2052      pd:b:1

Since I am using the same graph (gbz) for creating both of those files, I am not sure whether I am looking for the wrong information, or there could be a problem in my pipeline?

Thank you very much in advance.

/1.48.0
glennhickey commented 1 month ago

There's no guarantee that every path in the VCF is in the GAF (or vice versa). But if absolutely nothing is matching, they could have incompatible IDs depending on the options used.

For example if the vcf was created with deconstruct -O (default in minigraph-cactus), then the IDs will correspond to the GFA (and not GBZ). The same problem could happen if deconstruct was run without -O but giraffe was run with --named-coordinates.

m-bogaerts commented 1 month ago

Hello,

Thank you very much for your answer.

I used the minigraph-cactus so I guess the IDs will correspond to the GFA? To solve it I was trying to do:

vg gbwt --gbz-format -g hax-pg/hax-pg.gbz -G hax-pg/hax-pg.gfa.gz
vg index -j hax-pg/hax-pg.dist hax-pg/hax-pg.gfa.gz
vg minimizer -p -d hax-pg/hax-pg.dist -o hax-pg/hax-pg.min hax-pg/hax-pg.gbz

But I get the following errors:

terminate called after throwing an instance of 'std::runtime_error'
  what():  GFAFile: ^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A^A-line 416 ended after record type
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.48.0 "Gallipoli"
Stack trace (most recent call last):
#15   Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x5f0e3d, in _start
#14   Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x1ed71af, in __libc_start_main
#13   Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x5c0ade, in main
#12   Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0xd3143b, in vg::subcommand::Subcommand::operator()(int, char**) const
#11   Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0xd7c8dd, in main_gbwt(int, char**)
#10   Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0xd79f2a, in step_1_build_gbwts(vg::GBWTHandler&, GraphHandler&, GBWTConfig&)
#9    Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x1555156, in gbwtgraph::gfa_to_gbwt(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, gbwtgraph::GFAParsingParameters const&)
#8    Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x1553b32, in gbwtgraph::GFAFile::GFAFile(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
#7    Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x1551cd7, in gbwtgraph::GFAFile::add_s_line(char const*, unsigned long)
#6    Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x57a727, in gbwtgraph::GFAFile::check_field(gbwtgraph::GFAFile::field_type const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool) [clone .cold]
#5    Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x1e13228, in __cxa_throw
#4    Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x1e130c6, in std::terminate()
#3    Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x1e1305b, in __cxxabiv1::__terminate(void (*)())
#2    Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x5bd66a, in __gnu_cxx::__verbose_terminate_handler() [clone .cold]
#1    Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x5c0007, in abort
#0    Object "/usr/local/bioinfo/src/vg/vg-v1.48.0/vg", at 0x149611b, in raise
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!
━━━━━━━━━━━━━━━━━━━━
error[load_proto_to_graph]: invalid Graph message
Loading input graph from hax-pg/hax-pg.gbz
Loading SnarlDistanceIndex from hax-pg/hax-pg.dist
Building MinimizerIndex with k = 29, w = 11

Do you know what could be happening?

Sorry for the inconveniences and thank you in advance!

glennhickey commented 3 weeks ago

vg gbwt --gbz-format -g hax-pg/hax-pg.gbz -G hax-pg/hax-pg.gfa.gz

I don't think vg accepts compressed gfa input... try passing in a .gfa instead.