vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Invalid Haplotype Field Error During pggb to gbz Conversion Due to Incorrect Regex in P-line #4097

Open sloth-eat-pudding opened 1 year ago

sloth-eat-pudding commented 1 year ago

1. What were you trying to do?

I was attempting to convert a pggb pangenome graph to gbz format for use in Giraffe.

2. What did you want to happen?

I wanted a direct conversion without issues.

3. What actually happened?

The command aborted with a std::runtime_error: MetadataBuilder: Invalid haplotype field JAHBCA010000258.1.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

root@73614c8feec4:/# vg gbwt -G hprc-v1.0-pggb.gfa --gbz-format -g hprc-v1.0-pggb-all-gbwt.gbz
terminate called after throwing an instance of 'std::runtime_error'
  what():  MetadataBuilder: Invalid haplotype field JAHBCA010000258.1
━━━━━━━━━━━━━━━━━━━━
Crash report for vg v1.50.1 "Monopoli"
Stack trace (most recent call last):
#15   Object "/vg/bin/vg", at 0x5f2fcd, in _start
#14   Object "/vg/bin/vg", at 0x1ef638f, in __libc_start_main
#13   Object "/vg/bin/vg", at 0x5c2d3e, in main
#12   Object "/vg/bin/vg", at 0xd694fb, in vg::subcommand::Subcommand::operator()(int, char**) const
#11   Object "/vg/bin/vg", at 0xdb241d, in main_gbwt(int, char**)
#10   Object "/vg/bin/vg", at 0xdafa6a, in step_1_build_gbwts(vg::GBWTHandler&, GraphHandler&, GBWTConfig&)
#9    Object "/vg/bin/vg", at 0x1575cf3, in gbwtgraph::gfa_to_gbwt(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, gbwtgraph::GFAParsingParameters const&)
#8    Object "/vg/bin/vg", at 0x156a8df, in gbwtgraph::parse_metadata(gbwtgraph::GFAFile const&, std::vector<gbwtgraph::ConstructionJob, std::allocator<gbwtgraph::ConstructionJob> > const&, gbwtgraph::MetadataBuilder&, gbwtgraph::GFAParsingParameters const&)
#7    Object "/vg/bin/vg", at 0x1566e20, in gbwtgraph::GFAFile::for_these_path_names(std::vector<char const*, std::allocator<char const*> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&) const
#6    Object "/vg/bin/vg", at 0x57f406, in gbwtgraph::MetadataBuilder::add_path(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) [clone .cold]
#5    Object "/vg/bin/vg", at 0x1e32408, in __cxa_throw
#4    Object "/vg/bin/vg", at 0x1e322a6, in std::terminate()
#3    Object "/vg/bin/vg", at 0x1e3223b, in __cxxabiv1::__terminate(void (*)())
#2    Object "/vg/bin/vg", at 0x5bf8ca, in __gnu_cxx::__verbose_terminate_handler() [clone .cold]
#1    Object "/vg/bin/vg", at 0x5c2267, in abort
#0    Object "/vg/bin/vg", at 0x14b64cb, in raise
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!
━━━━━━━━━━━━━━━━━━━━

5. What data and command can the vg dev team use to make the problem happen?

Data: hprc-v1.0-pggb.gfa
Command: vg gbwt -G hprc-v1.0-pggb.gfa --gbz-format -g hprc-v1.0-pggb-all-gbwt.gbz

6. What does running vg version say?

vg version v1.51.0 "Quellenhof"
Compiled with g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 on Linux
Linked against libstd++ 20230528
Built by root@805a61b04cee

I suspect the issue comes from the regex applied to the path names on P-lines. In hprc-v1.0-pggb.gfa, some P-line names carry an additional trailing MT field. The regex pattern is (.*)#(.*)#(.*), and because each (.*) is greedy, the name in P HG00438#2#JAHBCA010000258.1#MT is split into [HG00438#2][JAHBCA010000258.1][MT]. The second field is expected to be the haplotype, so the parser tries to convert JAHBCA010000258.1 into a number and fails. I found this pattern defined in /vg/deps/gbwtgraph/src/gfa.cpp as const std::string GFAParsingParameters::PAN_SN_REGEX = "(.*)#(.*)#(.*)";. I hope this information is helpful.
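The split described above can be reproduced outside vg. The following is a minimal Python sketch, not the gbwtgraph code: it assumes the pattern is applied as a whole-string match, and the helper names split_pan_sn and haplotype_is_integer are hypothetical. The integer check is only an approximation of whatever the real parser does.

```python
import re

# Same pattern string as the PAN_SN_REGEX quoted above.
PAN_SN_REGEX = re.compile(r"(.*)#(.*)#(.*)")


def split_pan_sn(name):
    """Return the three captured groups, or None if the name does not match.

    Hypothetical helper: assumes the pattern must match the whole name.
    """
    match = PAN_SN_REGEX.fullmatch(name)
    return match.groups() if match else None


def haplotype_is_integer(name):
    """Check whether the second captured group looks like a non-negative integer.

    Approximation only; the real parser may accept or reject other forms.
    """
    fields = split_pan_sn(name)
    return fields is not None and re.fullmatch(r"[0-9]+", fields[1]) is not None


# The first group is greedy, so it swallows "HG00438#2" and the
# contig accession ends up in the haplotype position.
print(split_pan_sn("HG00438#2#JAHBCA010000258.1#MT"))
print(haplotype_is_integer("HG00438#2#JAHBCA010000258.1#MT"))

# A plain three-field name splits as intended.
print(split_pan_sn("HG00438#2#JAHBCA010000258.1"))
print(haplotype_is_integer("HG00438#2#JAHBCA010000258.1"))
```

With the four-field name, the haplotype group holds JAHBCA010000258.1, which is the value reported in the error message.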

jltsiren commented 1 year ago

The path names in that graph are not in any format VG supports by default:

  1. sample#haplotype#contig#fragment
  2. sample#haplotype#contig#interval
  3. sample#haplotype#contig (PanSN format)
  4. name#fragment or name#interval
  5. name

The regex for the third pattern matches the name, but the haplotype field cannot be parsed, because it's not an integer.

If all path names have the same pattern, you can specify it with options --path-regex and --path-fields. Unfortunately we do not expose the ability to specify multiple patterns at the moment.
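Before trying a custom pattern, it may help to check whether every path name in the GFA really fits a single regex. The following Python sketch does that under stated assumptions: it only looks at P-lines, treats the second tab-separated column as the path name, and uses a made-up four-field candidate pattern. The function names, the candidate pattern, and the sample lines (including sampleX#chr1) are hypothetical and are not part of vg.

```python
import re


def path_names(gfa_lines):
    """Yield the name (second tab-separated column) of every P-line."""
    for line in gfa_lines:
        if line.startswith("P\t"):
            yield line.rstrip("\n").split("\t")[1]


def check_single_pattern(names, pattern, haplotype_group):
    """Return names that do not fit `pattern` as a whole-string match,
    or whose haplotype group is not a non-negative integer."""
    regex = re.compile(pattern)
    bad = []
    for name in names:
        match = regex.fullmatch(name)
        if match is None or re.fullmatch(r"[0-9]+", match.group(haplotype_group)) is None:
            bad.append(name)
    return bad


# Hypothetical candidate: sample#haplotype#contig#extra, with no '#' inside a field.
candidate = r"([^#]+)#([0-9]+)#([^#]+)#([^#]+)"

# Tiny made-up input; only the first P-line name comes from this thread.
lines = [
    "H\tVN:Z:1.0\n",
    "S\t1\tACGT\n",
    "P\tHG00438#2#JAHBCA010000258.1#MT\t1+\t*\n",
    "P\tsampleX#chr1\t1+\t*\n",
]

bad = check_single_pattern(path_names(lines), candidate, haplotype_group=2)
print(bad)
```

To run it on a real file, pass an open file handle to path_names in place of the list. An empty result suggests that a single pattern could cover the graph. Any names that are listed would need renaming first, since only one pattern can be given.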