vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Multiple variant input format support #2037

Open adamnovak opened 5 years ago

adamnovak commented 5 years ago

@jltsiren thinks that VCF parsing with vcflib is taking a lot of time when we build graphs from the really big TOPMed VCFs.

TOPMed data is also available in BCF, and in a custom format they have developed that is supposed to be even smaller. If we could parse those instead, we could speed up our TOPMed graph construction.

We should refactor construction, simplification, haplotype generation for GBWT construction, and anywhere else we use VCF directly, so that they use our own internal variant format instead. vcflib, something to parse BCF, and maybe a parser for TOPMed's format would then be pluggable input formats.
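To make the proposal concrete, here is a rough sketch of what a pluggable input layer could look like. Everything in it is hypothetical: `InternalVariant`, `VariantSource`, `TextVariantSource`, and `count_alt_alleles` are names made up for illustration and are not existing vg or vcflib types. The toy text backend only stands in for a real vcflib, BCF, or TOPMed-format backend.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical internal variant record (not an existing vg type).
struct InternalVariant {
    std::string contig;
    std::size_t position = 0;       // 0-based start of the reference allele
    std::string ref;
    std::vector<std::string> alts;
};

// Hypothetical interface that every input format backend would implement.
class VariantSource {
public:
    virtual ~VariantSource() = default;
    // Fill `out` with the next variant; return false when the input is exhausted.
    virtual bool next(InternalVariant& out) = 0;
};

// Toy backend standing in for a real parser. It reads whitespace-separated
// lines of "contig pos ref alt1,alt2" with 1-based positions, and skips
// blank lines, lines starting with '#', and lines it cannot parse.
class TextVariantSource : public VariantSource {
public:
    explicit TextVariantSource(const std::string& text) : in_(text) {}

    bool next(InternalVariant& out) override {
        std::string line;
        while (std::getline(in_, line)) {
            if (line.empty() || line[0] == '#') continue;
            std::istringstream fields(line);
            std::string contig, ref, alts;
            std::size_t pos1 = 0;
            if (!(fields >> contig >> pos1 >> ref >> alts) || pos1 == 0) continue;
            out.contig = contig;
            out.position = pos1 - 1;  // convert to 0-based
            out.ref = ref;
            out.alts.clear();
            std::istringstream alt_stream(alts);
            std::string alt;
            while (std::getline(alt_stream, alt, ',')) out.alts.push_back(alt);
            return true;
        }
        return false;
    }

private:
    std::istringstream in_;
};

// Example consumer: it sees only the interface, never the file format.
std::size_t count_alt_alleles(VariantSource& source) {
    std::size_t total = 0;
    InternalVariant v;
    while (source.next(v)) total += v.alts.size();
    return total;
}
```

With something like this, construction, simplification, and haplotype generation would each take a `VariantSource&`, and adding BCF support would mean writing one new subclass rather than touching every caller.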

We could also save some variant file parsing passes if we could output the variants in the graph not as the alt paths we have been using, but in the GBWT library's internal parsed variant format, which is indexed by variant number. It would help further if we could send along frequency information to use for graph simplification.
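As a loose illustration of the idea, the sketch below keeps parsed variants in a table indexed by variant number and carries allele frequencies alongside them. It does not reflect the GBWT library's actual parsed variant format; `IndexedVariant`, `VariantTable`, and `rare_variant_numbers` are hypothetical names, and the "all alts below a threshold" rule is only one assumed way simplification might use the frequencies.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-variant record written once during graph construction.
struct IndexedVariant {
    std::size_t ref_start;               // 0-based offset of the reference allele
    std::string ref;
    std::vector<std::string> alts;
    std::vector<double> alt_frequency;   // parallel to `alts`
};

// The position of a record in the table is its variant number.
using VariantTable = std::vector<IndexedVariant>;

// Return the variant numbers whose alt alleles are all rarer than
// `min_frequency`, i.e. candidates to drop during simplification.
std::vector<std::size_t> rare_variant_numbers(const VariantTable& table,
                                              double min_frequency) {
    std::vector<std::size_t> rare;
    for (std::size_t i = 0; i < table.size(); ++i) {
        bool all_rare = true;
        for (double f : table[i].alt_frequency) {
            if (f >= min_frequency) {
                all_rare = false;
                break;
            }
        }
        if (all_rare) rare.push_back(i);
    }
    return rare;
}
```

A later stage could then look variants up by number and read their frequencies without going back to the VCF for another parsing pass.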

adamnovak commented 5 years ago

This is related to https://github.com/vgteam/vg/issues/354