vg find outputs wrong nodes

esrice commented 1 year ago

1. What were you trying to do? Extract a subgraph from a larger graph, using this command:

vg find -p 'bGalGal1b#0#chrZ:11159196-11400464' -x pangenome.gbz > k_locus.gbz

2. What did you want to happen? I would expect the nodes extracted for this region to be the same nodes that the VCF refers to for this region. For example, the VCF header contains this line:

##contig=<ID=bGalGal1b#0#chrZ,length=86044486>

and subsetting the vcf to the same region returns lines like this:

chrZ    11237453        >47495243>47495248      G       GGTAGTGAAGCCT

3. What actually happened? vg find returned a subgraph that does not contain nodes 47495243 and 47495248; instead, the node IDs are in the range 29961990-29981305. On examining the subgraph structure in Bandage, it does not appear that the node IDs are merely shifted. Rather, the subgraph is not the part of the graph covered by bGalGal1b#0#chrZ:11159196-11400464 as requested.

4. If you got a line like Stack trace path: /somewhere/on/your/computer/stacktrace.txt, please copy-paste the contents of that file here:

NA

5. What data and command can the vg dev team use to make the problem happen?

The gbz and vcf are direct output from the minigraph-cactus pipeline. I can share these files if necessary.

6. What does running vg version say?

vg version v1.46.0 "Altamura"
Compiled with g++ (Ubuntu 10.3.0-1ubuntu1~20.04) 10.3.0 on Linux
Linked against libstd++ 20210408
Built by xian@octo

esrice commented 1 year ago

Sorry, I think the issue is that the gfa version of the graph output by minigraph-cactus is not the same as the gbz version. Not an issue with vg.

glennhickey commented 1 year ago

IDs are different between GFA and GBZ (which is an ongoing source of confusion). Please see here for more information:

https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md#node-chopping
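For anyone hitting the same mismatch, a quick way to confirm that two files use different ID spaces is to pull the node IDs out of the VCF ID column and test them against the ID range seen in the extracted subgraph. The following is a minimal Python sketch, not part of vg or cactus: the helper names `snarl_node_ids` and `ids_in_range` are made up for illustration, and it assumes the VCF ID field has the `>start>end` snarl form shown in the report above.

```python
import re


def snarl_node_ids(vcf_id):
    """Return the node IDs named in a snarl-style VCF ID such as '>1>5'.

    Assumes each node is written as '>' or '<' (orientation) followed by digits.
    """
    return [int(n) for n in re.findall(r"[<>](\d+)", vcf_id)]


def ids_in_range(node_ids, lo, hi):
    """Map each node ID to True if it lies within [lo, hi], else False."""
    return {n: lo <= n <= hi for n in node_ids}


# Values taken from the report above: the VCF record ID and the
# node ID range observed in the subgraph returned by vg find.
ids = snarl_node_ids(">47495243>47495248")
report = ids_in_range(ids, 29961990, 29981305)
print(report)  # {47495243: False, 47495248: False}
```

If every ID falls outside the subgraph's range, as it does here, that suggests the two files do not share node IDs, and the IDs need to be translated before comparing them.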

esrice commented 1 year ago

Thanks, got it.

vgteam / vg

vg find outputs wrong nodes #3928