vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

`vg autoindex` will not preserve the input GFA node ID space if it has any nodes above 32 bp. #4148

Open adamnovak opened 10 months ago

adamnovak commented 10 months ago

We have an unconditional chop when making a graph from GFA:

https://github.com/vgteam/vg/blob/5e1973315f33b5f0394f3ff0bfe88649785751ce/src/index_registry.cpp#L1822-L1830

And the chop size is set to 32, and there's no CLI interface for changing it in vg autoindex.

This means that, for Giraffe, people can feed in GFAs that vg could represent in their original node ID space (i.e. the node IDs are all numerical, and none of the nodes are too long for the actual Giraffe indexes to support), but we change the node IDs anyway. The resulting GAM files from mapping with Giraffe can't be understood in terms of the original GFAs.

You could run Giraffe with --named-coordinates to get output in the original GFA space still, but then it is in the name field in the GAM positions and not in the node_id field, so you also need to so some custom GAM surgery if you want a GAM that will validate against the original GFA with vg validate. Something like:

vg view -aj mapped.gam | jq '.path.mapping = [(.path.mapping // [])[] | (.position.node_id = .position.name) | (.position.name = "")]' | vg view -JGa - >gfa_coordinates.gam

This is of course going to be very slow.

I think this is the root cause of https://github.com/vgteam/sequenceTubeMap/issues/360

To fix this, we could:

TheHarshShow commented 7 months ago

Yes, I was also facing this issue. I converted the GBZ back to GFA using GBWT-Graph with --no-translations to get the node IDs of the new graph.