`vg autoindex` will not preserve the input GFA node ID space if it has any nodes above 32 bp.

We have an unconditional chop when making a graph from GFA:

https://github.com/vgteam/vg/blob/5e1973315f33b5f0394f3ff0bfe88649785751ce/src/index_registry.cpp#L1822-L1830

And the chop size is set to 32, and there's no CLI interface for changing it in vg autoindex.

This means that, for Giraffe, people can feed in GFAs that vg could represent in their original node ID space (i.e. the node IDs are all numerical, and none of the nodes are too long for the actual Giraffe indexes to support), but we change the node IDs anyway. The resulting GAM files from mapping with Giraffe can't be understood in terms of the original GFAs.

You could run Giraffe with --named-coordinates to get output in the original GFA space still, but then it is in the name field in the GAM positions and not in the node_id field, so you also need to so some custom GAM surgery if you want a GAM that will validate against the original GFA with vg validate. Something like:

vg view -aj mapped.gam | jq '.path.mapping = [(.path.mapping // [])[] | (.position.node_id = .position.name) | (.position.name = "")]' | vg view -JGa - >gfa_coordinates.gam

This is of course going to be very slow.

I think this is the root cause of https://github.com/vgteam/sequenceTubeMap/issues/360

To fix this, we could:

Find a way to not chop the GFA.
Express the GAM in numbered original GFA coordinates by default or with an option.
Teach downstream GAM reading tools in vg how to use name fields in positions.
Loudly complain and explain what the user is going to need to do to get output in their GFA's space in a format other vg tools can continue to work with, when we decide to chop nodes or re-ID a GFA file when indexing.

vgteam / vg

`vg autoindex` will not preserve the input GFA node ID space if it has any nodes above 32 bp. #4148