vg chunk makes up region sizes based on bounding ID differences, which can be wrong or negative

adamnovak commented 3 years ago

Apparently when we wrote the code for vg chunk haplotype tracing, we thought that all graphs would have one ID increment per node along whatever path we're using to define our region of interest. We determine the number of steps to trace haplotypes along by subtracting the node ID at the start position on our path from the one at the end position.

https://github.com/vgteam/vg/blob/acc09ad248adbf2d92f16111b563118f5279ac45/src/subcommand/chunk_main.cpp#L637-L649

This won't be correct even with the VCF-based graphs: we'll end up tracing slightly too far along, say, the all-ref haplotypes, because some node IDs will be used in variants. But now with the HPRC non-VCF-based graphs, we're seeing reference paths that can run through regions of the graph in decreasing node ID order, so the step counts we compute this way can actually be negative, because the node ID at the end of the range on the path is smaller than the one at the start.

This seems to cause the haplotypes to be traced out very far (possibly to the end of the graph?). It might be because of the comparison here between a 64-bit unsigned count of the steps we've done, and the signed 32-bit int holding the inconveniently negative limit:

https://github.com/vgteam/vg/blob/acc09ad248adbf2d92f16111b563118f5279ac45/src/haplotype_extracter.cpp#L227

You can then end up with a lot of what should be identical haplotypes being reported as distinct. I've added an example test for this in 3288111a4bc97a07f24bed1c2c4ad28cf86f195b in the trace-ordering branch.

glennhickey commented 3 years ago

That code doesn't make any sense. Sorry! Happy to fix, but would have to be in a week.

glennhickey commented 3 years ago

This is improved in #3386 but looking at the haplotype extractor code, i think it'll still need some work to be fully general (ie support all cases of reference steps on arbitrary strands) .

vgteam / vg

vg chunk makes up region sizes based on bounding ID differences, which can be wrong or negative #3375