vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

`vg convert --vg-algorithm` loses start coordinates of paths in its W lines #4377

Closed by zhangyixing3 1 month ago

zhangyixing3 commented 2 months ago

Dear sir, I would like to convert a filtered pangenome GBZ index back to the original GFA. To keep the nodes consistent with the full or clip graph, I used the --vg-algorithm option. However, I noticed that the path information for non-reference paths in the 97samples.d10.vg.algorithm.gfa file seems to be incorrect. Is this expected?

vg convert -f 97samples.d10.gbz --vg-algorithm > 97samples.d10.vg.algorithm.gfa
vg convert -f 97samples.d10.gbz > 97samples.d10.gfa

Results:

  1. 97samples.d10.gfa: OK

    grep "^W" 97samples.d10.gfa | less -S


  2. 97samples.d10.vg.algorithm.gfa: the coordinate information is lost, and all start positions are 0

    grep "^W" 97samples.d10.vg.algorithm.gfa | less -S

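The symptom can also be checked without reading the grep output by eye. The following is a minimal sketch, not part of vg: it assumes the tab-separated GFA 1.1 walk layout (W, sample, haplotype index, sequence name, start, end, walk), and the helper names `parse_w_lines` and `zero_start_collisions` are made up for illustration. It reports every sample/haplotype/contig combination that has more than one walk starting at 0, which is a heuristic sign that fragment offsets were dropped.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, NamedTuple, Optional, Tuple


class Walk(NamedTuple):
    sample: str
    haplotype: int
    contig: str
    start: Optional[int]  # None when the GFA field is "*"
    end: Optional[int]    # None when the GFA field is "*"


def _coord(field: str) -> Optional[int]:
    """GFA 1.1 allows '*' for an unknown start or end position."""
    return None if field == "*" else int(field)


def parse_w_lines(lines: Iterable[str]) -> Iterator[Walk]:
    """Yield one Walk per W line; every other record type is skipped."""
    for line in lines:
        if not line.startswith("W\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # malformed W line, ignore it in this sketch
        _, sample, hap, contig, start, end = fields[:6]
        yield Walk(sample, int(hap), contig, _coord(start), _coord(end))


def zero_start_collisions(
    walks: Iterable[Walk],
) -> Dict[Tuple[str, int, str], int]:
    """Count walks starting at 0 per (sample, haplotype, contig).

    Only keys with more than one such walk are returned. This is a
    heuristic: several fragments of the same haplotype and contig all
    claiming start 0 suggests the offsets were lost.
    """
    counts = Counter(
        (w.sample, w.haplotype, w.contig) for w in walks if w.start == 0
    )
    return {key: n for key, n in counts.items() if n > 1}
```

To try it on a real file, pass an open file handle, for example `zero_start_collisions(parse_w_lines(open("97samples.d10.vg.algorithm.gfa")))`. An empty result on 97samples.d10.gfa and a non-empty one on the --vg-algorithm output would match the behaviour described above.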

adamnovak commented 2 months ago

It looks like the VG algorithm is meant to preserve the start offset of the path: https://github.com/vgteam/vg/blob/e9fbbc31506a0364b222a7a328bbaec7edd7ffa6/src/gfa.cpp#L197-L204

Maybe the GBZ is not actually exposing these paths as having start offsets? If you run vg paths --metadata --sample 001_6137 -x 97samples.d10.gbz, do these paths claim to have NO_SUBRANGE or do they properly list their subrange coordinates on the base path?

jltsiren commented 2 months ago

The PathMetadata implementation for GBWTGraph assumes that subranges can only exist for reference/generic paths, and only haplotype paths can have phase blocks. The GBWTGraph algorithm avoids the issue, because it works with GBWT / GBWTGraph semantics rather than libhandlegraph semantics.

@zhangyixing3 The underlying issue is that GBZ was designed to both store the original GFA and expose an equivalent graph with integer node identifiers and nodes no longer than 1024 bp. If you use the GBWTGraph algorithm to convert GBZ back to GFA, you get the original GFA, where segments can have string names and be arbitrarily long. If you want a GFA with integer node identifiers and short nodes, you can use the --no-translation option with the GBWTGraph algorithm.
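To see which of the two forms a converted GFA is in, you can scan its S lines. This is a rough sketch rather than a vg feature: `segment_stats` is a hypothetical helper, and the 1024 bp default simply follows the limit mentioned above. It counts segments whose names are not plain integers and segments longer than the limit, using the `LN:i:` tag when the sequence field is `*`.

```python
from typing import Dict, Iterable, List, Optional


def _segment_length(fields: List[str]) -> Optional[int]:
    """Length of an S-line segment, or None if it cannot be determined."""
    seq = fields[2]
    if seq != "*":
        return len(seq)
    # Sequence omitted: fall back to the optional LN:i:<length> tag.
    for tag in fields[3:]:
        if tag.startswith("LN:i:"):
            return int(tag[len("LN:i:"):])
    return None


def segment_stats(lines: Iterable[str], max_len: int = 1024) -> Dict[str, int]:
    """Summarise S lines: total, non-integer names, segments over max_len."""
    stats = {"total": 0, "non_integer_names": 0, "too_long": 0}
    for line in lines:
        if not line.startswith("S\t"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed S line, ignore it in this sketch
        stats["total"] += 1
        name = fields[1]
        if not (name.isascii() and name.isdigit()):
            stats["non_integer_names"] += 1
        length = _segment_length(fields)
        if length is not None and length > max_len:
            stats["too_long"] += 1
    return stats
```

If both `non_integer_names` and `too_long` come back as 0, the file is consistent with the integer-identifier, short-node form of the graph; anything else suggests the original segments were restored.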

zhangyixing3 commented 1 month ago

Thank you. The --vg-algorithm parameter indeed results in W lines representing offsets. Using --no-translation, I successfully obtained a graph in which node lengths do not exceed 1024 bp, along with the coordinate information.