vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

json format #4011

Open sdws1983 opened 1 year ago

sdws1983 commented 1 year ago

Hi,

I am using vg view -aM filtered.gam to view the alignments in my output file, and I am trying to interpret the JSON format. I know some basic concepts, such as offset being the number of bases of offset and sequence being the mutated sequence, but I don't quite understand the meaning of rank. In some alignments, the same rank appears multiple times:

image

In addition, there is some node information, such as:

{"edit": [{"from_length": 6}, {"from_length": 1, "to_length": 1}], "position": {"node_id": "180334880"}, "rank": "2" }

Why does from_length appear twice?

Is there any documentation that explains the JSON format?

Thanks

Yumin

glennhickey commented 1 year ago

Please see here for documentation of the various formats

https://github.com/vgteam/vg/wiki/File-Formats

In particular, that page links to the Protobuf definitions for gam alignments

https://github.com/vgteam/libvgio/blob/eb1fe76878aff8f26f0a2f38a1c133ec2f353e57/deps/vg.proto#L109-L151

For the duplicate ranks: that looks like a bug. But I don't think any vg code uses ranks in alignment paths for anything.

For the two from_lengths: that's because your edit array contains two Edits, and each one has its own from_length.
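
A minimal sketch of that point in Python, using the mapping quoted in the question. The helper name edit_lengths is made up for illustration, and the sketch assumes that a field missing from the JSON (here, to_length in the first Edit) holds its default value of 0:

```python
import json

# The mapping quoted in the question, as printed by vg view.
mapping_json = (
    '{"edit": [{"from_length": 6}, {"from_length": 1, "to_length": 1}], '
    '"position": {"node_id": "180334880"}, "rank": "2"}'
)

def edit_lengths(mapping):
    """Return a (from_length, to_length) pair for each Edit in a mapping dict."""
    # Assumption: a field that is absent from the JSON is treated as 0.
    return [
        (int(e.get("from_length", 0)), int(e.get("to_length", 0)))
        for e in mapping.get("edit", [])
    ]

mapping = json.loads(mapping_json)
lengths = edit_lengths(mapping)

# Totals over the whole mapping: graph bases covered and read bases covered.
graph_bases = sum(f for f, _ in lengths)
read_bases = sum(t for _, t in lengths)
print(lengths, graph_bases, read_bases)
```

Under that assumption, the mapping contains two separate Edits, and each one carries its own from_length.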

sdws1983 commented 1 year ago

So, what do from_length and to_length mean? I do not fully understand them.

jeizenga commented 1 year ago

It's phrased as if you are modifying the graph sequence into the read sequence: you take a stretch of ref sequence of length from_length and replace it with read sequence of length to_length. In the case of a match, you might replace it with the same sequence again.
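
That description can be sketched as a small Python classifier. This is an illustration, not vg code: classify_edit is a hypothetical helper, the labels are my own, and it assumes that fields absent from the JSON are 0 or empty and that an Edit whose lengths are equal and which has no sequence is a match.

```python
def classify_edit(edit):
    """Label one Edit dict (parsed from vg JSON) by comparing its lengths."""
    # Assumption: fields missing from the JSON default to 0 / empty.
    from_len = int(edit.get("from_length", 0))  # bases taken from the graph
    to_len = int(edit.get("to_length", 0))      # bases they are replaced with
    has_sequence = bool(edit.get("sequence"))

    if from_len == to_len and not has_sequence:
        return "match"          # replaced with the same sequence
    if from_len == to_len:
        return "substitution"   # same length, different bases
    if from_len == 0:
        return "insertion"      # read bases with no graph bases consumed
    if to_len == 0:
        return "deletion"       # graph bases with no read bases
    return "complex"            # unequal, non-zero lengths

examples = [
    {"from_length": 6},
    {"from_length": 1, "to_length": 1},
    {"from_length": 1, "to_length": 1, "sequence": "T"},
    {"to_length": 2, "sequence": "AC"},
]
labels = [classify_edit(e) for e in examples]
print(labels)
```

Read this way, the first Edit in the question ({"from_length": 6} with no to_length) would be a 6-base deletion relative to the graph, followed by a 1-base match.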