I have a question while using vg.

pioneer-pi commented 2 months ago

Now If I want to execute vg's special format files like .vg,.snarls,.gam and so on. How can I read it directly to python with out transfer the special format to normal format(like json or txt)?

Thank you!

adamnovak commented 2 months ago

If you want Python code to read vg's formats directly, you can try either the libbdsg Python bindings which can read most .vg files (which are usually now in PackedGraph format), or pystream-protobuf which knows how to decode vg's Protobuf-based formats like .snarls, .gam, and .vg as written straight from vg construct. If you have a .vg that isn't in PackedGraph format internally and you want to use it with libbdsg, vg convert can convert it.

If you do want to convert to a text-based format, you have some options in vg:

You can use vg convert --gfa-out graph.vg to convert graphs to the text-based standard GFA format.

You can use vg convert --gam-to-gaf whatever.gam to convert a GAM alignment file to the standard text-based GAF format.

The vg view command also has options to dump miscellaneous formats like Protobuf .snarls to a JSON representation. You can say vg --snarl-in --json whatever.snarls and it will print out JSON.

pioneer-pi commented 2 months ago

Thank you, I get it. I want to read gam format and i check out the pystream-protobuf. I found that I need to import stream and vg_pb2 to parse GAM file. I install stream, but I don't know how to use vg_pb2.

adamnovak commented 2 months ago

vg_pb2 is the Python module generated by Protobuf to provide Python language bindings for the Protobuf-defined types that vg uses (such as Alignment, which represents a GAM read). It doesn't really quite have its own documentation, but you can look at the Protobuf definition of an Alignment message, or the Doxygen docs for Alignment, and at the documentation for how Protobuf messages get their fields and accessors exposed to Python.

It looks like all the non-message fields just become Python fields. So if you want to print out all the read names, you would get the Alignment type out of vg_pb2, you would use stream to loop over the file looking for Alignments, and for each alignment a you would print out a.name.

pioneer-pi commented 2 months ago

@adamnovak Thank you! And I have another question: I use vg snarl to get the snarl fie of a variation graph. Can I get the reads that mapping to each snarl?(It means I can get snarl and reads that match to a special snarl) Does vg support it?

adamnovak commented 2 months ago

It looks like you can take your snarls file and provide it to vg chunk with the --snarls option. Then you can use --path or --path-list to ask for region(s) along a path in the graph that goes through the snarl, and vg chunk will use the snarl information to pull out complete snarls. And if you use --gam-name with vg chunk, it will pull out reads from that (sorted and indexed) GAM instead of the graph.

But we don't have a way to ask vg chunk to pull a snarl directly by giving its bounding node ends. So you have to do something like:

Get the node IDs bounding your snarl (123 and 456)
Get the lists of paths on each of them (vg find -x graph.vg -n 123 | vg paths --list)
Pick a path that both are on and get each node's positions on that path (vg find -x graph.vg -n 123 --position-in pathname)
Come up with the path interval you are interested in that you think the snarl covers (pathname:1000-2000)
Make sure your GAM is sorted and indexed fro random access and has a .gai file (vg gamsort reads.gam -i reads.sorted.gam.gai >reads.sorted.gam)
Fetch out the reads (vg chunk -x graph.vg -a reads.sorted.gam --path pathname:1000-2000 --snarls whatever.snarls)

This would work way better if we actually had C++ code somewhere to pull all the nodes in a specified snarl, or in each snarl, and grab the reads touching them.

If you want to do it as a batch process and not just look at one snarl and its reads, you can make it work with pystream-protobuf and the new snarl decomposition in the libbdsg Python bindings which presents the new SnarlDecomposition API, but that loads snarls in a different format. You would vg index graph.vg -j graph.snarlsonly.dist --snarl-limit 0 to build it, and then load it up:

from bdsg.bdsg import SnarlDistanceIndex
index = SnarlDistanceIndex()
index.deserialize("graph.snarlsonly.dist")

And you would also need the graph loaded in order to traverse the "nets" that the new snarl decomposition API uses:

from bdsg.bdsg import PackedGraph
graph = PackedGraph()
graph.deseriaslize("graph.vg")

But then you can loop over all your reads with pystream-protobuf, and go through the node_id in the position of each entry in mapping in the Alignment's path, and then for each of them figure out what snarls the node is in. (Since the snarl tree is a tree, each node can be in several nested snarls, back up to the root.) Something like:

node_handle = graph.get_handle(node_id, False)
node_net_handle = index.get_net(node_handle, graph)
here_net_handle = index.canonical(node_net_handle)
node_snarl_bounds = []
while not index.is_root(here_net_handle):
    if index.is_snarl(here_net_handle):
        # Make a list of tuples of the node ID and orientation for the nodes before and after the snarl.
        bounds = []
        def visit(net_handle):
            handle = index.get_handle(net_handle, graph)
            bounds.append((graph.get_id(handle), graph.get_is_reverse(handle))
        # Get the node before the snarl
        index.follow_net_edges(here_net_handle, graph, True, visit)
        # Get the node after the snarl
        index.follow_net_edges(here_net_handle, graph, False, visit)
        # Put them in the list of ancestor snarl bound pairs for the node we started at
        node_snarl_bounds.append(tuple(bounds))
    here_net_handle = index.canonical(index.get_parent(here_net_handle))

Then you can collect all the unique snarls that all the nodes visited by the read appear in, and then take the read and store it somewhere where you can look at it when you want to think about each of those snarls.

adamnovak commented 2 months ago

You might want to skip the trivial snarls that don't have any contents between their bounding nodes. I think you can ID them because for_each_child on them won't visit anything.

vgteam / vg

I have a question while using vg. #4261