Open pioneer-pi opened 2 months ago
If you want Python code to read vg's formats directly, you can try either the libbdsg Python bindings which can read most .vg files (which are usually now in PackedGraph format), or pystream-protobuf which knows how to decode vg's Protobuf-based formats like .snarls, .gam, and .vg as written straight from vg construct
. If you have a .vg that isn't in PackedGraph format internally and you want to use it with libbdsg, vg convert
can convert it.
If you do want to convert to a text-based format, you have some options in vg:
You can use vg convert --gfa-out graph.vg
to convert graphs to the text-based standard GFA format.
You can use vg convert --gam-to-gaf whatever.gam
to convert a GAM alignment file to the standard text-based GAF format.
The vg view
command also has options to dump miscellaneous formats like Protobuf .snarls to a JSON representation. You can say vg --snarl-in --json whatever.snarls
and it will print out JSON.
Thank you, I get it. I want to read gam format and i check out the pystream-protobuf. I found that I need to import stream and vg_pb2 to parse GAM file. I install stream, but I don't know how to use vg_pb2.
vg_pb2
is the Python module generated by Protobuf to provide Python language bindings for the Protobuf-defined types that vg uses (such as Alignment
, which represents a GAM read). It doesn't really quite have its own documentation, but you can look at the Protobuf definition of an Alignment message, or the Doxygen docs for Alignment, and at the documentation for how Protobuf messages get their fields and accessors exposed to Python.
It looks like all the non-message fields just become Python fields. So if you want to print out all the read names, you would get the Alignment
type out of vg_pb2
, you would use stream
to loop over the file looking for Alignment
s, and for each alignment a
you would print out a.name
.
@adamnovak Thank you! And I have another question: I use vg snarl to get the snarl fie of a variation graph. Can I get the reads that mapping to each snarl?(It means I can get snarl and reads that match to a special snarl) Does vg support it?
It looks like you can take your snarls file and provide it to vg chunk
with the --snarls
option. Then you can use --path
or --path-list
to ask for region(s) along a path in the graph that goes through the snarl, and vg chunk
will use the snarl information to pull out complete snarls. And if you use --gam-name
with vg chunk
, it will pull out reads from that (sorted and indexed) GAM instead of the graph.
But we don't have a way to ask vg chunk
to pull a snarl directly by giving its bounding node ends. So you have to do something like:
123
and 456
)vg find -x graph.vg -n 123 | vg paths --list
)vg find -x graph.vg -n 123 --position-in pathname
)pathname:1000-2000
).gai
file (vg gamsort reads.gam -i reads.sorted.gam.gai >reads.sorted.gam
)vg chunk -x graph.vg -a reads.sorted.gam --path pathname:1000-2000 --snarls whatever.snarls
)This would work way better if we actually had C++ code somewhere to pull all the nodes in a specified snarl, or in each snarl, and grab the reads touching them.
If you want to do it as a batch process and not just look at one snarl and its reads, you can make it work with pystream-protobuf
and the new snarl decomposition in the libbdsg Python bindings which presents the new SnarlDecomposition API, but that loads snarls in a different format. You would vg index graph.vg -j graph.snarlsonly.dist --snarl-limit 0
to build it, and then load it up:
from bdsg.bdsg import SnarlDistanceIndex
index = SnarlDistanceIndex()
index.deserialize("graph.snarlsonly.dist")
And you would also need the graph loaded in order to traverse the "nets" that the new snarl decomposition API uses:
from bdsg.bdsg import PackedGraph
graph = PackedGraph()
graph.deseriaslize("graph.vg")
But then you can loop over all your reads with pystream-protobuf
, and go through the node_id
in the position
of each entry in mapping
in the Alignment's path
, and then for each of them figure out what snarls the node is in. (Since the snarl tree is a tree, each node can be in several nested snarls, back up to the root.) Something like:
node_handle = graph.get_handle(node_id, False)
node_net_handle = index.get_net(node_handle, graph)
here_net_handle = index.canonical(node_net_handle)
node_snarl_bounds = []
while not index.is_root(here_net_handle):
if index.is_snarl(here_net_handle):
# Make a list of tuples of the node ID and orientation for the nodes before and after the snarl.
bounds = []
def visit(net_handle):
handle = index.get_handle(net_handle, graph)
bounds.append((graph.get_id(handle), graph.get_is_reverse(handle))
# Get the node before the snarl
index.follow_net_edges(here_net_handle, graph, True, visit)
# Get the node after the snarl
index.follow_net_edges(here_net_handle, graph, False, visit)
# Put them in the list of ancestor snarl bound pairs for the node we started at
node_snarl_bounds.append(tuple(bounds))
here_net_handle = index.canonical(index.get_parent(here_net_handle))
Then you can collect all the unique snarls that all the nodes visited by the read appear in, and then take the read and store it somewhere where you can look at it when you want to think about each of those snarls.
You might want to skip the trivial snarls that don't have any contents between their bounding nodes. I think you can ID them because for_each_child
on them won't visit anything.
Now If I want to execute vg's special format files like .vg,.snarls,.gam and so on. How can I read it directly to python with out transfer the special format to normal format(like json or txt)?
Thank you!