Closed pioneer-pi closed 2 months ago
It looks like you are providing multiple graphs to vg snarls
. Is that correct? If so, it could be that your graphs don't have disjoint node IDs. It might also be relevant how you extracted the subgraph and created that image.
@jeizenga Exactly, I am executing Working with a whole genome variation graph process.
(seq 1 22; echo X; echo Y) | parallel -j 24 "time vg construct -C -R {} -r $ref -v $vars -t 1 -m 32 >{}.vg"
to create 24 vg files.vg ids -j $(for i in $(seq 1 22; echo X; echo Y); do echo $i.vg; done)
to generate a joint id space across each graphvg index -x wg.xg $(for i in $(seq 22; echo X; echo Y); do echo $i.vg; done)
to create xg index.@jeizenga I found another question, when I use
for chr in $(seq 1 22; echo X; echo Y); do vg prune -r $chr.vg > $chr.pruned.vg done
to prune graph, the path disappered. So I can't find a subgraph through vg find
to locate a special region.
I think the issue is that you're only extracting the path itself and not the non-reference variants. If you want to use vg find
for this, you should also specify a context length (-c
) to extract the surrounding nodes as well. Alternatively, you can use vg chunk
with the -S
option to extract the subregion and all its variants.
Can you show me the output of vg paths -L
on the pruned graph?
@jeizenga Thanks for you replying. The code vg paths -L -v 1.pruned.vg
has no output. I found another thing: I use vg find -p 1:10000000-10001000 -x 1.xg > sub1.vg
to get a subgraph. Then I use vg snarls sub1.vg > sub1.snarls
to get a snarl file. I found the context in this snarl file are correct. It show the bubble correctly. But When I get a snarl file through 'vg snarls 1.vg > 1.snarls'(It means according the whole 1 file to get snarls), there are many 'fake bubbles'(There are no bubbles in variation graph, but recorded in the snarl file).
Regarding vg prune
, I've had a look and I think that this is expected behavior. The vg prune
algorithm is meant as a preprocessing step for creating a GCSA2 index, and the GCSA2 construction algorithm doesn't make any use of the paths, so vg prune
doesn't bother adding the paths to the pruned graph, even if you use -r
. The -r
option only guarantees that edges on paths won't be pruned away.
It looks like you're also using vg find
without the -c
option when you use it first. I would expect that you only see insertion/deletion bubbles when you do that, because you're not extracting the substitution nodes. Is that correct? If so, the problem is the graph extraction with vg find
, not the algorithm in vg snarls
.
@jeizenga That's right! And how to use vg find -c? I can't understand what information need to add after vg find -c and its effect.
The vg find help is:
-c, --context STEPS expand the context of the subgraph this many steps
It will also extract all of the nodes that you can reach from the path by following STEPS
edges from one of the nodes on the paths. It's a bit tricky to parameterize it well, to be honest, because the correct value depends on the length of the variants. I generally prefer using vg chunk -S
to extract graph regions.
@jeizenga Thank you very much! And I have a confusion about the length of node sequence. In the subgraph, I found there are many nodes which length is dozens of bp or even several bp, but there are no variant(no bubble). Why these nodes aren't merged together
That's a common point of confusion. It's basically for algorithmic convenience. In several places, we have a hard limit of 1023 bases because we only use 10 bits to express positions on the node. Usually, we set the maximum node length quite a bit shorter than the hard limit (typically 32 bp) because that lets us indicate subgraphs more precisely without "overshooting" the region that we want.
@jeizenga Thank you! https://github.com/vgteam/vg/issues/4206#issue-2074485853 The original question still exists.
The following is my operation (I follow the operation of working with a whole genome variation graph):
https://github.com/vgteam/vg/issues/4206#issuecomment-1886097522
I did vg ids
operation, but there is also the question that the snarl file information is incorrect. Some snarl in file does not exist.
Try using vg find -p 1:10000000-10001000 -x 1.xg -c 2
to extract the graph before visualizing it.
@jeizenga I did it and visualized it , found there are new bubbles. So which graph is correct? Why I use vg find
without -c
will get a subgraph has a little bubbles? If I want to extract the real subgraph of a region, what's the value of -c
should be?
@jeizenga In addition, I want to know the meaning of snarl file field.
For example, the meaning of type
field in the following line:
{"directed_acyclic_net_graph": true, "end": {"node_id": "1106504"}, "start": {"node_id": "1106501"}, "start_end_reachable": true, "type": 1}
Neither the -c
graph nor the non--c
graph is incorrect. They're just different queries. When you query the graph without -c
, you get only the nodes of the path itself. When you query it with -c 1
, you also get those nodes' neighbors, which include the alternate alleles of SNP bubbles. If you query it with -c 2
you also get the neighbors' neighbors, etc. I suspect the graph queried with -c
is closer to what you want, but there's no firm rule about how large -c
needs to be to always get the alternate alleles.
I am saying this again: I strongly recommend using vg chunk
with -S
instead of vg find
. If you do that, you are guaranteed to get all of the bubbles.
The types of snarls are described in the schema. This is an "ultrabubble", which means that it contains no cycles or inversions. https://github.com/vgteam/libvgio/blob/master/deps/vg.proto#L246-L251
@jeizenga Now I use vg pack to get a packfile, They are read coverage in each base of variation graph. I want to get all the reads mapping to a special base position.For examples, I know base which node_id=3 and node_offset=0 has a value of read coverage, which is 53. Can I get all the 53 reads?
seq.pos node.id node.offset coverage
4 3 0 53
I get a snarl file through
vg snarls *.xg > *.snarls
. What's the information of snarls file? I thought it was bubble of graph(it means between the start node and end node of a snarl, there has variations). However, I found it wasn't. There are some snarls that don't have variation in it. For example, the first picture is two lines in a snarl file. The second picture is the corresponding subgraph. The second line of first picture shows there is a snarl between node 1106501 and 1106504. However, there is no variation according to the subgraph. So I wonder what's the real meaning of information in a snarl file?Thanks!