vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.08k stars 191 forks source link

I have a question about x.snarls file. #4206

Closed pioneer-pi closed 2 months ago

pioneer-pi commented 5 months ago

I get a snarl file throughvg snarls *.xg > *.snarls. What's the information of snarls file? I thought it was bubble of graph(it means between the start node and end node of a snarl, there has variations). However, I found it wasn't. There are some snarls that don't have variation in it. For example, the first picture is two lines in a snarl file. The second picture is the corresponding subgraph. The second line of first picture shows there is a snarl between node 1106501 and 1106504. However, there is no variation according to the subgraph. So I wonder what's the real meaning of information in a snarl file?

image 1704896354552

Thanks!

jeizenga commented 5 months ago

It looks like you are providing multiple graphs to vg snarls. Is that correct? If so, it could be that your graphs don't have disjoint node IDs. It might also be relevant how you extracted the subgraph and created that image.

pioneer-pi commented 5 months ago

@jeizenga Exactly, I am executing Working with a whole genome variation graph process.

  1. I use (seq 1 22; echo X; echo Y) | parallel -j 24 "time vg construct -C -R {} -r $ref -v $vars -t 1 -m 32 >{}.vg" to create 24 vg files.
  2. I use vg ids -j $(for i in $(seq 1 22; echo X; echo Y); do echo $i.vg; done) to generate a joint id space across each graph
  3. I use vg index -x wg.xg $(for i in $(seq 22; echo X; echo Y); do echo $i.vg; done) to create xg index.
  4. I use `vg find -p 1:10000000-10001000 -x 1.xg | vg view -d - > viz.dot ' to check the subgraph.
  5. I use 'vg snarls 1.xg > 1.snarls" to get all the snarls. Then I get the above results.
pioneer-pi commented 5 months ago

@jeizenga I found another question, when I use for chr in $(seq 1 22; echo X; echo Y); do vg prune -r $chr.vg > $chr.pruned.vg done to prune graph, the path disappered. So I can't find a subgraph through vg find to locate a special region.

jeizenga commented 5 months ago

I think the issue is that you're only extracting the path itself and not the non-reference variants. If you want to use vg find for this, you should also specify a context length (-c) to extract the surrounding nodes as well. Alternatively, you can use vg chunk with the -S option to extract the subregion and all its variants.

Can you show me the output of vg paths -L on the pruned graph?

pioneer-pi commented 5 months ago

@jeizenga Thanks for you replying. The code vg paths -L -v 1.pruned.vg has no output. I found another thing: I use vg find -p 1:10000000-10001000 -x 1.xg > sub1.vg to get a subgraph. Then I use vg snarls sub1.vg > sub1.snarls to get a snarl file. I found the context in this snarl file are correct. It show the bubble correctly. But When I get a snarl file through 'vg snarls 1.vg > 1.snarls'(It means according the whole 1 file to get snarls), there are many 'fake bubbles'(There are no bubbles in variation graph, but recorded in the snarl file).

jeizenga commented 5 months ago

Regarding vg prune, I've had a look and I think that this is expected behavior. The vg prune algorithm is meant as a preprocessing step for creating a GCSA2 index, and the GCSA2 construction algorithm doesn't make any use of the paths, so vg prune doesn't bother adding the paths to the pruned graph, even if you use -r. The -r option only guarantees that edges on paths won't be pruned away.

It looks like you're also using vg find without the -c option when you use it first. I would expect that you only see insertion/deletion bubbles when you do that, because you're not extracting the substitution nodes. Is that correct? If so, the problem is the graph extraction with vg find, not the algorithm in vg snarls.

pioneer-pi commented 5 months ago

@jeizenga That's right! And how to use vg find -c? I can't understand what information need to add after vg find -c and its effect. The vg find help is: -c, --context STEPS expand the context of the subgraph this many steps

jeizenga commented 5 months ago

It will also extract all of the nodes that you can reach from the path by following STEPS edges from one of the nodes on the paths. It's a bit tricky to parameterize it well, to be honest, because the correct value depends on the length of the variants. I generally prefer using vg chunk -S to extract graph regions.

pioneer-pi commented 5 months ago

@jeizenga Thank you very much! And I have a confusion about the length of node sequence. In the subgraph, I found there are many nodes which length is dozens of bp or even several bp, but there are no variant(no bubble). Why these nodes aren't merged together

jeizenga commented 5 months ago

That's a common point of confusion. It's basically for algorithmic convenience. In several places, we have a hard limit of 1023 bases because we only use 10 bits to express positions on the node. Usually, we set the maximum node length quite a bit shorter than the hard limit (typically 32 bp) because that lets us indicate subgraphs more precisely without "overshooting" the region that we want.

pioneer-pi commented 5 months ago

@jeizenga Thank you! https://github.com/vgteam/vg/issues/4206#issue-2074485853 The original question still exists. The following is my operation (I follow the operation of working with a whole genome variation graph): https://github.com/vgteam/vg/issues/4206#issuecomment-1886097522 I did vg ids operation, but there is also the question that the snarl file information is incorrect. Some snarl in file does not exist.

jeizenga commented 5 months ago

Try using vg find -p 1:10000000-10001000 -x 1.xg -c 2 to extract the graph before visualizing it.

pioneer-pi commented 5 months ago

@jeizenga I did it and visualized it , found there are new bubbles. So which graph is correct? Why I use vg find without -cwill get a subgraph has a little bubbles? If I want to extract the real subgraph of a region, what's the value of -c should be?

pioneer-pi commented 5 months ago

@jeizenga In addition, I want to know the meaning of snarl file field. For example, the meaning of type field in the following line: {"directed_acyclic_net_graph": true, "end": {"node_id": "1106504"}, "start": {"node_id": "1106501"}, "start_end_reachable": true, "type": 1}

jeizenga commented 5 months ago

Neither the -c graph nor the non--c graph is incorrect. They're just different queries. When you query the graph without -c, you get only the nodes of the path itself. When you query it with -c 1, you also get those nodes' neighbors, which include the alternate alleles of SNP bubbles. If you query it with -c 2 you also get the neighbors' neighbors, etc. I suspect the graph queried with -c is closer to what you want, but there's no firm rule about how large -c needs to be to always get the alternate alleles.

I am saying this again: I strongly recommend using vg chunk with -S instead of vg find. If you do that, you are guaranteed to get all of the bubbles.

The types of snarls are described in the schema. This is an "ultrabubble", which means that it contains no cycles or inversions. https://github.com/vgteam/libvgio/blob/master/deps/vg.proto#L246-L251

pioneer-pi commented 5 months ago

@jeizenga Now I use vg pack to get a packfile, They are read coverage in each base of variation graph. I want to get all the reads mapping to a special base position.For examples, I know base which node_id=3 and node_offset=0 has a value of read coverage, which is 53. Can I get all the 53 reads?

seq.pos   node.id    node.offset     coverage
4                  3                 0                  53