Slicing a graph using vg chunk or vg find

vgteam / vg

tools for working with genome variation graphs

https://biostars.org/tag/vg/

Other

1.13k stars 195 forks source link

Slicing a graph using vg chunk or vg find #3258

Open brettChapman opened 3 years ago

brettChapman commented 3 years ago

I have a whole chromosome graph from a pangenome alignment which includes paths along different genomes (20). I've embedded genes/exon/CDS into the graph based on a single path (the known and well annotated reference Morex v2). I want to slice up the graph around each of the genes based on their coordinates, but I still want to include all intersecting nodes and paths from all the other genomes to visualise the variations in these regions. My intent is to visualise in sequenceTubeMap each of these sub-graphs. I've tried visualising the whole pangenome graph, but sequenceTubeMap times out because the requests to vg paths and vg chunk take too long to execute, due to the size of the graph and the large number of paths.

I've tried:

vg chunk -x barley_pangenome_1H_augmented.xg -c 0 -p Morex_v2_chr1H:252162-254791 -T -E vg_chunks/Morex_v2_chr1H_252162-254791_regions.tsv -O pg > vg_chunks/barley_pangenome_1H_augmented_Morex_v2_chr1H_252162-254791.pg

But the only paths I find in the graph are from Morex v2 and the embedded gene.

Should I be using vg find instead for such a task?

vg find -p Morex_v2_chr1H:252162-254791 -x barley_pangenome_1H_augmented.xg -c 0

I'm unsure of what the best approach to take is. I tried following the commands which sequenceTubeMap uses with vg chunk (that I see outputted to terminal), however there must be something else going on behind the scenes which generates the resulting tube map.

Thank you for any help you can provide.

AndreaGuarracino commented 3 years ago

Hi Brett,

I'm not sure how suitable it would be for your workflow with vg/sequenceTubeMap, but you could try odgi extract to extract subgraphs. Its interface is very similar to that of vg find/chunk.

Assuming you have converted your graphs to the odgi format, a basic command would be:

odgi extract -i barley_pangenome_1H_augmented.og -t 16 -r Morex_v2_chr1H:252162-254791 -o barley_pangenome_1H_augmented_Morex_v2_chr1H_252162-254791.og

This would extract all nodes crossed by the Morex_v2_chr1H path in the 252162-254791 range, and the portions of the other paths that traverse those nodes.

If your graph is not particularly complex or crazy, you can add -E/--full-range to the command to extract a more dense piece: the Morex_v2_chr1H:252162-254791 path range will define a node_id_min and a node_id_max, and -E will force to collect all nodes in the sorted order of the graph between these min and max. This assumes that the graph is well sorted and that the node ID space is compact.

odgi extract is pretty fast, but I want to point out that it works without indexing the graph, so it may not be suitable if you have to make a lot of consecutive extractions.

brettChapman commented 3 years ago

Thanks @AndreaGuarracino I'll give odgi extract a shot. My graph is a smoothed graph outputted from PGGB. I've embedded genes into the graph from the GFF of Morex v2. I would have to iterate over every gene coordinate to generate many sub-graphs. I'll give it a try anyway, and see how the resulting graphs look. I'll provide an update after I've tried it out.

glennhickey commented 3 years ago

vg chunk should work fine for this. A couple of things:

you won't see any variation with -c 0. as this means it will extract only the linear path range you specify. the higher you set -c, the more variation it will pull in from outside the path
the -T option tells chunk to pull in paths from a GBWT. if you have a GBWT, you need to specify it with -G, otherwise probably best to leave out the -T.

Otherwise, chunk will pull out every other path in your graph that touches any node in the subgraph it returns. If you have a path in your xg that touches a node in the graph that chunk outputs, but this path does not appear in the output graph, please report a bug.

One last thing is that chunk can change the names of all paths but the one passed with -p. They will get suffixes in [] reflecting the fact that they are fragments.

brettChapman commented 3 years ago

Thanks @glennhickey, I'll have another shot at vg chunk. In sequenceTubeMap vg chunk was using -c 20. I assumed it meant 20 nodes either side of the coordinates. I didn't realise it would also include nodes outside the path.

I set -T because it was also set by sequenceTubeMap when it ran vg chunk. I don't have any GBWT files, only my indexed graph.

Yes, I've noticed when only partial paths are in the graph, they have a suffix of [].

Which value should I use for -c? Should I define -c as the number of genes/exons/CDS and other genome paths expected for that gene region? I could get this information from the GFF for that gene and add the genome count.

brettChapman commented 3 years ago

I've now ran with:

vg chunk -x barley_pangenome_1H_augmented.xg -c 20 -p Morex_v2_chr1H:252162-254791 -O pg

I initially used a value for -c based on the gene feature counts and genome count, but the graph was too complex to query efficiently by sequenceTubeMap. I'm now using -c 20 as that is what sequenceTubeMap seems to use by default. I expect running with anything higher than 20 would be wasted, as any variation would then be filtered out by sequenceTubeMap.

I was expecting to see paths from other genomes intersecting with the first few sub-graphs I looked at. I imagine I didn't see these due to the mapping percent identity of 95% I used with PGGB. After processing several more genes, I'm now seeing a lot more variation even with other genes and genome regions being pulled into the alignment as well due to high homology with other regions across the pangenome. Prior to running chunk on the whole pangenome graph, I was simply aligning select segments of the genome from identified orthologous genes, which produced simple graphs and was based on the known divergence of the gene in question. These sub-graphs of the whole graph can be a lot more complex than simple graphs produced on single gene regions.

brettChapman commented 3 years ago

I may have to lower the -c value from 20. Some sub-graphs are still too complex for sequenceTubeMap to query. I have one sub-graph with >200K paths. I posted an issue to sequenceTubeMap the other day, asking about pre-caching the path list for each graph, instead of calling vg paths each time, to speed up the response time and avoid timeout errors: https://github.com/vgteam/sequenceTubeMap/issues/124