Open brettChapman opened 3 years ago
Hi Brett,
I'm not sure how suitable it would be for your workflow with vg/sequenceTubeMap
, but you could try odgi extract
to extract subgraphs. Its interface is very similar to that of vg find/chunk
.
Assuming you have converted your graphs to the odgi
format, a basic command would be:
odgi extract -i barley_pangenome_1H_augmented.og -t 16 -r Morex_v2_chr1H:252162-254791 -o barley_pangenome_1H_augmented_Morex_v2_chr1H_252162-254791.og
This would extract all nodes crossed by the Morex_v2_chr1H
path in the 252162-254791
range, and the portions of the other paths that traverse those nodes.
If your graph is not particularly complex or crazy, you can add -E/--full-range
to the command to extract a more dense piece: the Morex_v2_chr1H:252162-254791
path range will define a node_id_min
and a node_id_max
, and -E
will force to collect all nodes in the sorted order of the graph between these min and max. This assumes that the graph is well sorted and that the node ID space is compact.
odgi extract
is pretty fast, but I want to point out that it works without indexing the graph, so it may not be suitable if you have to make a lot of consecutive extractions.
Thanks @AndreaGuarracino I'll give odgi extract a shot. My graph is a smoothed graph outputted from PGGB. I've embedded genes into the graph from the GFF of Morex v2. I would have to iterate over every gene coordinate to generate many sub-graphs. I'll give it a try anyway, and see how the resulting graphs look. I'll provide an update after I've tried it out.
vg chunk
should work fine for this. A couple of things:
-c 0
. as this means it will extract only the linear path range you specify. the higher you set -c
, the more variation it will pull in from outside the path-T
option tells chunk to pull in paths from a GBWT. if you have a GBWT, you need to specify it with -G
, otherwise probably best to leave out the -T
.Otherwise, chunk
will pull out every other path in your graph that touches any node in the subgraph it returns. If you have a path in your xg
that touches a node in the graph that chunk
outputs, but this path does not appear in the output graph, please report a bug.
One last thing is that chunk
can change the names of all paths but the one passed with -p
. They will get suffixes in []
reflecting the fact that they are fragments.
Thanks @glennhickey, I'll have another shot at vg chunk. In sequenceTubeMap vg chunk was using -c 20. I assumed it meant 20 nodes either side of the coordinates. I didn't realise it would also include nodes outside the path.
I set -T because it was also set by sequenceTubeMap when it ran vg chunk. I don't have any GBWT files, only my indexed graph.
Yes, I've noticed when only partial paths are in the graph, they have a suffix of [].
Which value should I use for -c? Should I define -c as the number of genes/exons/CDS and other genome paths expected for that gene region? I could get this information from the GFF for that gene and add the genome count.
I've now ran with:
vg chunk -x barley_pangenome_1H_augmented.xg -c 20 -p Morex_v2_chr1H:252162-254791 -O pg
I initially used a value for -c based on the gene feature counts and genome count, but the graph was too complex to query efficiently by sequenceTubeMap. I'm now using -c 20 as that is what sequenceTubeMap seems to use by default. I expect running with anything higher than 20 would be wasted, as any variation would then be filtered out by sequenceTubeMap.
I was expecting to see paths from other genomes intersecting with the first few sub-graphs I looked at. I imagine I didn't see these due to the mapping percent identity of 95% I used with PGGB. After processing several more genes, I'm now seeing a lot more variation even with other genes and genome regions being pulled into the alignment as well due to high homology with other regions across the pangenome. Prior to running chunk on the whole pangenome graph, I was simply aligning select segments of the genome from identified orthologous genes, which produced simple graphs and was based on the known divergence of the gene in question. These sub-graphs of the whole graph can be a lot more complex than simple graphs produced on single gene regions.
I may have to lower the -c value from 20. Some sub-graphs are still too complex for sequenceTubeMap to query. I have one sub-graph with >200K paths. I posted an issue to sequenceTubeMap the other day, asking about pre-caching the path list for each graph, instead of calling vg paths each time, to speed up the response time and avoid timeout errors: https://github.com/vgteam/sequenceTubeMap/issues/124
Hi
I have a whole chromosome graph from a pangenome alignment which includes paths along different genomes (20). I've embedded genes/exon/CDS into the graph based on a single path (the known and well annotated reference Morex v2). I want to slice up the graph around each of the genes based on their coordinates, but I still want to include all intersecting nodes and paths from all the other genomes to visualise the variations in these regions. My intent is to visualise in sequenceTubeMap each of these sub-graphs. I've tried visualising the whole pangenome graph, but sequenceTubeMap times out because the requests to vg paths and vg chunk take too long to execute, due to the size of the graph and the large number of paths.
I've tried:
But the only paths I find in the graph are from Morex v2 and the embedded gene.
Should I be using vg find instead for such a task?
I'm unsure of what the best approach to take is. I tried following the commands which sequenceTubeMap uses with vg chunk (that I see outputted to terminal), however there must be something else going on behind the scenes which generates the resulting tube map.
Thank you for any help you can provide.