Open sivico26 opened 2 years ago
Hi again, I found a way.
It turned out to be similar to the first try, but it needed a coordinates conversion through odgi position
(see this tutorial for details):
odgi position -t $threads -i $pangenome -G <(echo -e "$node_start\n$node_end") -r $path | cut -f 2 | tail -n +2 | sed -r 's|.*,([0-9]+),.*|\1|' | paste -sd "-" > path_coords
odgi extract -t $threads -P -c 0 -n $node_start -i $pangenome -r $path:$(cat path_coords) -o subgraph.og
odgi paths -t $threads -f -i subgraph.og | grep -A 1 $(cat path_coords) > subseq.fasta
The postprocessing of the result of odgi position
extracts the equivalent path coordinates and glues them conveniently for odgi extract
. The postprocessing of the odgi paths
output selects the region of interest from the other sequences that might be returned (Not sure if it is safe to assume that the coordinates will always match).
This approach required some Unix-dexterity that might make it less discoverable. Also, the approach requires loading the entire graph twice, generating a subgraph, and loading it once. I still think there must be a more straightforward (and efficient) way to do this. If not, I am insisting on upgrading this to a feature request (see Thoughts in the previous comment).
PS: I also discovered that odgi extract
has a -q
option that, if I understood correctly, should allow specifying the node coordinates instead of the path coordinates (thus, making odgi position
unnecessary). However, when I tried to use this option, odgi extract
complained: [odgi::extract] error: please specify valid numbers for the pangenomic range
. I assume it was because in this case $node_start > $node_end
, which clearly does not make sense for a linear reference, but it might for a pangenomic reference. Simply reversing the order of the nodes did not work for me, as the sequences extracted in the end were very different from the ones I was looking for (The region I targeted was not there either).
Regards, Sivico
It seems you want something like this:
echo -e "18\n19\n20" > node_ids_to_extract.txt
echo "gi|568815592:32578768-32589835" > paths_to_extract.txt
odgi extract -i ~/git/odgi/test/DRB1-3123.gfa -d 0 -l node_ids_to_extract.txt -p paths_to_extract.txt -o subgraph.og
-l
specifies the nodes to extract from the input graph;-p
specifies the paths to consider in the extraction.Does this resolve your issue?
Hi @AndreaGuarracino,
I used
odgi paths -t $threads -f -i subgraph.og > seqs.fasta
To get the sequences of the subgraph. After trying your approach (worth mentioning, requires odgi v0.8
), I arrive at the following results.
If I only include node_start
and node_end
in the node_ids_to_extract.txt
, this is the result I got:
>Hcom|1|chr5H:441147985-441148034
GTGGTGAGAGAGATAGAGAGAGCTCTGGTGGGTGGCATTAATGAATGTC
If I include all the nodes between node_start
and node_end
along path
(the only entry in paths_to_extract.txt
), which in this case is a list of 8838 node ids, this results in:
Here we can clearly see that the result is completely fragmented. Furthermore, the extracted bases only sum up to 95807.
To give you an idea, my previous approach gives the following result:
Which is kind of what I expected, a long translocation involving two chromosomes of different species. It might be worth mentioning that this was done with odgi 0.7.3
.
So, unfortunately, no. The results differ vastly.
I also noticed that, even though my query targets a small part of the pangenome (around 0.00212%
of the bases), the size of subgraph.og
is still huge (11G
with my data, for all three cases) compared to the original graph (43G
). I imagine this is due to the fundamental data structure that is supporting all the operations, but it seems like a very unfavorable tradeoff/cost considering that it is just an intermediate file for this query.
Forgive me for insisting, but if we could go from pangenome_graph.og
$\rightarrow$ subsequence.fasta
, bypassing the generation of subgraph.og
that would be ideal. In my mind, the way of doing this would be to give odgi paths
some of the subsetting capabilities that odgi extract
already has (e.g. options -b
, -q
, -r
, etc.). In other words, to be able to extract a subset of a path and directly output a fasta sequence.
Of course, the above is wishful thinking. I have no idea if it is technically feasible. I would love to hear your thoughts about it though. If you need anything else, or if something was not clear, let me know.
Thanks for the support and maintaining odgi
.
Cheers
Hi @sivico26, I understand why this is happening. You need an odgi extract
feature that prevents path fragmentation, a feature that is not triggered when a node list guides the extraction.
Can you give me a (small) concrete example where your approach in here works, but the extraction by a node list leads to a fragmented graph? I mean files, nodes, etc...
I might have an idea to power up odgi
, and I would prefer to use one of your examples, if possible.
Hi @AndreaGuarracino,
As requested here is a use case:
DRB1-3123.gfa
2564
2637
gi|28212470:131613-146345
Using your method with odgi v0.8
I get the following result:
>gi|28212470:131613-146345:7865-7878
GTGAGTCTAGGCC
>gi|28212470:131613-146345:8169-8175
AAAAAA
Using my method with odgi v0.7.3
I get:
>gi|28212470:131613-146345:7865-8169
GTGAGTCTAGGCCGGGCGCGGTGGCTCATGCGTGTAATCCTAGCACTTTGGGAGACCAAAGCGGGTGGATTGCCTGAGCTCAGGAGTTCGAGACCAGCCTGGGCAACATGGTGAAACCCCATCTCTACTAAAATACAAAAAATTAGCCGGGCATGGCAGGGTGCACCTGTAATCCCAGCTACTGGGGAGGCTGAGACAGGAGAATCGCTTGAACCTGGGAGGCAGAGGTTGTAGTGAGCTGAGATGGAGCCACTGCACTCCAGTCTGGGCGACAGAGCAAGACCCCATCTCAAAAAAAAAAAAA
My method breaks with odgi v0.8
, I imagine due to the change in some options of the command line. Anyway, that goes beside the point, it works with v0.7.3
.
I hope this case is useful enough. Let me know if I can be of further help.
Hello there, thanks for developing
odgi
I want to make a query to my pangenome graph and I thought I could do it with
odgi
, but I have struggled to find a way of doing it.The query should be rather simple: I want to extract the sequence of a
path
from the nodenode_start
to the nodenode_end
. I know beforehand thatnode_start
andnode_end
define a contiguous segment onpath
. Bear in mind that I do not the coordinates of this segment in base space, only in node space.This is what I have tried:
First try
I tried to guide myself from the extraction tutorial
The result, unfortunately, has several paths (not only the one I am interested in) but more importantly, the sequence is way larger than I expected. I realized that
-r
might actually be expecting (correct me if I am wrong) base space coordinates, relative to the specified path.Second try
Similarly, I tried to combine
-l
and-p
parameters with extract:I thought this one would make the trick but the resulting subgraph did not include any paths! Only Segments and Links so, I could not extract any sequence.
Thoughts
I imagine there should be a better approach. Please let me know if I am missing it,
I do feel this query should be quite straightforward from a
.gfa
input. I also feel (but maybe I am being impatient) that odgi extract makes a lot of computations that might not be necesary for this particular query. In my mind, it could be as simple as concatenating the segments defined in that range following the path, and this should be possible to do inO(n)
maybe with an index. I could see the situation can get more problematic in cases where either the start or the end nodes are traversed multiple times (adding ambiguity to the desired range).Ideally, I imagine myself doing something like this:
Anyway this are just some thoughts.
I would appreciate to hear your thoughts on the best way of doing this. If there is not a goo way, maybe we can consider it a feature request.
Regards Sivico