pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License
194 stars 39 forks source link

Extracting a sequence defined by a range on a path in the node space #447

Open sivico26 opened 2 years ago

sivico26 commented 2 years ago

Hello there, thanks for developing odgi

I want to make a query to my pangenome graph and I thought I could do it with odgi, but I have struggled to find a way of doing it.

The query should be rather simple: I want to extract the sequence of a path from the node node_start to the node node_end. I know beforehand that node_start and node_end define a contiguous segment on path. Bear in mind that I do not the coordinates of this segment in base space, only in node space.

This is what I have tried:

First try

I tried to guide myself from the extraction tutorial

## Extract subgraph
odgi extract -t $threads -P -c 0 -n $node_start -i $pangenome -r $path:$node_start-$node_end -o subgraph.og
## Extract sub-sequence
odgi paths -t $threads -f -i subgraph.og > subseq.fasta

The result, unfortunately, has several paths (not only the one I am interested in) but more importantly, the sequence is way larger than I expected. I realized that -r might actually be expecting (correct me if I am wrong) base space coordinates, relative to the specified path.

Second try

Similarly, I tried to combine -l and -p parameters with extract:

odgi extract -t $threads -P -c 0 -n $node_start -i $pangenome -p <(echo $path) -l <(seq $node_start $node_end) -o subgraph.og

I thought this one would make the trick but the resulting subgraph did not include any paths! Only Segments and Links so, I could not extract any sequence.

Thoughts

I imagine there should be a better approach. Please let me know if I am missing it,

I do feel this query should be quite straightforward from a .gfa input. I also feel (but maybe I am being impatient) that odgi extract makes a lot of computations that might not be necesary for this particular query. In my mind, it could be as simple as concatenating the segments defined in that range following the path, and this should be possible to do in O(n) maybe with an index. I could see the situation can get more problematic in cases where either the start or the end nodes are traversed multiple times (adding ambiguity to the desired range).

Ideally, I imagine myself doing something like this:

odgi paths -n -f -p $path:$node_start-$node_end > subseq.fasta 
## -n would denote the node space coordinates
## -p would specify the path and coordinates to extract

Anyway this are just some thoughts.

I would appreciate to hear your thoughts on the best way of doing this. If there is not a goo way, maybe we can consider it a feature request.

Regards Sivico

sivico26 commented 2 years ago

Hi again, I found a way.

It turned out to be similar to the first try, but it needed a coordinates conversion through odgi position (see this tutorial for details):

odgi position -t $threads -i $pangenome -G <(echo -e "$node_start\n$node_end") -r $path | cut -f 2 | tail -n +2 | sed -r 's|.*,([0-9]+),.*|\1|' | paste -sd "-" > path_coords
odgi extract -t $threads -P -c 0 -n $node_start -i $pangenome -r $path:$(cat path_coords) -o subgraph.og
odgi paths -t $threads -f -i subgraph.og | grep -A 1 $(cat path_coords) > subseq.fasta

The postprocessing of the result of odgi position extracts the equivalent path coordinates and glues them conveniently for odgi extract. The postprocessing of the odgi paths output selects the region of interest from the other sequences that might be returned (Not sure if it is safe to assume that the coordinates will always match).

This approach required some Unix-dexterity that might make it less discoverable. Also, the approach requires loading the entire graph twice, generating a subgraph, and loading it once. I still think there must be a more straightforward (and efficient) way to do this. If not, I am insisting on upgrading this to a feature request (see Thoughts in the previous comment).

PS: I also discovered that odgi extract has a -q option that, if I understood correctly, should allow specifying the node coordinates instead of the path coordinates (thus, making odgi position unnecessary). However, when I tried to use this option, odgi extract complained: [odgi::extract] error: please specify valid numbers for the pangenomic range. I assume it was because in this case $node_start > $node_end, which clearly does not make sense for a linear reference, but it might for a pangenomic reference. Simply reversing the order of the nodes did not work for me, as the sequences extracted in the end were very different from the ones I was looking for (The region I targeted was not there either).

Regards, Sivico

AndreaGuarracino commented 1 year ago

It seems you want something like this:

echo -e "18\n19\n20" > node_ids_to_extract.txt
echo "gi|568815592:32578768-32589835" > paths_to_extract.txt
odgi extract -i ~/git/odgi/test/DRB1-3123.gfa -d 0 -l node_ids_to_extract.txt -p paths_to_extract.txt -o subgraph.og

Does this resolve your issue?

sivico26 commented 1 year ago

Hi @AndreaGuarracino,

I used

odgi paths -t $threads -f -i subgraph.og > seqs.fasta

To get the sequences of the subgraph. After trying your approach (worth mentioning, requires odgi v0.8), I arrive at the following results.

If I only include node_start and node_end in the node_ids_to_extract.txt, this is the result I got:

>Hcom|1|chr5H:441147985-441148034
GTGGTGAGAGAGATAGAGAGAGCTCTGGTGGGTGGCATTAATGAATGTC

If I include all the nodes between node_start and node_end along path (the only entry in paths_to_extract.txt), which in this case is a list of 8838 node ids, this results in:

seqs.fasta.txt

Here we can clearly see that the result is completely fragmented. Furthermore, the extracted bases only sum up to 95807.

To give you an idea, my previous approach gives the following result:

trans.fasta.txt

Which is kind of what I expected, a long translocation involving two chromosomes of different species. It might be worth mentioning that this was done with odgi 0.7.3.

So, unfortunately, no. The results differ vastly.

I also noticed that, even though my query targets a small part of the pangenome (around 0.00212% of the bases), the size of subgraph.og is still huge (11G with my data, for all three cases) compared to the original graph (43G). I imagine this is due to the fundamental data structure that is supporting all the operations, but it seems like a very unfavorable tradeoff/cost considering that it is just an intermediate file for this query.

Forgive me for insisting, but if we could go from pangenome_graph.og $\rightarrow$ subsequence.fasta, bypassing the generation of subgraph.og that would be ideal. In my mind, the way of doing this would be to give odgi paths some of the subsetting capabilities that odgi extract already has (e.g. options -b, -q, -r, etc.). In other words, to be able to extract a subset of a path and directly output a fasta sequence.

Of course, the above is wishful thinking. I have no idea if it is technically feasible. I would love to hear your thoughts about it though. If you need anything else, or if something was not clear, let me know.

Thanks for the support and maintaining odgi. Cheers

AndreaGuarracino commented 1 year ago

Hi @sivico26, I understand why this is happening. You need an odgi extract feature that prevents path fragmentation, a feature that is not triggered when a node list guides the extraction.

Can you give me a (small) concrete example where your approach in here works, but the extraction by a node list leads to a fragmented graph? I mean files, nodes, etc...

I might have an idea to power up odgi, and I would prefer to use one of your examples, if possible.

sivico26 commented 1 year ago

Hi @AndreaGuarracino,

As requested here is a use case:

Using your method with odgi v0.8 I get the following result:

>gi|28212470:131613-146345:7865-7878
GTGAGTCTAGGCC
>gi|28212470:131613-146345:8169-8175
AAAAAA

Using my method with odgi v0.7.3 I get:

>gi|28212470:131613-146345:7865-8169
GTGAGTCTAGGCCGGGCGCGGTGGCTCATGCGTGTAATCCTAGCACTTTGGGAGACCAAAGCGGGTGGATTGCCTGAGCTCAGGAGTTCGAGACCAGCCTGGGCAACATGGTGAAACCCCATCTCTACTAAAATACAAAAAATTAGCCGGGCATGGCAGGGTGCACCTGTAATCCCAGCTACTGGGGAGGCTGAGACAGGAGAATCGCTTGAACCTGGGAGGCAGAGGTTGTAGTGAGCTGAGATGGAGCCACTGCACTCCAGTCTGGGCGACAGAGCAAGACCCCATCTCAAAAAAAAAAAAA

My method breaks with odgi v0.8, I imagine due to the change in some options of the command line. Anyway, that goes beside the point, it works with v0.7.3.

I hope this case is useful enough. Let me know if I can be of further help.