pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License
196 stars 40 forks source link

odgi inject with overlapping bed entries #475

Closed henrivkgt closed 1 year ago

henrivkgt commented 1 year ago

Hello,

Recently I have run into an issue with odgi inject when trying to inject genes and exons as paths in a graph. When running odgi inject using a bed file with overlapping coordinates (for instance when adding two transcript variants of the same gene), only the second entry is added correctly. The first entry can still be found in the graph when running odgi paths, but it is given a length of 0, and using odgi extract on the corresponding region does not extract the path. Is this intended? My current workaround is to separate the bed file into multiple files with only non-overlapping entries, then running odgi inject on all of them adding the paths recursively to the graph, but it is a bit inefficient as the graph has to be completely written to disk every time.

Best, Henri

AndreaGuarracino commented 1 year ago

I can reproduce the issue. To me, it is triggered when two intervals share the same end coordinates:

echo -e "x\t1\t2\tA\nx\t1\t2\tB"
x   1   2   A
x   1   2   B
odgi inject -i test/t.gfa -b <(echo -e "x\t1\t2\tA\nx\t1\t2\tB") -o - | odgi view -i - -g | tail -n 3
P   x   1+,2+,3+,5+,7+,8+,10+,11+,13+,14+,16+,17+   *
P   A       *
P   B   2+  *

echo -e "x\t1\t2\tA\nx\t1\t2\tB"
x   1   2   A
x   1   3   B
odgi inject -i test/t.gfa -b <(echo -e "x\t1\t2\tA\nx\t1\t3\tB") -o - | odgi view -i - -g | tail -n 3
P   x   1+,2+,3+,4+,6+,8+,9+,11+,12+,14+,15+,17+,18+    *
P   A   2+  *
P   B   2+,3+   *

The issue is that paths are created, but no steps are added to them.

henrivkgt commented 1 year ago

I checked with my graph and indeed it only happens when two bed entries share the same end coordinate.

AndreaGuarracino commented 1 year ago

Thank you for confirming that!

Still investigating, and things are going worse.

cat t.gfa
H   VN:Z:1.0
S   1   ATCG
P   x   1+   *

odgi inject -i t.gfa -b <(echo -e "x\t1\t2\tA\nx\t1\t3\tB") -o - | odgi view -i - -g
H       VN:Z:1.0
S       1       A
L       1       +       2       +       0M
S       2       T
L       2       +       3       +       0M
S       3       C
L       3       +       4       +       0M
S       4       G
P       x       4-,3-,2-,1-     *
P       A       3-      *
P       B       3-,2-   *

odgi inject -i t.gfa -b <(echo -e "x\t1\t2\tA\nx\t1\t3\tB") -o - | odgi paths -i - -f
>x
CGAT
>A
G
>B
GA
AndreaGuarracino commented 1 year ago

Confusing (@ekg):

cat t.gfa
H   VN:Z:1.0
S   1   ATCG
P   x   1+   *

odgi view -i t.gfa -g
H       VN:Z:1.0
S       1       ATCG
P       x       1-      *
AndreaGuarracino commented 1 year ago

The last two strange behaviors were triggered by the presence of spaces instead of a TAB between 1+ and *, so it is all good(?).

@henrivkgt, I think I've fixed the issue in https://github.com/pangenome/odgi/pull/476. Could you please give it a try?

henrivkgt commented 1 year ago

I built the odgi from that branch, but running the command with this build gives me an error: [odgi::algorithms::inject_ranges] injection end point for interval 97103_geno_T0000001_exon_1 is not at a node boundary: off by 2305 vs 5796 on a node 260bp long

The offending line seems normal: 97103_genome_v2.5~1~1 2024 2305 97103_geno_T0000001_exon_1

with odgi from the release the command finishes without errors (but with the path bug)

AndreaGuarracino commented 1 year ago

Thank you for the prompt check. So there is still work to do.

Just in case it might be useful, could you please share your GFA and BED, or small parts of them that trigger the error?

henrivkgt commented 1 year ago

watermelon6_100kb.tar.gz Sure, I made a small example that triggers the error for me

AndreaGuarracino commented 1 year ago

I debugged a bit more and catched the error. I've tested the update also with your example too. Now it should be the good one!

henrivkgt commented 1 year ago

Seems good now. Thanks for fixing this!

AndreaGuarracino commented 1 year ago

Thank you for your feedback!