pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License
194 stars 39 forks source link

Position command doesn't output all paths #440

Open ASLeonard opened 2 years ago

ASLeonard commented 2 years ago

This is all done using the latest docker image.

I'm trying to lift coordinates through a pggb graph from one assembly to another with odgi position. This works fine for a given path (where regions.bed contains coordinates for ASM_1)

odgi position -i pggb.gfa -b regions.bed -r ASM_2

However, the documentation leads me to believe using -R ref_paths.txt containing ASM_2, ASM_3, ASM_4, etc. would then give output for every bed position for every ref_path in the -R file. However, it still only takes the first entry from the ref_paths. Similarly in the source code for odgi position, it seems to suggest not providing any ref_path name or file should default to using all paths in the graph https://github.com/pangenome/odgi/blob/bd54ab80ea03d864c82bd05599bf1dfa1777a72f/src/subcommand/position_main.cpp#L137-L139

Am I misunderstanding the approach here? Currently the working solution is just to invoke N calls of odgi position to liftover the entire bed file for the N other assemblies, but that seems inefficient.

Thanks, Alex

AndreaGuarracino commented 2 years ago

Hi @ASLeonard, would you please provide a (little if possible) reproducible example of your problem?

ASLeonard commented 2 years ago

Even a subsetted og file was quite large, so if you need it I can provide it elsewhere. But the steps are

odgi paths -i pggb.smooth.og -L -l
#UCD    1   51098607
#Angus  1   51138358
#Highland   1   51174331
#OBV    1   52178328
#Brahman    1   51254507

cat breeds.txt
#UCD
#Angus
#Highland
#OBV
#Brahman

cat regions.bed
#UCD     1000   10000
#UCD     200000  3000000

odgi position -i pggb.smooth.og -R breeds.txt -b regions.bed
#UCD    1000    10000   UCD,1000,+     UCD,10000,+  +
#UCD    200000  3000000 UCD,200000,+    UCD,3000000,+   +

Where I would expect the additional lines

...
#UCD    200000  3000000 Highland,422079,+   Highland,3211922,+  + 
...

which I can get by manually running

for breed in UCD Angus Highland...
do
  odgi position -i pggb.smooth.og -r $breed -b regions.bed
done
ASLeonard commented 2 years ago

Was looking a bit more into this but didn't make much progress. But here is some more information

If you don't provide -r/-R, all paths do seem to be available (i.e. ref_path_set=#paths). However, there are still only 2 * number of bed lines calls (one for start and one for end) of get_position, which seems to produce a single line of output. I'm not very clear on the step_handle/path_handle, so I'm not sure if it is even possible for a single call of get_position to generate the bed coordinate lifted into all reference paths.

If I commented out this line https://github.com/pangenome/odgi/blob/984aee9d2d0649cc73d0151de3c268ade2988958/src/subcommand/position_main.cpp#L651

I could at least get all the path names printed, but probably would require some work to get that as printing out the lifted over positions for them all.