pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License
191 stars 39 forks source link

`odgi extract` - please remove nodes that are not in specified BED range(s) #526

Open subwaystation opened 1 year ago

subwaystation commented 1 year ago

When extracting a subgraph using a BED file via -b the resulting subgraph contains all BED range(s) specified plus all nodes between these ranges which are not touched by any path! It would be nice to have an additional flag which allows for the removal of nodes and edges that are not touched by any path.

subwaystation commented 1 year ago

This is becoming a serious problem now @AndreaGuarracino. I am extraction some full range paths and want to sort and visualize the graph. However, the nodes and edges which are not visited by any path are disturbing the 1D visualization.

wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/chroms/chr6.hprc-v1.0-pggb.gfa.gz
gunzip chr6.hprc-v1.0-pggb.gfa.gz
odgi build -g chr6.hprc-v1.0-pggb.gfa -o chr6.hprc-v1.0-pggb.og -P -t 28
odgi paths -i chr6.hprc-v1.0-pggb.og -f > chr6.hprc-v1.0-pggb.fasta
samtools faidx chr6.hprc-v1.0-pggb.fasta
grep 'chm13#chr6\|grch38#chr6\|HG01928#2\|HG01071#2\|HG01106#2\|HG02622#2' chr6.hprc-v1.0-pggb.fasta.fai | cut -f 1 > 18_paths_to_extract.txt
grep 'chm13#chr6\|grch38#chr6\|HG01928#2\|HG01071#2\|HG01106#2\|HG02622#2' chr6.hprc-v1.0-pggb.fasta.fai | cut -f 1,2 | sed 's/\t/\t0\t/g' > 18_paths_to_extract.bed
odgi extract -i chr6.hprc-v1.0-pggb.og -o chr6.hprc-v1.0-pggb.18paths.og -E -b 18_paths_to_extract.bed -P -p 18_paths_to_extract.txt -t 28
odgi viz -i chr6.hprc-v1.0-pggb.18paths.og -o chr6.hprc-v1.0-pggb.18paths.og.png
bedtools merge -i /mnt/vol1/software/odgi/git/master/test/chr6.HLA_genes.bed -d 10000000 > MHC.bed
echo -e '\tMHC' >> MHC.bed
cat MHC.bed | tr '\n' ' ' | sed 's/ //g' > MHC.final.bed
odgi procbed -i chr6.hprc-v1.0-pggb.18paths.og -b MHC.final.bed > MHC.adj.bed
odgi inject -i chr6.hprc-v1.0-pggb.18paths.og -b MHC.adj.bed -o chr6.hprc-v1.0-pggb.18paths.MHC.og -P -t 28
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.png
odgi sort -i chr6.hprc-v1.0-pggb.18paths.MHC.og -r -P -t 28 -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.png -u -d
odgi sort -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.Y.og -p Y -t 28 -P
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.Y.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.Y.og.png -u -d
odgi paths -i chr6.hprc-v1.0-pggb.18paths.MHC.og -L | grep "grch" > chr6.grch38.MHC.name
odgi sort -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.HY.og -P -t 28 -H chr6.grch38.MHC.name -Y
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.HY.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.HY.og.png -d -u

default viz chr6 hprc-v1 0-pggb 18paths MHC og random sort chr6 hprc-v1 0-pggb 18paths MHC og r og PG-SGD chr6 hprc-v1 0-pggb 18paths MHC og r og Y og RPG-SGD chr6 hprc-v1 0-pggb 18paths MHC og r og HY og

Could you please fix this or tell me where to change what in the code @AndreaGuarracino. Thanks!

subwaystation commented 1 year ago
panacus table -c nodes -a chr6.hprc-v1.0-pggb.18paths.MHC.gfa > chr6.hprc-v1.0-pggb.18paths.MHC.gfa.nodes.txt
wc -l chr6.hprc-v1.0-pggb.18paths.MHC.gfa.nodes.txt
4699147
grep -P "\t0" chr6.hprc-v1.0-pggb.18paths.MHC.gfa.nodes.txt | wc -l
1248659

Around 1/4 of nodes are not visited by any path.

subwaystation commented 1 year ago
panacus table -c edges -a chr6.hprc-v1.0-pggb.18paths.MHC.gfa > chr6.hprc-v1.0-pggb.18paths.MHC.gfa.edges.txt
wc -l chr6.hprc-v1.0-pggb.18paths.MHC.gfa.edges.txt
6552109
grep -P "\t0" chr6.hprc-v1.0-pggb.18paths.MHC.gfa.edges.txt | wc -l
2554251

Around 1/3 of edges are not followed by any path.

subwaystation commented 1 year ago

Maybe this is related to how you fixed this problem in smoothxg @AndreaGuarracino ?

AndreaGuarracino commented 1 year ago

This is a problem of odgi extract interface that is not super intuitive. You want

        --paths-to-extract=[FILE]         List of paths to keep in the extracted
                                          graph. The FILE must contain one path
                                          name per line and a subset of all
                                          paths can be specified. Paths
                                          specified in the input path ranges
                                          (with -r/--path-range and/or
                                          -b/--bed-file) will be kept in any
                                          case.

so try

odgi extract ... -p <(cut -f 1 18_paths_to_extract.bed | sort | uniq)

subwaystation commented 1 year ago

I already am using -p, please take an exact look at my command above.

odgi extract -i chr6.hprc-v1.0-pggb.og -o chr6.hprc-v1.0-pggb.tomato.og -P -p 18_paths_to_extract.txt -t 28
ls -lisah chr6.hprc-v1.0-pggb.tomato.og
192399413 4.0K -rw-rw-r-- 1 ubuntu ubuntu 60 Aug 28 14:35 chr6.hprc-v1.0-pggb.tomato.og
cat 18_paths_to_extract.txt
chm13#chr6
grch38#chr6
HG01071#2#JAHBCE010000006.1
HG01071#2#JAHBCE010000042.1
HG01071#2#JAHBCE010000076.1
HG01071#2#JAHBCE010000081.1
HG01106#2#JAHAMB010000019.1
HG01106#2#JAHAMB010000032.1
HG01106#2#JAHAMB010000054.1
HG01106#2#JAHAMB010000082.1
HG01928#2#JAGYVP010000017.1
HG01928#2#JAGYVP010000027.1
HG01928#2#JAGYVP010000201.1
HG02622#2#JAHAON010000002.1
HG02622#2#JAHAON010000041.1
HG02622#2#JAHAON010000058.1
HG02622#2#JAHAON010000190.1
HG02622#2#JAHAON010000229.1

As expected, your suggestion only gives me an empty graph.

AndreaGuarracino commented 1 year ago

Can you also try without -E?

AndreaGuarracino commented 1 year ago

About the edges not supported by any path, yes, it is related to a recent bug I've fixed in smoothxg (https://github.com/pangenome/smoothxg/pull/194).

subwaystation commented 1 year ago

Leaving out -E does the trick. Thanks. Leaving this open so you can fix the bug.