Open subwaystation opened 1 year ago
This is becoming a serious problem now @AndreaGuarracino. I am extraction some full range paths and want to sort and visualize the graph. However, the nodes and edges which are not visited by any path are disturbing the 1D visualization.
wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/chroms/chr6.hprc-v1.0-pggb.gfa.gz
gunzip chr6.hprc-v1.0-pggb.gfa.gz
odgi build -g chr6.hprc-v1.0-pggb.gfa -o chr6.hprc-v1.0-pggb.og -P -t 28
odgi paths -i chr6.hprc-v1.0-pggb.og -f > chr6.hprc-v1.0-pggb.fasta
samtools faidx chr6.hprc-v1.0-pggb.fasta
grep 'chm13#chr6\|grch38#chr6\|HG01928#2\|HG01071#2\|HG01106#2\|HG02622#2' chr6.hprc-v1.0-pggb.fasta.fai | cut -f 1 > 18_paths_to_extract.txt
grep 'chm13#chr6\|grch38#chr6\|HG01928#2\|HG01071#2\|HG01106#2\|HG02622#2' chr6.hprc-v1.0-pggb.fasta.fai | cut -f 1,2 | sed 's/\t/\t0\t/g' > 18_paths_to_extract.bed
odgi extract -i chr6.hprc-v1.0-pggb.og -o chr6.hprc-v1.0-pggb.18paths.og -E -b 18_paths_to_extract.bed -P -p 18_paths_to_extract.txt -t 28
odgi viz -i chr6.hprc-v1.0-pggb.18paths.og -o chr6.hprc-v1.0-pggb.18paths.og.png
bedtools merge -i /mnt/vol1/software/odgi/git/master/test/chr6.HLA_genes.bed -d 10000000 > MHC.bed
echo -e '\tMHC' >> MHC.bed
cat MHC.bed | tr '\n' ' ' | sed 's/ //g' > MHC.final.bed
odgi procbed -i chr6.hprc-v1.0-pggb.18paths.og -b MHC.final.bed > MHC.adj.bed
odgi inject -i chr6.hprc-v1.0-pggb.18paths.og -b MHC.adj.bed -o chr6.hprc-v1.0-pggb.18paths.MHC.og -P -t 28
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.png
odgi sort -i chr6.hprc-v1.0-pggb.18paths.MHC.og -r -P -t 28 -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.png -u -d
odgi sort -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.Y.og -p Y -t 28 -P
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.Y.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.Y.og.png -u -d
odgi paths -i chr6.hprc-v1.0-pggb.18paths.MHC.og -L | grep "grch" > chr6.grch38.MHC.name
odgi sort -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.HY.og -P -t 28 -H chr6.grch38.MHC.name -Y
odgi viz -i chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.HY.og -o chr6.hprc-v1.0-pggb.18paths.MHC.og.r.og.HY.og.png -d -u
default viz random sort PG-SGD RPG-SGD
Could you please fix this or tell me where to change what in the code @AndreaGuarracino. Thanks!
panacus table -c nodes -a chr6.hprc-v1.0-pggb.18paths.MHC.gfa > chr6.hprc-v1.0-pggb.18paths.MHC.gfa.nodes.txt
wc -l chr6.hprc-v1.0-pggb.18paths.MHC.gfa.nodes.txt
4699147
grep -P "\t0" chr6.hprc-v1.0-pggb.18paths.MHC.gfa.nodes.txt | wc -l
1248659
Around 1/4 of nodes are not visited by any path.
panacus table -c edges -a chr6.hprc-v1.0-pggb.18paths.MHC.gfa > chr6.hprc-v1.0-pggb.18paths.MHC.gfa.edges.txt
wc -l chr6.hprc-v1.0-pggb.18paths.MHC.gfa.edges.txt
6552109
grep -P "\t0" chr6.hprc-v1.0-pggb.18paths.MHC.gfa.edges.txt | wc -l
2554251
Around 1/3 of edges are not followed by any path.
Maybe this is related to how you fixed this problem in smoothxg
@AndreaGuarracino ?
This is a problem of odgi extract
interface that is not super intuitive. You want
--paths-to-extract=[FILE] List of paths to keep in the extracted
graph. The FILE must contain one path
name per line and a subset of all
paths can be specified. Paths
specified in the input path ranges
(with -r/--path-range and/or
-b/--bed-file) will be kept in any
case.
so try
odgi extract ... -p <(cut -f 1 18_paths_to_extract.bed | sort | uniq)
I already am using -p
, please take an exact look at my command above.
odgi extract -i chr6.hprc-v1.0-pggb.og -o chr6.hprc-v1.0-pggb.tomato.og -P -p 18_paths_to_extract.txt -t 28
ls -lisah chr6.hprc-v1.0-pggb.tomato.og
192399413 4.0K -rw-rw-r-- 1 ubuntu ubuntu 60 Aug 28 14:35 chr6.hprc-v1.0-pggb.tomato.og
cat 18_paths_to_extract.txt
chm13#chr6
grch38#chr6
HG01071#2#JAHBCE010000006.1
HG01071#2#JAHBCE010000042.1
HG01071#2#JAHBCE010000076.1
HG01071#2#JAHBCE010000081.1
HG01106#2#JAHAMB010000019.1
HG01106#2#JAHAMB010000032.1
HG01106#2#JAHAMB010000054.1
HG01106#2#JAHAMB010000082.1
HG01928#2#JAGYVP010000017.1
HG01928#2#JAGYVP010000027.1
HG01928#2#JAGYVP010000201.1
HG02622#2#JAHAON010000002.1
HG02622#2#JAHAON010000041.1
HG02622#2#JAHAON010000058.1
HG02622#2#JAHAON010000190.1
HG02622#2#JAHAON010000229.1
As expected, your suggestion only gives me an empty graph.
Can you also try without -E
?
About the edges not supported by any path, yes, it is related to a recent bug I've fixed in smoothxg
(https://github.com/pangenome/smoothxg/pull/194).
Leaving out -E
does the trick. Thanks. Leaving this open so you can fix the bug.
When extracting a subgraph using a BED file via
-b
the resulting subgraph contains all BED range(s) specified plus all nodes between these ranges which are not touched by any path! It would be nice to have an additional flag which allows for the removal of nodes and edges that are not touched by any path.