pangenome / odgi

Optimized Dynamic Genome/Graph Implementation: understanding pangenome graphs
https://doi.org/10.1093/bioinformatics/btac308
MIT License

`odgi extract` is frustrating #478

Open subwaystation opened 1 year ago

subwaystation commented 1 year ago

Hi @AndreaGuarracino,

I was trying to extract grch38#chr1:13104252-13122521 from the Chr1 HPRC pangenome graph. However, it took a lot of trial and error until I somehow got what I wanted. Surprisingly, in some runs the odgi extract output occupied a lot of disk space. More details below.

Fetching the data

wget https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/chroms/chr1.hprc-v1.0-pggb.gfa.gz
gunzip chr1.hprc-v1.0-pggb.gfa.gz
odgi build -g chr1.hprc-v1.0-pggb.gfa -o chr1.hprc-v1.0-pggb.gfa.og -P -t 16

1st round

odgi extract -i chr1.hprc-v1.0-pggb.gfa.og -o chr1.test.d100.og -t 16 -P -r grch38#chr1:13104252-13122521 -d 100
du -h chr1.test.d100.og
973M    chr1.test.d100.og
odgi sort -i chr1.test.d100.og -o chr1.test.d100.og.O -O -P -t 16
du -h chr1.test.d100.og.O
864K    chr1.test.d100.og.O
odgi viz -i chr1.test.d100.og.O -o chr1.test.d100.og.O.png

How can the graph occupy ~1 GB on disk before optimization? Let's take a look at the actual subgraph we got.

[Image: chr1.test.d100.og.O.png]
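One way to put a number on the before/after difference is to compare the raw file sizes. The following is only a sketch in plain POSIX shell; `file_bytes` and `shrink_factor` are hypothetical helper names introduced here for illustration and are not part of odgi.

```shell
# Hypothetical helpers (not part of odgi) for comparing on-disk sizes.

file_bytes() {
  # Print the size of file $1 in bytes.
  wc -c < "$1" | tr -d ' '
}

shrink_factor() {
  # Print how many times larger file $1 is than file $2 (integer division).
  before=$(file_bytes "$1")
  after=$(file_bytes "$2")
  if [ "$after" -eq 0 ]; then
    echo "n/a"
  else
    echo $(( before / after ))
  fi
}

# Usage, assuming both files from the 1st round are in the current directory.
if [ -f chr1.test.d100.og ] && [ -f chr1.test.d100.og.O ]; then
  echo "shrink factor: $(shrink_factor chr1.test.d100.og chr1.test.d100.og.O)x"
fi
```

This only measures the symptom. It does not explain why the unoptimized file is so large.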

This doesn't look so bad, but all the paths are somehow scattered, so we miss the big picture. I therefore ran a lot of trials to understand what's going on.

2nd round

odgi extract -i chr1.hprc-v1.0-pggb.gfa.og -o chr1.test.c1000.og -t 16 -P -r grch38#chr1:13104252-13122521 -c 1000 -d 0
du -h chr1.test.c1000.og
975M    chr1.test.c1000.og
odgi sort -i chr1.test.c1000.og -o chr1.test.c1000.og.O -O
du -h chr1.test.c1000.og.O
2,5M    chr1.test.c1000.og.O
odgi viz -i chr1.test.c1000.og.O -o chr1.test.c1000.og.O.png

[Image: chr1.test.c1000.og.O.png]

There are now so many paths in the PNG that one can barely open it.

3rd round

odgi extract -i chr1.hprc-v1.0-pggb.gfa.og -o chr1.test.c1000d1000.og -t 16 -P -r grch38#chr1:13104252-13122521 -c 1000 -d 1000
odgi sort -i chr1.test.c1000d1000.og -o chr1.test.c1000d1000.og.O -O -P -t 16
odgi viz -i chr1.test.c1000d1000.og.O -o chr1.test.c1000d1000.og.O.png

[Image: chr1.test.c1000d1000.og.O.png]

Basically no difference from the 1st round. On close inspection, this makes sense: the path distances between, for example, the CHM13 subpaths are all much larger than 1000 nucleotides. It seems I should set -d 100000 so that I catch all the missing path parts.
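To put the `-d` values tried so far in perspective, it can help to compare them with the length of the requested region itself. This is a rough sanity check, not a rule from the odgi documentation. The sketch below assumes the region string has the `path:start-end` form used in the commands above, and `region_length` is a hypothetical helper name.

```shell
# Hypothetical helper: length in bp (end - start) of a "path:start-end" region.
region_length() {
  range=${1##*:}      # strip everything up to the last ':' -> "start-end"
  start=${range%-*}   # text before the '-'
  end=${range#*-}     # text after the '-'
  echo $(( end - start ))
}

region='grch38#chr1:13104252-13122521'
len=$(region_length "$region")

# Compare each -d value from rounds 1, 3 and 4 with the region length.
for d in 100 1000 100000; do
  if [ "$d" -lt "$len" ]; then
    echo "-d $d is shorter than the region ($len bp)"
  else
    echo "-d $d is at least as long as the region ($len bp)"
  fi
done
```

Of the three values, only `-d 100000` exceeds the span of the region.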

4th round

odgi extract -i chr1.hprc-v1.0-pggb.gfa.og -o chr1.test.c1000d100000.og -t 16 -P -r grch38#chr1:13104252-13122521 -c 1000 -d 100000
odgi sort -i chr1.test.c1000d100000.og -o chr1.test.c1000d100000.og.O -O -t 16 -P
odgi viz -i chr1.test.c1000d100000.og.O -o chr1.test.c1000d100000.og.O.png

[Image: chr1.test.c1000d100000.og.O.png]

Alright, this finally shows us that there is a lot of new sequence popping up in this GRCh38 reference region. That's why it is so hard to get a subgraph. Then I thought: what about -E? Here we go...

5th round

odgi extract -i chr1.hprc-v1.0-pggb.gfa.og -o chr1.test.E.og -t 16 -P -r grch38#chr1:13104252-13122521 -E -d 0
odgi sort -i chr1.test.E.og -o chr1.test.E.og.O -O -P -t 16
odgi viz -i chr1.test.E.og.O -o chr1.test.E.og.O.png

[Image: chr1.test.E.og.O.png]

This didn't work out at all: I see far too much of the pangenome compared to the region I wanted to extract. So maybe I need to run PG-SGD first?

odgi sort -i chr1.hprc-v1.0-pggb.gfa.og -o chr1.hprc-v1.0-pggb.og.Y -Y -t 28 -P

6th round

odgi extract -i chr1.hprc-v1.0-pggb.og.Y -o chr1.test.d100.og.Y -t 16 -P -r grch38#chr1:13104252-13122521 -d 100
du -h chr1.test.d100.og.Y
43M     chr1.test.d100.og.Y
odgi sort -i chr1.test.d100.og.Y -o chr1.test.d100.og.Y.O -O -P -t 16
du -h chr1.test.d100.og.Y.O
864K    chr1.test.d100.og.Y.O
odgi viz -i chr1.test.d100.og.Y.O -o chr1.test.d100.og.Y.O.png

Now the resulting graph is much smaller on disk, great! Locally, this is as good as it can get. However, we still lack the overall picture.

[Image: chr1.test.d100.og.Y.O.png]

7th round

odgi extract -i chr1.hprc-v1.0-pggb.og.Y -o chr1.test.d100000.og.Y -t 16 -P -r grch38#chr1:13104252-13122521 -d 100000
odgi sort -i chr1.test.d100000.og.Y -o chr1.test.d100000.og.Y.O -O -t 16 -P
odgi viz -i chr1.test.d100000.og.Y.O -o chr1.test.d100000.og.Y.O.png

[Image: chr1.test.d100000.og.Y.O.png]

I think this is as good as it can get if one is interested in the overall picture and not-so-fragmented paths.
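Every round above repeats the same extract, sort -O, viz sequence with different `odgi extract` options. A small wrapper makes the rounds easier to re-run and compare. This is only a sketch: `run` and `extract_round` are hypothetical helper names, the output naming (`PREFIX.og`, `PREFIX.og.O`, `PREFIX.og.O.png`) is an assumption that differs slightly from the file names used in rounds 6 and 7, and only flags that already appear in the commands above are used. By default it prints the commands instead of executing them; set `DRY_RUN=0` to actually run odgi.

```shell
# Hypothetical wrapper around the three commands used in every round.
# DRY_RUN=1 (default) prints the commands; DRY_RUN=0 executes them.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

extract_round() {
  # $1 = input graph, $2 = output prefix, $3 = region,
  # any further arguments are passed through to odgi extract.
  graph=$1
  prefix=$2
  region=$3
  shift 3
  run odgi extract -i "$graph" -o "$prefix.og" -t 16 -P -r "$region" "$@"
  run odgi sort -i "$prefix.og" -o "$prefix.og.O" -O -P -t 16
  run odgi viz -i "$prefix.og.O" -o "$prefix.og.O.png"
}

# Example: the settings of the 7th round, on the pre-sorted graph.
extract_round chr1.hprc-v1.0-pggb.og.Y chr1.test.d100000.Y \
  'grch38#chr1:13104252-13122521' -d 100000
```

Passing the extra options through unchanged keeps the wrapper independent of which combination of `-c`, `-d`, or `-E` is being tested.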

subwaystation commented 1 year ago

My takeaway here is: