marschall-lab / panacus

Panacus is a tool for computing statistics for GFA-formatted pangenome graphs
MIT License

panacus-visualize.py is overwhelmed by 1000 haplotypes #21

Closed by subwaystation 4 months ago

subwaystation commented 5 months ago

Hi there :) I applied panacus-visualize.py to a histgrowth output of 1000 haplotypes, but the PDF does not show any colors and has some weird x-axis labels:

[image: screenshot of the resulting PDF]

The TSV input is available for 10 days at https://fex.belwue.de/fop/rFYpUmCn/chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv.

panacus-visualize.py -e -l "lower right" chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv > chr19.1000.fa.gz.gfaffix.unchop.Ygs.og.crush.gfa.histgrowth.tsv.pdf

Next I want to try it on a data set with ~2k sequences. Thanks for any feedback :)

danydoerr commented 5 months ago

The visualization is not made for 1000 haplotypes. I suggest you make your own plot; a simple line plot will do just fine! In fact, panacus itself is not yet optimized for a pangenome of this size, but an implementation is underway...

subwaystation commented 5 months ago

Do you have the HEX of all these colors somewhere for me? So I can at least make it look like it came from panacus ;)

danydoerr commented 5 months ago

I suggest the following: Take the panacus-visualize script, dump it in a Jupyter Notebook, re-use the functions, and change those that you want to improve on. The script is simple and easy to understand. That's at least what I am doing if I need to customize panacus output. At some point, I should provide such a notebook in the repository.

subwaystation commented 5 months ago

I went for R, that's why I am asking :bowtie: I see you are using a Seaborn color palette. I will get it going somehow!

danydoerr commented 5 months ago

There you go:

const PCOLORS = ['#f77189', '#bb9832', '#50b131', '#36ada4', '#3ba3ec', '#e866f4'];

Source: https://github.com/marschall-lab/panacus/blob/f2a1ca8278ac4e087acfec5ea471aff072b1fa34/etc/lib.js#L5C1-L5C84
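For anyone reusing these colors outside the panacus script, here is a minimal Python sketch of how the six-color palette could be applied to more series than it has colors. The helper names `hex_to_rgb` and `series_color` are hypothetical and are not part of panacus; only the hex values come from the comment above.

```python
# Hex values copied from the PCOLORS constant quoted in this thread.
PCOLORS = ['#f77189', '#bb9832', '#50b131', '#36ada4', '#3ba3ec', '#e866f4']


def hex_to_rgb(color):
    """Convert a '#rrggbb' string to an (r, g, b) tuple of floats in [0, 1]."""
    color = color.lstrip('#')
    return tuple(int(color[i:i + 2], 16) / 255 for i in (0, 2, 4))


def series_color(index, palette=PCOLORS):
    """Return the palette color for series `index`, wrapping around
    when there are more series than colors (hypothetical helper)."""
    return palette[index % len(palette)]


if __name__ == "__main__":
    # Show which color each of the first eight series would receive.
    for i in range(8):
        print(i, series_color(i), hex_to_rgb(series_color(i)))
```

With only six colors, cycling means series 0 and series 6 share a color, so for hundreds of curves a single color per group (or one line per quorum setting) is probably easier to read.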

subwaystation commented 5 months ago

Alright, now panacus itself seems to be overwhelmed:

RUST_LOG=info panacus histgrowth ecoli2146.pan.explode.0.og.crush.gfa -c bp -q 0,1,0.5,0.1 -t 28 > ecoli2146.pan.explode.0.og.crush.gfa.histgrowth.tsv

This results in a huge number of NaN values in the resulting TSV. Any ideas? Attached: ecoli2146.pan.explode.0.og.crush.gfa.histgrowth.txt

The lengths of the paths vary a lot; maybe this is the problem?

danydoerr commented 5 months ago

I'm surprised that this doesn't work, and I suspect it's a fixable bug. @lucaparmigiani what do you think?

lucaparmigiani commented 4 months ago

Thanks Simon for letting us know about the NaN!

It was indeed a bug, and your graph was causing an f64 overflow!! Now the values are handled better and it is fixed. You can run it on your graph :)
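As a generic illustration of how an f64 overflow can end up as NaN in the output: the thread does not say where in panacus the overflow occurred or how it was fixed, so the sketch below only assumes, hypothetically, that a ratio of large binomial coefficients is computed in double precision. The function names are made up for this example. Python floats are IEEE 754 doubles, the same representation as Rust's f64.

```python
import math


def binom_float(n, k):
    """Naive binomial coefficient in double precision.
    Silently becomes inf once the value exceeds about 1.8e308."""
    result = 1.0
    for i in range(1, k + 1):
        result = result * (n - k + i) / i
    return result


def log_binom(n, k):
    """Natural log of the binomial coefficient via log-gamma,
    which stays in range even when the coefficient itself does not."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


# Hypothetical sizes, loosely modelled on a graph with 2146 paths.
n, m, k = 2146, 1073, 10

# Both coefficients overflow to inf, and inf / inf is NaN.
naive = binom_float(n - k, m) / binom_float(n, m)

# Subtracting logarithms first keeps every intermediate value small.
stable = math.exp(log_binom(n - k, m) - log_binom(n, m))

if __name__ == "__main__":
    print("naive :", naive)
    print("stable:", stable)
```

Here `naive` is NaN, while `stable` is a finite value slightly below 0.5**10. Working in log space is one common way to avoid this kind of overflow; whether panacus uses it is not stated in this thread.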

danydoerr commented 4 months ago

Thanks @lucaparmigiani

subwaystation commented 4 months ago

Indeed this solved the issue, thanks!