Closed mb47 closed 1 year ago
Hi @mb47,
you can also try https://github.com/marschall-lab/panacus which calculates exactly what you have in mind. Example command line with example output:
wget -c https://zenodo.org/record/7937947/files/ecoli50.gfa.zst
zstd -d ecoli50.gfa.zst
RUST_LOG=info panacus histgrowth ecoli50.gfa -c bp -q 0,1,0.5,0.1 -t 16 > ecoli50.gfa.histgrowth.tsv
panacus-visualize.py -e -l "upper left" ecoli50.gfa.histgrowth.tsv > ecoli50.gfa.histgrowth.tsv.pdf
Then you can compare the γ-values from panacus with odgi heap.
Thanks for the prompt reply - that's excellent! I'll give panacus a shot too. Many thanks Micha
I just ran odgi heaps and panacus on the same data set, and par[2] gives you the alpha-value. That's the alternative power law model mentioned in the paper. Taking a look at the function in R that is being optimized, this also fits. The numbers don't match 100%, because the Rscript estimates the curve on the number of given permutations, panacus is able to directly calculate the true curve for all possible permutations. Once ready, I will share a tutorial with more details.
One more comment: For α > 1, the pan-genome is closed, and for α < 1, the pan-genome is open. Heaps’ law and power law are mathematically similar.
As also mentioned in https://en.wikipedia.org/wiki/Pan-genome#Classification, this threshold was chosen arbitrarily. So classify whatever makes you happy :P
I updated the Rscript in the master branch and it now displays the actual alpha value in the produced plot.
Excellent -- many thanks for the prompt help! Works a treat:
So this pan-genome is officially still open, with an alpha value of 0.197.
Hi @mb47,
sleeping over it, I realized, I made a mistake. What the RScript is giving you is the γ-value: For γ < 0, the pan-genome is closed, and its size approaches a constant as more genomes are used, whereas for γ > 0, the pan-genome is open, and its size increases as more genomes are included.
Taking a look at your value and the curve, you clearly have an open pangenome.
Sorry for this, I updated the Rscript accordingly.
Also thanks for the feedback and for testing this. It helped to improve odgi ;)
No problem - thanks for clearing this up!
Hi, thanks for providing ODGI -- it's a treasure trove of useful tools!
I have produced a pan-genome graph and would like to formally work out whether this pan-genome is open or closed. I have found a paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6491781/) that describes the calculation of the Heaps coefficient as follows:
"Heaps’ law is formulated as n = κNγ, where n is the pan-genome size, N is the number of genomes used, and κ and γ are the fitting parameters. The exponent γ determines whether the pan-genome has an open property. For γ < 0, the pan-genome is closed, and its size approaches a constant as more genomes are used, whereas for γ > 0, the pan-genome is open, and its size increases as more genomes are included."
I have run odgi heaps on my graph and then used the output from this as input into the heaps_fit.R script, but I am unsure of how to interpret its output:
Is one of the three values listed under "$par" the fitting parameter γ that I need? And does the script implement the formula as described in the paper above? Any help much appreciated.
The plot produced by the script would suggest that my pan-genome is open -- the curve doesn't seem to be levelling off asymptotically as I would expect for a closed pan-genome:
thanks Micha