Interpreting output from heaps_fit.R

mb47 commented 1 year ago

Hi, thanks for providing ODGI -- it's a treasure trove of useful tools!

I have produced a pan-genome graph and would like to formally work out whether this pan-genome is open or closed. I have found a paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6491781/) that describes the calculation of the Heaps coefficient as follows:

"Heaps’ law is formulated as n = κNγ, where n is the pan-genome size, N is the number of genomes used, and κ and γ are the fitting parameters. The exponent γ determines whether the pan-genome has an open property. For γ < 0, the pan-genome is closed, and its size approaches a constant as more genomes are used, whereas for γ > 0, the pan-genome is open, and its size increases as more genomes are included."

I have run odgi heaps on my graph and then used the output from this as input into the heaps_fit.R script, but I am unsure of how to interpret its output:

[...]
$par
[1] 0.38615658 0.19730541 0.09167001

$value
[1] 2.162547

$counts
function gradient
     107      107

$convergence
[1] 0

$message
[1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

[1] 4050096895
[1] 8737461944
[1] 3374028389
[1] 1723948547
[1] 800963215
[1] 20695431
[1] 494480333

Is one of the three values listed under "$par" the fitting parameter γ that I need? And does the script implement the formula as described in the paper above? Any help much appreciated.

The plot produced by the script would suggest that my pan-genome is open -- the curve doesn't seem to be levelling off asymptotically as I would expect for a closed pan-genome:

heaps

thanks Micha

subwaystation commented 1 year ago

Hi @mb47,

you can also try https://github.com/marschall-lab/panacus which calculates exactly what you have in mind. Example command line with example output:

wget -c https://zenodo.org/record/7937947/files/ecoli50.gfa.zst
zstd -d ecoli50.gfa.zst
RUST_LOG=info panacus histgrowth ecoli50.gfa -c bp -q 0,1,0.5,0.1 -t 16 > ecoli50.gfa.histgrowth.tsv
panacus-visualize.py -e -l "upper left" ecoli50.gfa.histgrowth.tsv > ecoli50.gfa.histgrowth.tsv.pdf

Then you can compare the γ-values from panacus with odgi heap.

mb47 commented 1 year ago

Thanks for the prompt reply - that's excellent! I'll give panacus a shot too. Many thanks Micha

subwaystation commented 1 year ago

I just ran odgi heaps and panacus on the same data set, and par[2] gives you the alpha-value. That's the alternative power law model mentioned in the paper. Taking a look at the function in R that is being optimized, this also fits. The numbers don't match 100%, because the Rscript estimates the curve on the number of given permutations, panacus is able to directly calculate the true curve for all possible permutations. Once ready, I will share a tutorial with more details.

One more comment: For α > 1, the pan-genome is closed, and for α < 1, the pan-genome is open. Heaps’ law and power law are mathematically similar. As also mentioned in https://en.wikipedia.org/wiki/Pan-genome#Classification, this threshold was chosen arbitrarily. So classify whatever makes you happy :P

subwaystation commented 1 year ago

I updated the Rscript in the master branch and it now displays the actual alpha value in the produced plot.

mb47 commented 1 year ago

Excellent -- many thanks for the prompt help! Works a treat: heaps

So this pan-genome is officially still open, with an alpha value of 0.197.

subwaystation commented 1 year ago

Hi @mb47,

sleeping over it, I realized, I made a mistake. What the RScript is giving you is the γ-value: For γ < 0, the pan-genome is closed, and its size approaches a constant as more genomes are used, whereas for γ > 0, the pan-genome is open, and its size increases as more genomes are included. Taking a look at your value and the curve, you clearly have an open pangenome. Sorry for this, I updated the Rscript accordingly.

subwaystation commented 1 year ago

Also thanks for the feedback and for testing this. It helped to improve odgi ;)

mb47 commented 1 year ago

No problem - thanks for clearing this up!

pangenome / odgi

Interpreting output from heaps_fit.R #503