jeffalstott / powerlaw


Fitting a powerlaw with the xmax parameter #89

Closed nsfzyzz closed 1 year ago

nsfzyzz commented 3 years ago

Hi Jeff,

Thanks for providing the great tool!

I have a question about power-law fitting with xmax. IMHO, if I fit a power law with a given xmax, I should normalize the probability density function using that xmax. For example, if I have a power law with only an xmin, the pdf is

`p(x) = ((alpha - 1) / xmin) * (x / xmin)^(-alpha)`

If I have both xmin and xmax, the pdf should be

`p(x) = ((1 - alpha) / (xmax^(1 - alpha) - xmin^(1 - alpha))) * x^(-alpha)`

Does it mean we need to change the pdf here? https://github.com/jeffalstott/powerlaw/blob/6732699d790edbe27c2790bf22c3ef7355d2b07e/powerlaw.py#L1188

When I give an xmax, the code routes me to this normalizer, which does not seem to take xmax into account and still uses the pdf with only xmin. I don't know whether this is something that needs to be fixed or behavior you intended.
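For reference, here is a quick standalone check (using scipy rather than powerlaw itself, with arbitrary example values alpha = 2.5, xmin = 2, xmax = 10) that the xmin-only normalization loses mass on [xmin, xmax], while the xmin-and-xmax normalization integrates to 1:

```python
from scipy import integrate

alpha, xmin, xmax = 2.5, 2.0, 10.0  # example values only

# pdf normalized with xmin alone (support [xmin, infinity)):
pdf_xmin_only = lambda x: (alpha - 1) / xmin * (x / xmin) ** -alpha

# pdf normalized for the truncated support [xmin, xmax]:
C = (1 - alpha) / (xmax ** (1 - alpha) - xmin ** (1 - alpha))
pdf_truncated = lambda x: C * x ** -alpha

print(integrate.quad(pdf_xmin_only, xmin, xmax)[0])  # ~0.91: mass above xmax is missing
print(integrate.quad(pdf_truncated, xmin, xmax)[0])  # ~1.0
```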

Thanks!

jeffalstott commented 3 years ago

You might be right!

  1. For discrete distributions, it looks like the code is handling things fine. Can you test this directly by making a discrete pdf from real numbers (e.g. data between 2 and 10, with xmin = 2 and xmax = 10) and manually adding up the probabilities? The sum should be 1. (A standalone sketch of this kind of check follows this list.)
  2. For continuous distributions, it looks like the larger Distribution class has the right code here: https://github.com/jeffalstott/powerlaw/blob/6732699d790edbe27c2790bf22c3ef7355d2b07e/powerlaw.py#L879-L884 However, the Power_Law class appears to be overriding the definition of _pdf_continuous_normalizer with its own, which doesn't have any plumbing to handle xmax. Can you test this directly by making a continuous pdf from real numbers (e.g. data between 2 and 10, with xmin = 2 and xmax = 10) and manually adding up the probabilities? If the sum is much less than 1, then we're losing something from the tail above xmax.
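
For the discrete case, here is one way to do the manual sum from item 1 as a standalone sketch, using scipy's Hurwitz zeta directly rather than powerlaw itself (alpha = 2.5, xmin = 2, xmax = 10 are arbitrary example values):

```python
import numpy as np
from scipy.special import zeta  # zeta(s, q) is the Hurwitz zeta function

alpha, xmin, xmax = 2.5, 2, 10  # example values only
x = np.arange(xmin, xmax + 1, dtype=float)

# Normalized over [xmin, infinity): the sum over [xmin, xmax] falls short of 1
pmf_xmin_only = x ** -alpha / zeta(alpha, xmin)
print(pmf_xmin_only.sum())  # < 1: the tail above xmax is unaccounted for

# Normalized over [xmin, xmax]: the sum is 1
pmf_truncated = x ** -alpha / (zeta(alpha, xmin) - zeta(alpha, xmax + 1))
print(pmf_truncated.sum())  # ~1.0
```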

Thanks, Jeff

nsfzyzz commented 3 years ago

I tried to look at both the discrete and continuous cases.

If you use discrete=False, the normalizer does not go to the _pdf_continuous_normalizer function that handles xmax. I generated a pdf between 2 and 10, and the sum of the pdf is about 0.8. After correcting the normalization constant, the sum of the pdf is 1.002, so I think this part is reasonable after the fix. The continuous normalizer can be changed to the following line: `return (1-self.alpha)/(self.xmax**(1-self.alpha) - self.xmin**(1-self.alpha))`
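
As a sketch only (the method name comes from the linked source; the xmin-only branch is written from the pdf above and may not match the existing code line for line), the change could look something like:

```python
# Inside the Power_Law class (sketch, not a drop-in patch):
def _pdf_continuous_normalizer(self):
    if self.xmax is not None:
        # Normalize over [xmin, xmax] so the truncated pdf integrates to 1
        return (1 - self.alpha) / (self.xmax ** (1 - self.alpha) - self.xmin ** (1 - self.alpha))
    # Original behavior: normalize over [xmin, infinity)
    return (self.alpha - 1) * self.xmin ** (self.alpha - 1)
```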

However, for the discrete case, even though the code goes to the correct normalizer, the cdf used in the normalizer is `1 - zeta(alpha, x)`: https://github.com/jeffalstott/powerlaw/blob/6732699d790edbe27c2790bf22c3ef7355d2b07e/powerlaw.py#L1175

Are you sure this is the correct cdf? For example, if x=1 and alpha=2, then zeta(2,1) > 1, so the cdf 1 - zeta(2,1) comes out negative.
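
To make this concrete, here is a quick check with scipy, comparing against the standard discrete power-law CDF from Clauset et al. (the assumption being that this is the form the code intends):

```python
from scipy.special import zeta  # zeta(s, q) is the Hurwitz zeta function

alpha, xmin, x = 2.0, 1, 1

# The expression currently used in the code:
print(1 - zeta(alpha, x))  # 1 - zeta(2, 1) = 1 - pi**2/6, about -0.645 (negative!)

# The discrete power-law CDF normalized with xmin:
# P(X <= x) = 1 - zeta(alpha, x + 1) / zeta(alpha, xmin)
print(1 - zeta(alpha, x + 1) / zeta(alpha, xmin))  # about 0.608, a valid probability
```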

jeffalstott commented 3 years ago

In both cases, since the CDF is fine, the bug in the PDF shouldn't affect the selection of xmin. It's not immediately clear to me what the effect of the unnormalized PDF is; it might only affect plotting (e.g. it scoots the curve up and down). Thoughts?

charlesmartin14 commented 3 years ago

The issue here is when the power law displays finite-size effects, causing the tail to decay more slowly than expected for the scale-free portion of the PDF and yielding alphas sometimes as large as 8 or more, which are nonsensical.

This is a known problem with the Clauset MLE estimator, and several solutions have been proposed for applying it to real-world, finite data: https://www.pnas.org/content/118/2/e2013825118.short https://www.frontiersin.org/articles/10.3389/fphys.2016.00250/full

I have found that this tool, which is very helpful BTW, unfortunately often gives crazily large alphas for PL distributions because of this effect, and some care is needed to apply it properly. This makes it difficult to use this package as a plug-in tool for other packages, like the weightwatcher project.

Here, the suggestion is to modify the Clauset normalization to try to explicitly account for the finite size of the data, so that results would be different when xmax is specified explicitly by the user.

Note that if you add this fix and make it the default, it may break backward compatibility with older work using this package (such as my own work in Nature: https://www.nature.com/articles/s41467-021-24025-8), so you should retain the current behavior as an option for full backward compatibility.

jeffalstott commented 3 years ago

Thanks, Charles! I am confused: I don't see how the dynamics of "approaching the limit" of xmax relate to whether or not powerlaw is chopping off probability mass for values higher than xmax. Am I missing something?


charlesmartin14 commented 3 years ago

There are a number of models for dealing with the cutoff or finite-size effects that arise when trying to model real-world, finite data with a simple power-law distribution.

Here's a simple example from a model called the Upper Truncated Power Law (https://www.researchgate.net/publication/263801375_Upper-truncated_power_law_distributions), where it is clear that one can obtain different alphas for the same distribution depending on the underlying PL model:

[screenshot: figure from the upper-truncated power law paper showing different fitted alphas for the same data under different power-law models]

The suggestion in this thread is to let the user specify a different PL model (i.e. with a different normalization) in order to model situations like this by, say, setting xmax explicitly. It seemed like this was actually the intent of the package, but there may have been a bug?

In your powerlaw package, it was unclear what your intent was when the user sets xmax explicitly vs. having the code determine xmax from the input data. Can you clarify the intent of setting the xmax parameter?

jeffalstott commented 3 years ago

"Can you clarify the intent of setting the xmax parameter ?" p(x) ~ x^-alpha when between xmin and xmax, and p(x)=0 everywhere else.

"Here, in this code, it was unclear what your intent was when the user sets xmax explicitly vs having the code determine xmax from in input data?" In powerlaw, xmax is only ever set by the user.

Thanks!


charlesmartin14 commented 3 years ago

What I mean is: what is the intended normalization, given that this thread is asking whether there is a bug in the choice of normalization?

jeffalstott commented 3 years ago

The choice (simply cut off all probability above xmax) may or may not have been good. Regardless, it appears not to be implemented for the PDFs!

charlesmartin14 commented 3 years ago

I take it, then, that there is indeed a bug in the normalization of the PDF when xmax is set explicitly?

If so, and you make a change, it would be helpful to have the older behavior available as a fallback option, so that our work in the field stays compatible with new versions of the package.

jeffalstott commented 3 years ago

Seems like it! @nsfzyzz Do you have a pull request?

Regardless of any fixes to powerlaw, older versions will stay up on both PyPI and GitHub. What is the behavior you observed that was important to your work?

charlesmartin14 commented 3 years ago

The current research uses the existing implementation of the normalization of the PDF, even when xmax is set explicitly:

See: https://github.com/CalculatedContent/WeightWatcher/blob/b32f4876659275b8148b024f9a4fed95b1524e90/weightwatcher/weightwatcher.py#L1975

Old code versions become stale and difficult to support over time; Python changes, other packages change, etc.

Moreover, it would be desirable to have both normalizations available so that users can choose between them.

Without both, I would need to fork the code to ensure both normalization choices are available. I would prefer not to do that.

jeffalstott commented 3 years ago

Thanks @charlesmartin14. What was the algorithm you were expecting powerlaw to run when you called it with an xmax value? I.e. what did you want to have happen?

charlesmartin14 commented 3 years ago

I expected that it was finding the optimal alpha using the Clauset MLE approach, as described in this paper: https://arxiv.org/abs/0706.1062

and with the normalization as described in the paper:

[screenshot: the normalization as described in Clauset et al. 2009]

According to the documentation (i.e. the 2014 PLOS paper), I expected that, as stated on page 4, any data above xmax is ignored for fitting:

[screenshot: the relevant passage from the 2014 PLOS ONE powerlaw paper]

(and from a quick glance at references [3, 4], I don't find any specific discussion where the normalization is redefined)

nsfzyzz commented 3 years ago

I updated the code to use the new pdf continuous normalizer. Based on the discussion here, it looks like the best way to solve this is to use a parameter to select which normalizer we want, so I added the parameter pdf_ends_at_xmax. If it is set to True, the code uses the new normalizer; otherwise, the code uses the old one.

Here is some code to test the new feature.

```python
import powerlaw
import numpy as np

xmin, xmax, num_steps = 2, 10, 1000
pdf_ends_at_xmax = True
# pdf_ends_at_xmax = False

# Evenly spaced points on [xmin, xmax) and the width of each step
data = np.linspace(xmin, xmax, num_steps, endpoint=False)
step_size = (xmax - xmin) / num_steps

# Build a power law with an explicit xmax and the chosen normalizer
Distribution1 = powerlaw.Power_Law(xmin=xmin, xmax=xmax, data=data,
                                   pdf_ends_at_xmax=pdf_ends_at_xmax)
Distribution1.alpha = 2.0

# Riemann sum of the pdf over [xmin, xmax]; should be ~1 with the new normalizer
pdf = Distribution1.pdf(data) * step_size
print(sum(pdf))

# Test the following to see if the Fit class uses the correct normalizer
# fit1 = powerlaw.Fit(xmin=xmin, xmax=xmax, data=data, pdf_ends_at_xmax=pdf_ends_at_xmax)
# fit1 = powerlaw.Fit(xmax=xmax, data=data, pdf_ends_at_xmax=pdf_ends_at_xmax)
```

jeffalstott commented 3 years ago

Thanks @nsfzyzz ! Before integrating your pull request, a question for the room: it would be best to have a solution that is adaptable across as many of the distributions in the Distribution class as possible. The cdf functions could enable this, since the probability mass beyond xmax is given by the CDF, so you know how much to cut off. Can we implement a general solution across all of Distribution? I am sure this was once my intent, but it has been nearly a decade since I wrote powerlaw, and there may have been a gremlin in the implementation that kept me away that I have now forgotten.
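
To illustrate the general mechanism (this is not powerlaw's internals, just a sketch of the idea using scipy.stats distributions as stand-ins):

```python
import numpy as np
from scipy import stats, integrate


def truncated_pdf(dist, x, xmin, xmax):
    """pdf of `dist` restricted to [xmin, xmax], renormalized via its CDF."""
    mass = dist.cdf(xmax) - dist.cdf(xmin)  # probability retained after truncation
    x = np.asarray(x, dtype=float)
    return np.where((x >= xmin) & (x <= xmax), dist.pdf(x) / mass, 0.0)


# Example with a Pareto (a continuous power law with alpha = b + 1 and xmin = scale):
dist = stats.pareto(b=1.0, scale=2.0)  # p(x) ~ x**-2 for x >= 2
total, _ = integrate.quad(lambda x: float(truncated_pdf(dist, x, 2.0, 10.0)), 2.0, 10.0)
print(total)  # ~1.0 regardless of which base distribution is used
```

The same renormalization applies to any distribution that exposes a cdf, which is what would make it a candidate for a general fix across Distribution.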

charlesmartin14 commented 3 years ago

If there is a general way to cut off the dropping part of the tail (i.e. to set xmax), that would be helpful.