You might be right! The `Distribution` class has the right code here:
https://github.com/jeffalstott/powerlaw/blob/6732699d790edbe27c2790bf22c3ef7355d2b07e/powerlaw.py#L879-L884
However, the `Power_Law` class appears to be squashing the definition of `_pdf_continuous_normalizer` with its own, which doesn't have any plumbing to handle `xmax`.
Can you test this directly by making a continuous pdf from real numbers (e.g. between 2 and 10 with an xmax of 2) and manually adding up the probabilities? It should be 1; if it's much less than 1, then we're losing something from the tail above `xmax`.
Thanks, Jeff
I tried to look at both the discrete and continuous cases.
Continuous: if you use `discrete=False`, the normalizer does not go through the `_pdf_continuous_normalizer` function with `xmax`. I generated a pdf between 2 and 10, and the summation of the pdf was about 0.8. After correcting the normalization constant, the summation was 1.002, so this part looks reasonable after the fix, I think. You can change the continuous normalizer to the following line:

```python
return (1 - self.alpha) / (self.xmax**(1 - self.alpha) - self.xmin**(1 - self.alpha))
```
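As a sanity check on those numbers, the missing mass can be reproduced outside powerlaw with the two closed-form normalizers (alpha = 2 is assumed here just for illustration):

```python
import numpy as np

alpha, xmin, xmax = 2.0, 2.0, 10.0
x = np.linspace(xmin, xmax, 100_000, endpoint=False)
dx = (xmax - xmin) / len(x)

# Old normalizer: ignores xmax and normalizes over [xmin, infinity)
C_old = (alpha - 1) * xmin**(alpha - 1)
mass_old = np.sum(C_old * x**-alpha) * dx
print(mass_old)  # about 0.8: the mass above xmax has been lost

# Fixed normalizer: normalizes over [xmin, xmax]
C_new = (1 - alpha) / (xmax**(1 - alpha) - xmin**(1 - alpha))
mass_new = np.sum(C_new * x**-alpha) * dx
print(mass_new)  # about 1.0
```

The 0.8 matches the analytic value: the integral of 2·x⁻² from 2 to 10 is 2·(1/2 − 1/10) = 0.8.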
However, for the discrete case, even though the code goes to the correct normalizer, the cdf used in the normalizer is `1 - zeta(alpha, x)`:
https://github.com/jeffalstott/powerlaw/blob/6732699d790edbe27c2790bf22c3ef7355d2b07e/powerlaw.py#L1175
Are you sure this is the correct cdf? For example, if x = 1 and alpha = 2, then zeta(2, 1) > 1, so the cdf 1 - zeta(2, 1) is negative.
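The negative value can be confirmed directly with scipy's Hurwitz zeta; below is also a sketch (not powerlaw's code) of a discrete CDF that stays in [0, 1] by normalizing with zeta(alpha, xmin):

```python
from scipy.special import zeta  # zeta(s, q) is the Hurwitz zeta function

alpha, xmin = 2.0, 1

# The form questioned above goes negative: zeta(2, 1) = pi^2/6 > 1
print(1 - zeta(alpha, xmin))  # about -0.645

# A properly normalized discrete CDF for p(k) = k**-alpha / zeta(alpha, xmin):
# P(X <= x) = 1 - zeta(alpha, x + 1) / zeta(alpha, xmin)
def discrete_cdf(x, alpha, xmin):
    return 1 - zeta(alpha, x + 1) / zeta(alpha, xmin)

print(discrete_cdf(1, alpha, xmin))  # P(X = 1) = 1 / zeta(2, 1), about 0.608
```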
Discrete: The CDF gets calculated using the `Distribution` class's `cdf` function, which accounts for `xmax`:
https://github.com/jeffalstott/powerlaw/blob/6732699d790edbe27c2790bf22c3ef7355d2b07e/powerlaw.py#L782-L783
What about the PDF?
Continuous: Looks like you have found a bug! Can you submit a pull request?
In both cases, since the CDF is fine, the bug in the PDF shouldn't affect the selection of `xmin`. It's not immediately clear to me what the effect of the unnormalized PDF is; the effect might only be on plotting (e.g. it scoots the numbers up and down). Thoughts?
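The "scoots the numbers up and down" guess can be checked directly: the no-xmax normalizer differs from the truncated one only by a constant factor, so on a log-log plot the buggy PDF is the correct PDF shifted vertically. A small sketch (parameter values chosen arbitrarily):

```python
import numpy as np

alpha, xmin, xmax = 2.5, 1.0, 100.0
x = np.linspace(xmin, xmax, 1000)

# PDF normalized over [xmin, infinity) -- the buggy, no-xmax version
pdf_old = (alpha - 1) * xmin**(alpha - 1) * x**-alpha

# PDF normalized over [xmin, xmax] -- the truncated version
C_new = (1 - alpha) / (xmax**(1 - alpha) - xmin**(1 - alpha))
pdf_new = C_new * x**-alpha

ratio = pdf_old / pdf_new
print(ratio.min(), ratio.max())  # the ratio is the same constant everywhere
```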
The issue here is when the power law displays finite-size effects, causing the tail to decay slower than expected for the scale-free portion of the PDF and yielding alphas sometimes as large as 8 or more, which are nonsensical.
This is a known problem with the Clauset MLE estimator, and there have been several proposed solutions for applying it to real-world, finite data: https://www.pnas.org/content/118/2/e2013825118.short https://www.frontiersin.org/articles/10.3389/fphys.2016.00250/full
I have found that, when using this tool (which is very helpful, BTW), it unfortunately frequently gives crazy large alphas for PL distributions because of this effect, and some care is needed to apply it properly. This makes it difficult to use this package as a plug-in tool for other packages, like the weightwatcher project.
Here, the suggestion is to modify the Clauset normalization to explicitly account for the finite size of the data, so that results would be different when xmax is specified explicitly by the user.
Note that if you add this fix and make it the default, it may break back-compatibility with older work using this package (such as my own work in Nature: https://www.nature.com/articles/s41467-021-24025-8), and you should retain the old behavior as an option for full back-compatibility.
Thanks, Charles! I am confused: I don't see how dynamics of "approaching the limit" of `xmax` relate to whether or not `powerlaw` is chopping off probability mass for values higher than `xmax`. Am I missing something?
There are a number of models for dealing with the cutoff or finite-size effects that arise when trying to model real-world, finite data with a simple power-law distribution.
Here's a simple example from a model called the Upper Truncated Power Law (https://www.researchgate.net/publication/263801375_Upper-truncated_power_law_distributions), where it is clear that one can obtain different alphas for the same distribution depending on the underlying PL model.
The suggestion in this thread is to let the user specify a different PL model (i.e., with a different normalization) in order to model situations like this by, say, setting xmax explicitly. And it seemed like this was actually the intent of the package, but there may have been a bug?
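To make "different alphas for the same data" concrete, here is a self-contained sketch (it does not use powerlaw; the sampler and likelihoods follow the standard closed forms, with parameter values invented for illustration). Fitting upper-truncated samples with the no-cutoff Clauset MLE overestimates alpha, while an MLE that includes xmax in the normalization recovers it:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
alpha_true, xmin, xmax, n = 2.0, 1.0, 10.0, 5000

# Sample from an upper-truncated power law by inverting its CDF
a = 1 - alpha_true
u = rng.random(n)
x = (xmin**a + u * (xmax**a - xmin**a)) ** (1 / a)

# Clauset-style MLE, which assumes no upper cutoff
alpha_untrunc = 1 + n / np.sum(np.log(x / xmin))

# MLE under the truncated model: maximize the truncated log-likelihood numerically
def nll(alpha):
    log_C = np.log((1 - alpha) / (xmax**(1 - alpha) - xmin**(1 - alpha)))
    return -(n * log_C - alpha * np.sum(np.log(x)))

alpha_trunc = minimize_scalar(nll, bounds=(1.1, 6.0), method="bounded").x

print(alpha_untrunc)  # noticeably above 2
print(alpha_trunc)    # close to the true alpha of 2
```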
In your powerlaw package, it was unclear what your intent was when the user sets xmax explicitly vs. having the code determine xmax from the input data. Can you clarify the intent of setting the xmax parameter?
"Can you clarify the intent of setting the xmax parameter?"

p(x) ~ x^-alpha when x is between `xmin` and `xmax`, and p(x) = 0 everywhere else.
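A minimal sketch of that definition in code, using the truncated normalizer discussed earlier in this thread (parameter values here are arbitrary):

```python
import numpy as np

def truncated_pl_pdf(x, alpha, xmin, xmax):
    # p(x) ~ x**-alpha between xmin and xmax, and p(x) = 0 everywhere else,
    # normalized so the density integrates to 1 over [xmin, xmax].
    C = (1 - alpha) / (xmax**(1 - alpha) - xmin**(1 - alpha))
    x = np.asarray(x, dtype=float)
    return np.where((x >= xmin) & (x <= xmax), C * x**-alpha, 0.0)

# Density is zero outside [xmin, xmax]:
print(truncated_pl_pdf([1.0, 5.0, 12.0], alpha=2.0, xmin=2.0, xmax=10.0))
```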
"Here, in this code, it was unclear what your intent was when the user sets xmax explicitly vs having the code determine xmax from the input data?"

In `powerlaw`, `xmax` is only ever set by the user.
Thanks!
What I mean is: what is the intended normalization? This thread is asking if there is a bug in the choice of normalization.
The choice (simply cut off all probability above `xmax`) may or may not have been good. Regardless, it appears to not be implemented for the PDFs!
I take it, then, that there is indeed a bug in the normalization of the PDF when xmax is set explicitly?
If so, and you make a change, it would be helpful to have the older version available as a fallback option, so that our work in the field remains back-compatible with new versions of the package.
Seems like it! @nsfzyzz Do you have a pull request?
Regardless of any fixes to `powerlaw`, older versions will stay up on both PyPI and GitHub. What is the behavior that you observed that was important to your work?
The current research uses the existing implementation of the normalization of the PDF, even when xmax is set explicitly. Old code versions become stale and difficult to support over time: Python changes, other packages change, etc. Moreover, it would be desirable to have both normalizations available. Without both, I would need to fork the code to ensure both normalization choices remain available, and I would prefer not to do that.
Thanks @charlesmartin14. What was the algorithm you were expecting `powerlaw` to run when you called it with an `xmax` value? I.e., what did you want to have happen?
I expected that it was finding the optimal alpha using the Clauset MLE approach, as described in this paper (https://arxiv.org/abs/0706.1062), and with the normalization as described in the paper.
According to the documentation (i.e., the 2014 PLOS paper), I expected that, as stated on page 4: "Any data above xmax is ignored for fitting."
(And from a quick glance at references [3, 4], I don't find any specific discussion where the normalization is redefined.)
I updated the code to use the new pdf continuous normalizer. Based on the discussion here, it looks like the best way to solve this is to use a parameter to select which normalizer we want, so I used the parameter `pdf_ends_at_xmax`. If it is set to True, the code uses the new normalizer; otherwise, the code uses the old one.
Here is some code to test the new feature.
```python
import powerlaw
import numpy as np

xmin, xmax, num_steps = 2, 10, 1000
pdf_ends_at_xmax = True
# pdf_ends_at_xmax = False

data = np.linspace(xmin, xmax, num_steps, endpoint=False)
step_size = (xmax - xmin) / num_steps

Distribution1 = powerlaw.Power_Law(xmin=xmin, xmax=xmax, data=data,
                                   pdf_ends_at_xmax=pdf_ends_at_xmax)
Distribution1.alpha = 2.0
pdf = Distribution1.pdf(data) * step_size
print(sum(pdf))  # should be close to 1 when pdf_ends_at_xmax is True

# Test the following to see if the Fit class uses the correct normalizer:
# fit1 = powerlaw.Fit(xmin=xmin, xmax=xmax, data=data, pdf_ends_at_xmax=pdf_ends_at_xmax)
# fit1 = powerlaw.Fit(xmax=xmax, data=data, pdf_ends_at_xmax=pdf_ends_at_xmax)
```
Thanks @nsfzyzz! Before integrating your pull request, a question for the room: it would be best to have a solution that is adaptable across as many different distributions in the `Distribution` class as possible. The `cdf` functions can enable this, since the probability mass in the tail is given by the CDF, so you know how much to cut off. Can we implement a general solution across all of `Distribution`? I am sure this was once my intent, but it has been nearly a decade since I wrote `powerlaw`, and there may have been a gremlin in the implementation that kept me away that I have now forgotten.
If there is a general way to cut the dropping part of the tail off (i.e., to set xmax), that would be helpful.
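A sketch of that general idea (assuming scipy.stats-style distribution objects rather than powerlaw's own classes): any distribution that exposes a cdf can be truncated by dividing its pdf by the CDF mass inside [xmin, xmax]:

```python
import numpy as np
from scipy import stats

def truncate(dist, xmin, xmax):
    # Renormalize any distribution with a cdf to [xmin, xmax]:
    # divide the pdf by the CDF mass inside the window.
    mass = dist.cdf(xmax) - dist.cdf(xmin)

    def pdf(x):
        x = np.asarray(x, dtype=float)
        return np.where((x >= xmin) & (x <= xmax), dist.pdf(x) / mass, 0.0)

    return pdf

# Works for any distribution with a cdf, e.g. a lognormal truncated to [0.5, 4.0]:
pdf = truncate(stats.lognorm(s=1.0), 0.5, 4.0)
x = np.linspace(0.5, 4.0, 200_000, endpoint=False)
total_mass = np.sum(pdf(x)) * (4.0 - 0.5) / 200_000
print(total_mass)  # close to 1
```

This is the same renormalization as the power-law fix above, just expressed through the CDF so it applies to every distribution uniformly.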
Hi Jeff,
Thanks for providing the great tool!
I have a question about the powerlaw fitting with xmax. IMHO, if I fit a power law with a given xmax, I should normalize the probability density function with the xmax. For example, if I have a power law that only has an xmin, the pdf is

p(x) = (alpha - 1) / xmin * (x / xmin)^(-alpha)

If I have both xmin and xmax, the pdf should be

p(x) = (1 - alpha) / (xmax^(1 - alpha) - xmin^(1 - alpha)) * x^(-alpha)
Does it mean we need to change the pdf here?
https://github.com/jeffalstott/powerlaw/blob/6732699d790edbe27c2790bf22c3ef7355d2b07e/powerlaw.py#L1188
When I give the xmax, the code routes me to this normalizer, which does not seem to take xmax into account and still uses the pdf with only xmin. I don't know if this is something that needs to be fixed or something that you intended.
Thanks!