jeffalstott / powerlaw

Fitting issues with standard configuration #61

Open polakowo opened 6 years ago

polakowo commented 6 years ago

Hello,

First of all, thank you for providing such a comprehensive package. I'm a first-time user and cannot figure out how to produce meaningful results, despite having read your paper.

I have the following distribution: Counter({1: 233, 2: 33, 3: 13, 4: 10, 5: 2, 6: 3, 7: 2, 8: 2, 9: 1, 11: 1, 15: 1, 18: 1})
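
For reference, a minimal sketch (not from the original report) of expanding this Counter into the flat sample array that powerlaw.Fit expects:

import numpy as np
import powerlaw

counts = {1: 233, 2: 33, 3: 13, 4: 10, 5: 2, 6: 3, 7: 2, 8: 2, 9: 1, 11: 1, 15: 1, 18: 1}
data = np.repeat(list(counts.keys()), list(counts.values()))  # 302 integer samples

fit = powerlaw.Fit(data)
print(fit.power_law.alpha, fit.power_law.xmin)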

When I feed my dataset to the powerlaw.Fit method I get the warning "RuntimeWarning: invalid value encountered in true_divide (Theoretical_CDF * (1 - Theoretical_CDF))", which, as I discovered in older threads, can be ignored.

The issue is the poor scaling parameters in the output: alpha = 3.81, xmin = 6. When plotting the hypothesized power-law distribution you can clearly see the poor fit. [image: https://user-images.githubusercontent.com/2664108/44064066-92927b2c-9f63-11e8-8651-9063dabe7560.png]

On the other hand, the code provided by Clauset (http://www.santafe.edu/~aaronc/powerlaws/) produces a better fit, with alpha = 2.59, xmin = 1. [image: https://user-images.githubusercontent.com/2664108/44064073-9b1d7864-9f63-11e8-82f9-df286ce29022.png]

I tried a couple of different configurations, but the output stays the same. What is going wrong?

Best regards

jeffalstott commented 6 years ago

Thanks Oleg,

I'm unfamiliar with Counters. What happens if you pass the data as an array? Also, what happens if you mandate xmin=1?
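
A sketch of both suggestions, assuming data is the flat sample array from the snippet above:

import powerlaw

fit_free   = powerlaw.Fit(data)          # let powerlaw search for xmin
fit_pinned = powerlaw.Fit(data, xmin=1)  # mandate xmin = 1; only alpha is estimated
print(fit_free.power_law.alpha, fit_free.power_law.xmin)
print(fit_pinned.power_law.alpha)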

polakowo commented 6 years ago

The Counter is just for the demo; usually I pass a flat array of integers from 1 to 18. I also tried your suggestion of mandating xmin, but I get alpha = 4.87. [image: https://user-images.githubusercontent.com/2664108/44079698-0ae587cc-9faa-11e8-9169-22f126c8b192.png]

Could it be that I'm using Python 3.6? I had to update the code provided by Clauset to work in my environment.

jeffalstott commented 6 years ago

Interesting. What happens if you use any of the example datasets from the paper?
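
For concreteness, a sketch of what trying an example dataset might look like; the file name (words.txt, the word-frequency data from the papers) is illustrative and assumes a local copy:

import numpy as np
import powerlaw

words = np.genfromtxt("words.txt")  # illustrative path to the example dataset
fit = powerlaw.Fit(words, discrete=True)
print(fit.power_law.alpha, fit.power_law.xmin)  # Clauset et al. report roughly alpha = 1.95, xmin = 7 for this data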

polakowo commented 6 years ago

I downloaded the Manuscript_code.ipynb file and reran all the cells; the output is the same as yours.

Is the algorithm in powerlaw different from http://tuvalu.santafe.edu/~aaronc/powerlaws/plfit.py?

jeffalstott commented 6 years ago

powerlaw, plfit, and (hopefully) other implementations all implement the same algorithm as in the Clauset et al. paper. The present behavior is almost certainly a bug outside the core algorithm, possibly due to something breaking with Python 3.6. The fact that the demos from manuscript_code.ipynb still work is heartening.

What happens when you pass floats instead of ints?

And just for my situational awareness, are you running the latest powerlaw version from PyPI?

polakowo commented 6 years ago

Hmm, Jupyter displays 1.4.3 while pip says version 1.4.4 is installed. Am I using the older version? The problem seems to be with conda...
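
One quick way to check which copy the notebook is actually importing (a sketch; the __version__ attribute is assumed to be defined in powerlaw.py):

import powerlaw

print(powerlaw.__file__)     # which installation (conda vs. pip site-packages) is being imported
print(powerlaw.__version__)  # assumed attribute; the version string lives in powerlaw.py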

jeffalstott commented 6 years ago

Conda installs alongside pip installs always seem to cause heartache. Try the latest version, then see what happens if you pass an array of floats.

polakowo commented 6 years ago

After some refactoring: the version string in powerlaw.py was not updated to 1.4.4, which is why I'm seeing 1.4.3 :p Now I'm sure it's the latest version; floats make no difference.

jeffalstott commented 6 years ago

Oops! I better fix that. I'll hold off on pushing an update until we've figured out this issue, though.

Can you post minimum working examples of getting accurate results on one of the demo datasets and inaccurate results on your dataset?

polakowo commented 6 years ago

I've created a pdf with some simple tests.

tests.pdf (https://github.com/jeffalstott/powerlaw/files/2288965/tests.pdf)

jeffalstott commented 6 years ago

Are you not using the discrete=True flag?

polakowo commented 6 years ago

Tried it; the output is the same. Edit: not the same, but similar.

polakowo commented 6 years ago

I'll go grab some sleep, it's 4 am here :) Please keep me updated on your progress.

jeffalstott commented 6 years ago

It cannot be the same output; the discrete and continuous versions are different algorithms. This is described in the powerlaw and Clauset et al. papers. plfit.py assumes inputs that look like integers are discrete, whereas powerlaw does not. Can you show the output when using the discrete=True option for fitting?
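
For reference, a minimal sketch of the requested comparison (assuming data is the integer-valued sample array):

import powerlaw

fit_continuous = powerlaw.Fit(data)                 # continuous (default) fit
fit_discrete   = powerlaw.Fit(data, discrete=True)  # discrete fit
print(fit_continuous.power_law.alpha, fit_continuous.power_law.xmin)
print(fit_discrete.power_law.alpha, fit_discrete.power_law.xmin)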

polakowo commented 6 years ago

powerlaw.Fit([500,150,90,81,75,75,70,65,60,58,49,47,40]).alpha
# 3.1151851915645903

powerlaw.Fit([500,150,90,81,75,75,70,65,60,58,49,47,40], discrete=True).alpha
# 3.0771455810069

plfit.plfit([500,150,90,81,75,75,70,65,60,58,49,47,40])[0]
# 2.71

polakowo commented 6 years ago

xmin = 58.0, xmin = 58.0, and xmin = 47, respectively.

polakowo commented 6 years ago

[images: unknown-3, unknown-4, unknown-5]

jeffalstott commented 6 years ago

For some realistic data (e.g. your original data, which I hope is large), what does the visualized fit look like with discrete=True?

What happens if the length of the data is 100 or greater, so that plfit doesn't do its finite-size adjustment?

Very small datasets, such as the one just given, are problematic for a bunch of reasons.
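
A sketch of one way to produce that visual check, assuming data is the original sample array:

import matplotlib.pyplot as plt
import powerlaw

fit = powerlaw.Fit(data, discrete=True)
powerlaw.plot_pdf(data, label="Data as PDF")          # empirical PDF
fit.power_law.plot_pdf(label="Fitted PDF", ls="--")   # fitted power law above xmin
plt.legend()
plt.show()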

polakowo commented 6 years ago

My data has a length of ~300.

plfit.py gives xmin = 1, alpha = 2.59. [image: unknown-6]

powerlaw.py with discrete=True gives xmin = 2, alpha = 2.38. [image: unknown-7]

polakowo commented 6 years ago

The same applies to data which is 10x larger

jeffalstott commented 6 years ago

Using data on the scale of 3,000 data points, what is the result if you dictate that both use the same xmin, with the discrete=True flag?

I would start with a dataset that you know gives the correct answer (one of the discrete demos) and slowly modify it to look like a problematic dataset, then observe where the difference appears.

At one point many years ago I knew all the internals of plfit.py, and while I do not remember all of it now, I recall it making some decisions that powerlaw purposefully does not replicate (e.g. assuming data that "looks" discrete is discrete, applying a finite-size "correction", etc.). Now that the discrete=True flag is in use for this example, we have likely moved past the first of those differences.
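
On the powerlaw side, pinning xmin is just a keyword argument (a sketch; how to pin xmin in plfit.py depends on that script's own interface and isn't shown here):

import powerlaw

# Force the same xmin that plfit.py selected, so only alpha is estimated.
fit = powerlaw.Fit(data, discrete=True, xmin=1)
print(fit.power_law.alpha)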

polakowo commented 6 years ago

For floats the results are the same. With discrete=True and integer data, powerlaw.py produces a slightly lower alpha but the same xmin.

If I multiply each element in the dataset by 100, the results are the same. If I add 100 to each element, the results vary greatly.

There seems to be some issue with the scale.

jeffalstott commented 6 years ago

I can't speak to what plfit.py is doing, but translating the data by adding a constant k does (and should) change the probabilities. Look at Clauset et al. eq. 2.2. Take alpha = 2, xmin = 1, and x = 2:

(1/1.5)(2.5/1.5)^-2 ≈ 0.24. Add a translation of 1, thereby increasing xmin and x by 1: (1/2.5)(3.5/2.5)^-2 ≈ 0.204.
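
The same arithmetic as a quick sanity check (a sketch, just re-computing the two numbers above in Python):

# p(x) = ((alpha - 1) / xmin) * (x / xmin)**(-alpha), with alpha = 2
p_original   = (1 / 1.5) * (2.5 / 1.5) ** -2   # ~ 0.240
p_translated = (1 / 2.5) * (3.5 / 2.5) ** -2   # ~ 0.204, after shifting xmin and x by 1
print(p_original, p_translated)                # the translated probability differs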

It sounds like powerlaw is giving the expected results on demo data, on synthetically generated float data, and on synthetically generated discrete data (as long as the discrete flag is turned on). So there is, as yet, no bug here.

Again, I can't speak to what plfit.py is doing, though it does appear to implement a "finite-size correction" for small datasets. powerlaw doesn't implement this, because its author considers it philosophically dubious: if you don't have lots of data, you cannot assess whether the structure of that data does or does not follow a power law over several orders of magnitude (and if you don't have several orders of magnitude, you shouldn't be trying to argue it's a power law). That said, if someone wanted to implement optional finite-size corrections in powerlaw and submit them as a pull request, they'd be taken on board.

rhjohnstone commented 6 years ago

I guess I'm having similar problems to @polakowo. I get a very poor fit just using the most basic example in the readme (the result is very similar for both discrete=True and discrete=False, and whether the data is recorded as int or float):

import numpy as np
import matplotlib.pyplot as plt
import powerlaw

# Number of occurrences of each value, starting from 1
counts = np.array("534602 192300 61903 25675 12380 6472 3727 2082 1441 908 650"
                  " 488 341 239 182 123 87 76 46 48 43 27 24 16 11 13 14 7 8 9"
                  " 7 7 4 3 3 1 0 2 0 1 1 0 0 0 0 1".split()).astype(int)

# Expand the counts into a flat array of samples
data = []
for i, count in enumerate(counts):
    data += [i + 1] * count
data = np.array(data)

powerlaw.plot_pdf(data, label="Data as PDF")

fit = powerlaw.Fit(data, discrete=True)
fit.plot_pdf(label="Fitted PDF")
plt.legend(loc=3, fontsize=14);

which outputs: [image]

Am I just plotting the results incorrectly, or is something deeper going on?

Edit: I see now that the "fitted" plot is just using data points >= fit.power_law.xmin, so it's just plotting the data but re-normalized? In which case, is there a simple function to just plot the fitted power curve?

Edit 2: I should be using fit.power_law.plot_pdf(label="Fitted PDF"), and not fit.plot_pdf(label="Fitted PDF"). This is the updated version:

fit = powerlaw.Fit(data, discrete=True)
# Plot only the data above the fitted xmin, so the comparison covers the fitted range
powerlaw.plot_pdf(data[data >= fit.power_law.xmin], label="Data as PDF")
print(fit.power_law.alpha)
print(fit.power_law.xmin)
R, p = fit.distribution_compare('power_law', 'lognormal')
# power_law.plot_pdf plots the fitted power law itself, not the re-normalized data
fit.power_law.plot_pdf(label="Fitted PDF", ls=":")
plt.legend(loc=3, fontsize=14);

which outputs: [image] I guess this is OK, but I'm surprised at how large fit.power_law.xmin is. Just inspecting the original data plot, the line appears straight enough down to 2 or 3. Is there an obvious reason (that I can fix) why it's choosing such a high value?

jeffalstott commented 6 years ago

@rhjohnstone Thanks for the edits; it saved me typing! :)

To your question: xmin is determined by finding the value that minimizes the Kolmogorov-Smirnov (KS) distance between the data above that xmin and the power law fitted to it. Take a look at the Clauset et al. or Alstott et al. papers for more info.
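
To see why such a large xmin wins, you can inspect the KS distance across the candidate xmin values that the fit scanned (a sketch; the xmins and Ds attribute names are my assumption of where Fit stores that scan, so check the docstrings):

import matplotlib.pyplot as plt
import powerlaw

fit = powerlaw.Fit(data, discrete=True)
plt.plot(fit.xmins, fit.Ds)  # assumed attributes: candidate xmins and their KS distances
plt.xlabel("candidate xmin")
plt.ylabel("KS distance D")
plt.show()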

The larger issue is that the data is almost certainly not a power law, and even if it is, it doesn't have the interesting properties we look for in a power law. The data shows p(x) dropping ~6 orders of magnitude over ~1.5 orders of magnitude of x. That would be an alpha of ~4, which is very steep. The interesting properties of power laws appear at smaller/shallower alphas (the mean is undefined for alpha <= 2; the variance is undefined for alpha <= 3). Also, the data spans only ~1.5 orders of magnitude, so there is likely not enough range to differentiate it from an exponential. If you use distribution_compare you'll probably find the power-law fit is less likely than the exponential.
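
A sketch of that comparison, using the data array from the example above (R > 0 favors the first named distribution, R < 0 the second; p is the significance of that sign):

fit = powerlaw.Fit(data, discrete=True)
R, p = fit.distribution_compare('power_law', 'exponential')
print(R, p)  # R < 0 with small p would mean the exponential fits better than the power law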