jeffalstott / powerlaw

Don't throw out the zeros? #32

Closed Jophelias closed 8 years ago

Jophelias commented 8 years ago

Hi Jeff et al.,

Great package. I'm using it extensively in a power law class I'm taking.

I'm looking at the Freeman centrality distribution of a sparse network with a lot of isolates, whose centralities are therefore 0. When I try to fit the centralities to a power law distribution using your package, the Fit call throws out all the zeros, but those are legitimate values.

The result is a warning:

Values less than or equal to 0 in data. Throwing out 0 or negative values

Calculating best minimal value for power law fit

...and of course, since the xmin value is calculated from only the non-zero values, I get a much higher alpha, which is not representative of the distribution.

I'm not sure if this is a feature request to include the zeros, or if I have a fundamental misunderstanding of the package and of Clauset's paper.

P.S. I'm also not sure if this is the appropriate place to bring this up, so sorry in advance.

Joe

jeffalstott commented 8 years ago

A power law has the distribution: p(x) ~ x^a

Stick 0 in for x. What do you get?

If a is positive: p(0) ~ 0^a = 0

If a is negative: p(0) ~ 0^a = 1/0^|a| = undefined

From Clauset et al. (right after equation 2.1): "Clearly this density diverges as x → 0".
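To make the divergence concrete, here is a minimal numerical sketch. It assumes the Clauset convention p(x) ∝ x^(-alpha) with alpha > 1; the helper names `power_law_density` and `tail_mass` and the exponent 2.5 are made up for illustration and are not part of powerlaw.

```python
def power_law_density(x, alpha):
    """Unnormalized power-law density x**(-alpha), defined only for x > 0."""
    if x <= 0:
        raise ValueError("power-law density is not defined for x <= 0")
    return x ** (-alpha)


def tail_mass(xmin, alpha):
    """Integral of x**(-alpha) from xmin to infinity (requires alpha > 1)."""
    if xmin <= 0 or alpha <= 1:
        raise ValueError("need xmin > 0 and alpha > 1")
    return xmin ** (1.0 - alpha) / (alpha - 1.0)


if __name__ == "__main__":
    alpha = 2.5  # assumed exponent, for illustration only
    for xmin in (1.0, 0.1, 0.01, 0.001):
        # Both the density and the tail integral grow without bound
        # as xmin shrinks toward 0.
        print(xmin, power_law_density(xmin, alpha), tail_mass(xmin, alpha))
```

Because the integral from xmin upward blows up as xmin approaches 0, the density cannot be normalized on a range that includes 0, which is why a lower bound xmin > 0 is needed.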

In a network, a centrality of 0 is a valid value. In a power law distribution, it is not. You'll need to model the density of the 0s using some other method. A typical route would be to build a separate model for "0 vs. non-0", and then use another model for ">0".
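A rough sketch of that two-part approach, using only the standard library: the fraction of zeros is estimated directly, and the exponent of the positive tail uses the continuous maximum-likelihood estimator from Clauset et al., alpha = 1 + n / sum(ln(x_i / xmin)). The function name `fit_zero_hurdle_power_law` is hypothetical, and xmin is assumed to be supplied by the caller rather than selected automatically.

```python
import math


def fit_zero_hurdle_power_law(data, xmin):
    """Two-part model: P(x == 0), plus a continuous power law for x >= xmin."""
    if xmin <= 0:
        raise ValueError("xmin must be positive")
    if not data or any(x < 0 for x in data):
        raise ValueError("data must be non-empty and non-negative")

    # Part 1: model "0 vs. non-0" as a simple proportion.
    n_zero = sum(1 for x in data if x == 0)
    p_zero = n_zero / len(data)

    # Part 2: model the tail (x >= xmin) with the continuous power-law MLE.
    # Values strictly between 0 and xmin belong to neither part's fit.
    tail = [x for x in data if x >= xmin]
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one tail value above xmin")
    alpha = 1.0 + len(tail) / log_sum

    return {"p_zero": p_zero, "alpha": alpha, "n_tail": len(tail)}


if __name__ == "__main__":
    # Made-up centralities: four isolates plus a handful of positive values.
    centralities = [0.0, 0.0, 0.0, 0.0, 0.01, 0.01, 0.02, 0.05, 0.2]
    print(fit_zero_hurdle_power_law(centralities, xmin=0.01))
```

In practice the positive values could instead be passed to `powerlaw.Fit`, which also chooses xmin; the zero fraction would still have to be reported separately.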

SOME of the other distributions implemented in powerlaw are defined at 0. It's conceivable that one could want to use powerlaw to fit those distributions to data with zeros. But that's outside the scope of how powerlaw is currently constructed; the very first thing it does is fit the data to a power law. People might want to fit those other distributions to their whole data set (including 0s), but then they can't directly compare that distribution's goodness of fit to that of the power law; they're using different data!

In the absence of a sensible solution to this mathematical and philosophical problem, I'm going to close this issue. If anyone has bright ideas about how to address 0s more elegantly, we can reopen it. A solution to this topic, however, would likely merit publishing an article about it (or at least finding an article that has already been written on the topic).

Jophelias commented 8 years ago

Thank you for such a prompt response, Jeff. I really appreciate the comments and the work that goes into this project.

I see the dilemma.

I guess in the absence of a closed-form solution for this at the moment, I will set my isolates to have a degree of one, so their Freeman centrality will be something on the order of 1/n: a very small number that sidesteps the math for now and will likely give me what I need. I imagine this will become an issue for very large sparse networks, but for now I'll figure it out.
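A quick way to check how much the substituted value matters is to refit with a few different stand-ins for the zeros. The sketch below is a rough illustration only: `alpha_after_replacing_zeros` is a hypothetical helper, the data are made up, and xmin is simply taken to be the smallest value after substitution (powerlaw's own xmin search is not reproduced here). The exponent uses the continuous maximum-likelihood estimator from Clauset et al.

```python
import math


def alpha_after_replacing_zeros(data, eps):
    """Replace zeros with eps, then estimate alpha with xmin = min(data)."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    if any(x < 0 for x in data):
        raise ValueError("data must be non-negative")

    filled = [eps if x == 0 else x for x in data]
    xmin = min(filled)
    # Continuous MLE: alpha = 1 + n / sum(ln(x_i / xmin)).
    log_sum = sum(math.log(x / xmin) for x in filled)
    if log_sum == 0:
        raise ValueError("all values equal xmin; alpha is not identifiable")
    return 1.0 + len(filled) / log_sum


if __name__ == "__main__":
    # Made-up centralities: five isolates and three connected nodes.
    centralities = [0.0] * 5 + [0.1, 0.2, 0.5]
    for eps in (0.05, 0.01, 0.001):
        print(eps, alpha_after_replacing_zeros(centralities, eps))
```

Under these assumptions the estimate moves toward 1 as the stand-in value shrinks, so the fitted exponent reflects the chosen substitute as much as the data.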

If I do publish a paper or run into one related to this topic, I will post under this thread.

Thanks again.