jeffalstott / powerlaw

Small sample size fits documentation #74

Closed brunobowden closed 5 years ago

brunobowden commented 5 years ago

This is a request for documentation, but feel free to close it if it doesn't meet the project standard for an "issue". My interest is in applying a power law fit to an investment portfolio, which typically has a small sample size (e.g. 10-30 investments). Naturally the fit is poorer with such limited data, so are there any recommended techniques for doing this with the powerlaw package?

I've researched the project documentation and several of the research papers but couldn't find a good match for this. I suspect it's hard to do this with high confidence but wanted to ask in a public forum to benefit others with the same question. Thanks for providing this tool @jeffalstott

jeffalstott commented 5 years ago

Thank you for asking in a public forum!

Power laws are interesting because of the tails: the rare events are not as rare as in an exponential/normal/etc. distribution. But they're still pretty rare. Because of this, you need a lot of data to have a hope of accurately describing the tail (whether the data follows a power law or otherwise).

Consider a power law with alpha=1.5, which is a very shallow power law with a thick tail. Over a range of 1 order of magnitude (e.g. from x=1 to x=10), the probability will shrink by a factor of 10^1.5 ≈ 32, i.e. to 1/10^1.5 ≈ 0.03, or 3%, of its starting value. How many data points would you need in order to even expect to observe something that happens only 3% of the time? 1/0.03 ≈ 30. (Note that this is just our original 10^1.5, since we computed 1/(1/10^1.5), which is handy.) The situation gets worse the further out into the tail we go, as events become rarer. 2 orders of magnitude out means 1/100^1.5 = 0.1%, which requires 100^1.5 = 1,000 data points!
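The arithmetic above can be checked with a few lines of Python. This is only a sketch and not part of the powerlaw package: the function names are made up for illustration, and it assumes, as the paragraph does, that "probability" means how far the density p(x) ~ x^-alpha has fallen relative to its starting value.

```python
def density_ratio(alpha, orders_of_magnitude):
    """Factor by which a power-law density p(x) ~ x**-alpha falls
    when x grows by the given number of orders of magnitude."""
    return 10.0 ** (-alpha * orders_of_magnitude)


def points_to_expect_one(alpha, orders_of_magnitude):
    """Rough sample size at which an event that rare is expected
    to show up about once (simply 1 / ratio)."""
    return 1.0 / density_ratio(alpha, orders_of_magnitude)


# alpha = 1.5 is the shallow, thick-tailed example from the text.
for orders in (1, 2, 3):
    ratio = density_ratio(1.5, orders)
    needed = points_to_expect_one(1.5, orders)
    print(f"{orders} order(s) of magnitude: ratio ~ {ratio:.3%}, ~{needed:,.0f} points")
```

Changing 1.5 to a steeper alpha, such as 2.5, shows how quickly the required sample size grows as the tail thins.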

Now, how many data points would you need not just to observe a value, but to be confident that an observed value actually occurs 3% or 0.1% or whatever percentage of the time? The answer is definitely "more". How much more? I haven't calculated target sample sizes, but let's say 10 times more. Note that the answer will depend on the value of alpha; steeper power laws with higher alpha will have thinner tails with rare events being rarer, and this will make it even harder to distinguish the power law from another distribution. So, in answer to your question, without further analysis, my rule of thumb might be "For N orders of magnitude of X that you're trying to cover, you probably need at least 10^(N+1.5) data points to begin to have a hope of identifying a power law. Probably more."

However! While you likely can't find a power law with fewer than a few thousand data points, you can still try. This is what the loglikelihood ratio provides us, as implemented in distribution_compare. This automatically takes into account the amount of data at hand. The more data we observe, the more evidence we have that the distribution follows a power law vs. something else. Don't be surprised, however, if the answer is "eh, maybe."
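As a rough illustration of that workflow, the sketch below draws a small synthetic sample and hands it to the package. `sample_power_law` and `compare_to_lognormal` are hypothetical helper names, not part of powerlaw. The synthetic data, the choice of lognormal as the alternative, and the fixed xmin are all assumptions made for the example.

```python
import random


def sample_power_law(n, alpha, xmin=1.0, seed=0):
    """Draw n values from a continuous power law p(x) ~ x**-alpha, x >= xmin,
    by inverse-transform sampling (illustrative helper, stdlib only)."""
    rng = random.Random(seed)
    # 1 - rng.random() lies in (0, 1], so the base is never zero.
    return [xmin * (1.0 - rng.random()) ** (-1.0 / (alpha - 1.0)) for _ in range(n)]


def compare_to_lognormal(data, xmin=None):
    """Fit the data with powerlaw and compare power law vs. lognormal.

    Returns (alpha, sigma, R, p): the fitted exponent, its standard error,
    the loglikelihood ratio, and the p-value for that ratio.
    """
    import powerlaw  # third-party: pip install powerlaw

    fit = powerlaw.Fit(data, xmin=xmin)
    R, p = fit.distribution_compare("power_law", "lognormal")
    return fit.power_law.alpha, fit.power_law.sigma, R, p
```

Calling `compare_to_lognormal(sample_power_law(25, 1.8), xmin=1.0)` mimics a portfolio-sized sample. A positive R favors the power law and a negative R the lognormal, but the sign only means something when p is small. With a few dozen points, a large p (the "eh, maybe" case) should be expected.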

brunobowden commented 5 years ago

@jeffalstott - thanks for a very thoughtful response. When you don't know what distribution the data has, then I understand needing more data. When you assume that it already has a power law distribution then you can focus on what the best fit is. I will experiment some more with this and see what I can learn.

For venture investing, alpha typically varies between 1.6 and 2.3 (see table below), which further constrains the fit:

| Fund Size | Alpha |
|---|---|
| <$100M | 1.68 |
| $100M-$250M | 1.85 |
| $250M-$500M | 1.84 |
| $500M-$1B | 2.27 |
| >$1B | 1.89 |

Source: http://reactionwheel.net/2015/06/power-laws-in-venture.html
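
If the power law form is taken as given, the remaining question is how tightly 10-30 points pin down alpha. The sketch below uses the standard maximum-likelihood estimator for a continuous power law, alpha = 1 + n / sum(ln(x_i / xmin)), with approximate standard error (alpha - 1) / sqrt(n), as given in Clauset, Shalizi and Newman (2009). `fit_alpha` is an illustrative name, and treating xmin as known in advance is an assumption that a real portfolio may not satisfy.

```python
import math


def fit_alpha(data, xmin):
    """Maximum-likelihood alpha for a continuous power law, using only x >= xmin.

    Returns (alpha_hat, standard_error, n_tail). The standard error is the
    large-sample approximation (alpha_hat - 1) / sqrt(n_tail).
    """
    tail = [x for x in data if x >= xmin]
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    if n == 0 or log_sum == 0.0:
        raise ValueError("need at least one data point strictly above xmin")
    alpha_hat = 1.0 + n / log_sum
    standard_error = (alpha_hat - 1.0) / math.sqrt(n)
    return alpha_hat, standard_error, n


# Approximate standard error if the true alpha were 1.85 (a value from the table).
for n in (10, 30, 1000):
    print(n, round((1.85 - 1.0) / math.sqrt(n), 3))
```

For alpha near 1.85 the approximate standard error is about 0.27 at n = 10 and 0.16 at n = 30. That is comparable to the differences between most rows of the table, so a single portfolio of that size cannot reliably distinguish between those alpha values.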