ntamas / plfit

Fitting power-law distributions to empirical data, according to the method of Clauset, Shalizi and Newman
GNU General Public License v2.0
47 stars 17 forks source link

underflow in zeta.c #13

Closed dgleich closed 6 years ago

dgleich commented 8 years ago

We are running a large number of powerlaw fits and we need exact p-values. One of the issues we are running into is that we will frequently get underflow errors in the zeta.c command. This kills the command, but I'm wondering if there might not be some other way to handle underflow error when they are used in determining the p-value computation.

We are trying to find an example with a seed that reproduces these issues. We will post an update once we have that!

dgleich commented 8 years ago

I posted a set of degrees here: https://gist.github.com/dgleich/f2e81ff3506d07b9e310e484f7fb3010

$ ./plfit -s 0 -M -p exact failure.degs 
Error at ../src/zeta.c:98 : underflow, Underflow
Aborted (core dumped)
ntamas commented 8 years ago

Strange - your particular file did not cause a crash on my machine (even though I have used the same random seed). But irrespectively of this, I know what causes the problem in most cases. When we are estimating alpha, we use an implementation of the L-BFGS algorithm to find the local extremum of the log-likelihood function. The algorithm is a "black box" - I just provide it with a callback function that evaluates the log-likelihood and estimates its derivative at any given value of alpha. It is the estimation of the derivative that causes problems because it sometimes involves calculating the difference between the value of the Hurwitz zeta function at two points:

g[0] = data->logsum + data->m * (log(gsl_sf_hzeta(x[0] + dx, data->xmin)) -
    log(gsl_sf_hzeta(x[0], data->xmin))) / dx

When the underflow error occurs, the value of gsl_sf_hzeta() is in the order of 1e-200 or so on my machine, and we cannot represent such small values in 64-bit floats. It does not matter that I only need the logarithm of it because I haven't found a way to calculate the logarithm of the Hurwitz zeta without calculating the Hurwitz zeta function first. So, if you happen to know a way to calculate log(gsl_sf_hzeta()), let me know because that would solve this problem. Until then, the best I can do is to pretend that the derivative is zero there, or give up on the optimization of alpha and just generate a new sample for purposes of the p-value computation, but neither of these options is scientifically sound.

dgleich commented 8 years ago

Hmm... the derivative in that case could actually be extremely steep as we are taking logs of very small numbers.

Just to clarify, this is minimizing negative log-likelihood? (or maximizing log-liklihhood?) So if the log(hzeta) is very small, that's "good" for the optimization, right? (I just want to make sure that these points are attractors for the optimization instead of repellors -- if they'd repel the algorithm, then a small perturbation should fix things, but if they attract, then it's more difficult.)

ntamas commented 8 years ago

the derivative in that case could actually be extremely steep as we are taking logs of very small numbers.

That's true, but in the end the derivative will depend on the difference of two such very small numbers, so it could go either way - the derivative could be very steep, but it could even be affected seriously by numeric errors.

Just to clarify, this is minimizing negative log-likelihood? (or maximizing log-liklihhood?)

Yes, we are maximizing the log-likelihood (or minimizing the negative log-likelihood).

So if the log(hzeta) is very small, that's "good" for the optimization, right?

Well, I'm not sure because it's a difference of log(hzeta(x, xmin)) and log(hzeta(x+something_small, xmin)), and in the end we divide it by something_small, so it all depends on the relative magnitude of something_small compared to log(hzeta(x)), which is also very small. It's hard to say anything definite here.

For what it's worth, here's an alternative data set that also causes an underflow:

https://gist.github.com/ntamas/c4aa6226616f7bf6b61d8e3b3dd36dee

This one does not even get to the p-value estimation stage and it is much smaller, so it's better for experimenting and investigating the issue.

dgleich commented 7 years ago

Another way to go here would be to try and use quad or even extended precision to compute these gradients, then convert back to double.

I tried to find some extended precision libraries for the Hurwitz zeta function, and there are a few, but I haven't tracked down easy-to-use versions of them yet.

http://www.davidhbailey.com/dhbsoftware/ - mpfun has mpzeta which I think does something close. and http://www.davidhbailey.com/dhbpapers/lerch.pdf

http://arxiv.org/abs/1309.2877 - https://github.com/fredrik-johansson/arb

Maybe @fredrik-johansson could comment if he has a second on if there would be a super-easy way of increasing the precision here...

fredrik-johansson commented 7 years ago

Are you computing zeta(s,a) for a > 1 and s >> 1? If so, zeta(s,a) ~ a^-s, so using log(zeta(s,a)) ~ -s*log(a) in the right range might be all you need.

Using arb is one option if you need higher precision; you could modify https://github.com/fredrik-johansson/arbcmath to compute the log of Hurwitz zeta accurately with double input/output.

jgmbenoit commented 6 years ago

I have just pull a patch that implements the first (and second) derivative of the Hurwitz zeta funtion. The issue is still present. Any idea ?

ntamas commented 6 years ago

I'm on the road at the moment, but I'll try and test the pull request in the next few days.

jgmbenoit commented 6 years ago

Meanwhile, I realized that indeed we may implement the lnhzeta and lnhzeta_deriv functions instead. I will see what I can do. Have a good trip !

ntamas commented 6 years ago

Yes, we actually need the derivative of log(hzeta(x, xmin)) w.r.t. x so having a separate function for the derivative of hzeta(x, xmin) probably wouldn't help much with the numerical precision issues. Let me know whether you have managed to implement a numerically stable version of lnhzeta_deriv.

jgmbenoit commented 6 years ago

Thanks for the feed back. Today, I managed the numerical analysis for a >> s. Now I have to implement it: the code for hzeta and hzeta_deriv must adapted.

jgmbenoit commented 6 years ago

I am currently incorporating an implementation lnhzeta and lnhzeta_deriv : the issue remains because hzeta is called in plfit_i_ks_test_discrete : can we manage to call lnhzeta here ?

ntamas commented 6 years ago

This is now fixed by #17 - thanks for all your hard work @jgmbenoit, it was very useful!