josemiotto / pylevy

Levy distributions for Python
GNU General Public License v3.0

Computation of cdf, tail approximation, and limits is incorrect #15

Closed ragibson closed 4 years ago

ragibson commented 4 years ago

Sorry for opening up so many issues and for the length of this issue in particular. I was trying to pin down why pylevy's results have extreme discontinuities (the pdf should be infinitely smooth). See below for an example.

[image: levy_discontinuities]

When computing your limit data files, you call _get_closest_approx, seemingly to find where the analytic tail approximation matches the numerical integration results most closely. However, this seems to compute the resulting lower_limit and upper_limit arrays incorrectly.

For example, if you call e.g. _get_closest_approx(alpha=1.0, beta=0.0, upper=False), you'll note that https://github.com/josemiotto/pylevy/blob/64c525f273d00d89cbbe531a6557b17b74d18f88/levy/__init__.py#L390-L402 fails because z[mask] is all NaN, so np.argmin returns 0.
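As a side note on why this fails silently rather than raising an error (a minimal numpy demonstration, not pylevy code):

```python
import numpy as np

# argmin cannot order NaNs: if the array contains any NaN, numpy's argmin
# returns the position of the first NaN, so an all-NaN input silently
# yields index 0 instead of an error.
z = np.full(5, np.nan)
print(int(np.argmin(z)))   # 0

# nanargmin at least fails loudly in this situation.
try:
    np.nanargmin(z)
except ValueError as e:
    print("nanargmin:", e)
```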

1) Couldn't you compute this analytically? E.g. when x > 0, the pdf's asymptotic behavior is monotonically decreasing -- it must approach the true pdf from above.

2) Your minimization of (np.log(z[mask]) - np.log(y[mask])) ** 2.0 is essentially just finding where z[mask] / y[mask] is closest to 1.0, since log is monotonic. That said, why minimize the difference of the logs rather than the difference of the quantities themselves?

3) Your function computes n = 100000 approximations and integrations, but then immediately throws away 95% of these results to trim the domain to ~4900 points in [10, 500] or [-500, -10]. Why?

4) You seem to be assuming that your CDF tail approximation always approaches the true CDF from below in computing 1.0 - _approximate(x, alpha, beta, cdf=True), but this is often not the case. See the figure below.

[image: levy_approximate_cdf]
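Regarding point 2, a quick toy illustration (stand-in arrays, not pylevy's actual grid): since squaring is monotone on absolute values, the argmin of the squared log-difference is exactly the argmin of |log(z/y)|, i.e. the point where the ratio z/y is closest to 1 on a log scale:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(1.0, 2.0, size=100)        # stand-in "numerical" values
z = y * rng.uniform(0.5, 1.5, size=100)    # stand-in "approximation" values

d = np.log(z) - np.log(y)
# Same index either way: squaring is monotone on |d|.
assert np.argmin(d ** 2.0) == np.argmin(np.abs(d))
```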

This seems to be where your all-NaN issue comes from (you end up passing negative values into np.log). I think the error arises from https://github.com/josemiotto/pylevy/blob/64c525f273d00d89cbbe531a6557b17b74d18f88/levy/__init__.py#L339-L347, which I assume is meant to compute the asymptotic tail behavior of the PDF [image: stable_asymptotic_behavior], where you've factored out gamma(alpha+1) = alpha*gamma(alpha) in the else clause.

In the cdf=True branch, how was this approximation computed? Is it in the literature somewhere? At a glance, naive (asymptotic) integration of the PDF seems to suggest a different formula. (It seems I misread the code here, but I still think you are missing a sign(x) or abs(x) in the CDF tail approximation.)
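For reference, the classical heavy-tail result for a standard stable law with alpha < 2 (as given in Nolan and in Samorodnitsky and Taqqu; this is my transcription, so treat the exact constant as an assumption) is P(X > x) ~ (1 + beta) * sin(pi*alpha/2) * Gamma(alpha) / pi * x**(-alpha) as x -> +inf, with the left tail obtained by reflection, since -X is stable with skewness -beta. That reflection is exactly where the sign(x)/abs(x) handling enters. A sketch:

```python
import math

def stable_right_tail(x, alpha, beta):
    """Asymptotic right tail P(X > x) of a standard stable law (alpha < 2),
    as x -> +inf. Transcribed from the classical result; treat the constant
    as an assumption rather than pylevy's exact formula."""
    c = math.sin(math.pi * alpha / 2.0) * math.gamma(alpha) / math.pi
    return (1.0 + beta) * c * x ** (-alpha)

def stable_tail_cdf(x, alpha, beta):
    """Tail CDF: 1 minus the right tail for x > 0; for x < 0, reflect,
    since -X is stable with skewness -beta, so F(x) = P(-X > |x|)."""
    if x > 0:
        return 1.0 - stable_right_tail(x, alpha, beta)
    return stable_right_tail(abs(x), alpha, -beta)

# Sanity check against the Cauchy law (alpha=1, beta=0), whose exact CDF
# is 0.5 + atan(x)/pi: both values should be ~3.18e-4 at x = -1000.
print(stable_tail_cdf(-1000.0, 1.0, 0.0), 0.5 + math.atan(-1000.0) / math.pi)
```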

ragibson commented 4 years ago

This should be clear from the levy.approximate figure above, but pylevy's left tail CDF computations are clearly incorrect.

>>> levy.levy(-1000, alpha=2.0, beta=0.0, cdf=True)
1.0
>>> levy.levy(-1000, alpha=1.0, beta=0.0, cdf=True)
1.0003183098861839
>>> levy.levy(-1000, alpha=0.5, beta=0.0, cdf=True)
1.0126156626101008

These should all be nearly zero (and definitely not larger than 1.0).
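For alpha = 2.0 the stable law is Gaussian, and for alpha = 1.0, beta = 0.0 it is the standard Cauchy, so at least the middle case has a closed form to check against with no special-function machinery:

```python
import math

def cauchy_cdf(x):
    """Closed-form CDF of the standard Cauchy law (alpha=1, beta=0 stable)."""
    return 0.5 + math.atan(x) / math.pi

print(cauchy_cdf(-1000.0))   # ~0.000318: tiny, as a left tail should be
```

Notably, pylevy's 1.0003183098861839 is exactly 1 plus this correct value, which looks consistent with a sign error in the tail-CDF formula.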

josemiotto commented 4 years ago

Hi! Don't be sorry ;) Many questions; I will proceed in order. The upper and lower limits are what they are; they are not incorrect. Those "extreme" discontinuities are on the order of 5e-3 in the plot you showed, and they are intended to be there. Not only that, I'd guess they are unavoidable if you want to use the tail approximation at all. If you don't want to use it, you are welcome to do a log-log plot of the Levy pdf in Mathematica to see why this is done.

Now, regarding the method:

  1. The Levy PDF is not analytically tractable, so I am not sure of what calculation could be done here.
  2. Because in the tails we always measure things in the log scale, but there should be no impact.
  3. That's not the case. The numerical integration is computed and then replaced by the approximation over an optimal range (and this is precomputed, so there is no cost for the user). The grid is nonlinear, with more density in the center of the distribution, so the number of points thrown away is low. Also, those points have lower quality.
  4. The approximation is (again) from Nolan. I will check the higher-than-one CDF values, which indeed should not happen.
ragibson commented 4 years ago

1 and 2) Ah, I assumed you would minimize the difference directly, so the calculation would likely just be to switch to the tail approximation as far from the mean as possible (assuming the numerical integration behaved well). I'm not sure if anything can be done with the log transformation.

I imagined you would splice the tail approximation onto the results from numerical integration to remove discontinuities and then perhaps rescale so that the CDF approaches 1 (or 0 to the left). I'm not sure if this would be good in practice and I guess there's not much that can be done if you just switch over to the tail approximation after some limit.
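Roughly what I had in mind, as a toy sketch with made-up stand-in functions (not a proposal against pylevy's actual internals): rescale the numerical piece so it meets the tail approximation at the switch-over point, which removes the jump there:

```python
import numpy as np

def spliced_cdf(x, cdf_numeric, cdf_tail, x0):
    """Numerical CDF below x0, tail approximation above; the numerical
    part is rescaled so the two pieces agree at the splice point x0."""
    scale = cdf_tail(x0) / cdf_numeric(x0)
    x = np.asarray(x, dtype=float)
    return np.where(x < x0, scale * cdf_numeric(x), cdf_tail(x))

# Stand-ins: a deliberately mis-scaled "numerical" Cauchy CDF and its
# exact right-tail approximation 1 - 1/(pi * x).
num = lambda x: 0.999 * (0.5 + np.arctan(x) / np.pi)
tail = lambda x: 1.0 - 1.0 / (np.pi * np.asarray(x, dtype=float))

f = spliced_cdf([19.9, 20.1], num, tail, x0=20.0)
print(f)   # the two values are nearly equal: no visible jump at x0
```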

3) Let me clarify by means of an example. E.g. in a _get_closest_approx(alpha=1.0, beta=0.0, upper=False) call,

ragibson commented 4 years ago

You may be right that pylevy only really gives results accurate to two decimal places or 5e-3 in general.

I just found that levy.levy(-0.97, alpha=0.5, beta=1.0) is around -0.00111 (negative!), even though it should be around 4.4e-6.

It seems that this instance may be cubic interpolation behaving poorly.
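For illustration, a toy example unrelated to pylevy's actual grid: the unique cubic through strictly positive samples can dip below zero between the nodes, so negative interpolated values do not require negative grid values:

```python
import numpy as np

# Four strictly positive samples of a sharply varying function...
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.0, 1e-3, 1e-3, 1.0])

# ...and the unique cubic through them (4 points, degree 3, so this is
# exact interpolation, not a least-squares fit).
coeffs = np.polyfit(xs, ys, 3)
print(np.polyval(coeffs, 1.5))   # about -0.124: negative between the nodes
```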

josemiotto commented 4 years ago

Hi.

I checked your point 3. Indeed, most of the points in the grid are dropped. The reason for this is "historical": when you look at the integrated curve, there is a range beyond which its behaviour is very bad. The [10, 500] range is a hand-waved range in which the limit can be found and the curve is not yet noisy. However, np.isnan(np.log(z[mask])).all() is not True; why would it be? It is the log of an analytic function.

With respect to the negative values: I checked the minimum of the pdf on the grid points (that is, the ones computed by the integration). The lowest value found is -5.44e-8, at the grid point x=0.99068808, alpha=0.5, beta=-1.0. I guess you are right that the interpolation amplifies this, giving a value larger in magnitude (your -1.1e-3). I don't see any good solution other than flooring the values to 0, since the negative values come from the integration, not the interpolation.
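If flooring is the route taken, it is a one-liner in generic numpy (a sketch, not pylevy's internals):

```python
import numpy as np

def floored(pdf_values):
    """Clamp small negative artifacts from integration/interpolation to 0."""
    return np.maximum(np.asarray(pdf_values, dtype=float), 0.0)

# The middle (negative) value is clamped to 0; positive values pass through.
print(floored([2.0e-3, -1.1e-3, 4.4e-6]))
```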

With respect to the larger-than-1 CDF numbers, I found the source of the error, which, as you pointed out, is in the _approximate function; it is rather obvious. It happens only when we look at negative x, which is why only that side is affected. I fixed it by applying different formulas for positive and negative x. Now it works as expected; I will push a commit soon, since I have to rerun the limits calculation.

ragibson commented 4 years ago

Great, thanks for looking into this!

ragibson commented 4 years ago

One more thing that I noticed very recently: I think your integrand for alpha = 1.0 uses a different parameterization than alpha != 1.0. See the figures below.

[image: pylevy_alpha1_issue_pdf] [image: pylevy_alpha1_issue_cdf]

josemiotto commented 4 years ago

Indeed, I also noticed. I will check what's going on there


ragibson commented 4 years ago

Ah, I missed your comment about the all-NaN issue. If I add

print(f"all nan? {np.isnan(np.log(z[mask])).all()}")

into the function I get

>>> from levy import _get_closest_approx
>>> _get_closest_approx(alpha=1.0, beta=0.0, upper=False)
all nan? True
-500.0

But this was due to the incorrect approximation calculation. Presumably, you've already fixed this issue.

ragibson commented 4 years ago

Also, it is certainly the interpolation, not the integration, that causes the "larger" negative values.

The integration is fairly well behaved in this region. See below for the example with alpha=0.5, beta=1.0.

[image: pylevy_interpolation_issue]

ragibson commented 4 years ago

However, I agree that there is not much that can be done without reworking the grid or interpolation here.