Closed: ragibson closed this issue 4 years ago
This should be clear from the levy.approximate
figure above, but pylevy's left tail CDF computations are clearly incorrect.
>>> levy.levy(-1000, alpha=2.0, beta=0.0, cdf=True)
1.0
>>> levy.levy(-1000, alpha=1.0, beta=0.0, cdf=True)
1.0003183098861839
>>> levy.levy(-1000, alpha=0.5, beta=0.0, cdf=True)
1.0126156626101008
These should all be nearly zero (and definitely not larger than 1.0).
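For reference (this uses only the closed-form Cauchy CDF, not pylevy): the alpha=1.0, beta=0.0 case above is the Cauchy distribution, with CDF F(x) = 1/2 + arctan(x)/pi. Its left-tail value at x = -1000 is roughly 1/(1000*pi), and pylevy's 1.0003183098861839 matches 1 + 1/(1000*pi) to about 15 significant digits, which suggests a sign or branch error in the tail formula rather than mere numerical noise:

```python
import math

# alpha=1.0, beta=0.0 is the Cauchy distribution: F(x) = 1/2 + arctan(x)/pi
x = -1000.0
cauchy_cdf = 0.5 + math.atan(x) / math.pi
print(cauchy_cdf)                    # ~3.1831e-04, essentially zero
print(1.0 / (1000.0 * math.pi))      # leading tail term, also ~3.1831e-04
# pylevy's answer, 1.0003183098861839, agrees with 1 + 1/(1000*pi) to
# roughly machine precision, consistent with a wrong-sign tail term.
```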
Hi! Don't be sorry ;) Many questions. I will proceed in order. The upper and lower limits are what they are; they are not incorrect. Those "extreme" discontinuities are on the order of 5e-3 in the plot you showed, and they are intended to be there; not only that, I guess they are unavoidable if you want to use the tail approximation. If you don't want to use it, you are welcome to do a loglog plot of the levy in Mathematica to see why this is done.
Now, regarding the method:
1 and 2) Ah, I assumed you would minimize the difference directly, so the calculation would likely just be to switch to the tail approximation as far from the mean as possible (assuming the numerical integration behaved well). I'm not sure if anything can be done with the log transformation.
I imagined you would splice the tail approximation onto the results from numerical integration to remove discontinuities and then perhaps rescale so that the CDF approaches 1 (or 0 to the left). I'm not sure if this would be good in practice and I guess there's not much that can be done if you just switch over to the tail approximation after some limit.
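The splicing idea above could look something like the following sketch (a hypothetical helper, not pylevy's actual code): shift the tail approximation so it meets the numerically integrated CDF continuously at the switch point:

```python
import numpy as np

def spliced_cdf(x, cdf_numeric, cdf_tail, x_switch):
    """Use the numeric CDF in the body and the tail approximation beyond
    x_switch, with the tail shifted so the two pieces meet without a jump."""
    offset = cdf_numeric(x_switch) - cdf_tail(x_switch)
    x = np.asarray(x, dtype=float)
    return np.where(x <= x_switch, cdf_tail(x) + offset, cdf_numeric(x))

# Toy check with the Cauchy CDF and its leading left-tail term 1/(pi*|x|):
exact = lambda x: 0.5 + np.arctan(x) / np.pi
tail = lambda x: 1.0 / (np.pi * np.abs(x))
vals = spliced_cdf([-50.0, -10.0, 5.0], exact, tail, x_switch=-10.0)
```

This removes the jump at the switch point but, as noted, the spliced left piece no longer tends exactly to 0 as x -> -inf, so a rescale (or simply accepting the tail approximation as-is) is still a judgment call.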
3) Let me clarify by means of an example. In a _get_closest_approx(alpha=1.0, beta=0.0, upper=False) call, n=100000, x1=-9950.0, x2=50.0, li1=-500, li2=-10, and dx=0.1. You build x = np.linspace(x1, x2, num=n + 1, endpoint=True), an array of length 100001, and run _int_levy and _approximate on this array. Then mask = (li1 < x) & (x < li2) restricts to a domain of [-500, -10], but the original array range is [-9950, 50]. This drops 95% of the computations, no?
Also, np.isnan(np.log(z[mask])).all() is True, which is what I meant by the "all-NaN issue". I think this is related to the cdf calculation (_approximate returning values larger than 1.0 causes np.log to return nan).
You may be right that pylevy only really gives results accurate to about two decimal places, or 5e-3, in general. I just found that levy.levy(-0.97, alpha=0.5, beta=1.0) is around -0.00111 (negative!), even though it should be around 4.4e-6. It seems that this instance may be cubic interpolation behaving poorly.
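As a self-contained illustration of that failure mode (a toy Catmull-Rom cubic, not pylevy's actual interpolator): cubic interpolation of nonnegative samples with a sharp peak can undershoot below zero on the flat segment next to the peak:

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Cubic Hermite segment from p1 to p2 with Catmull-Rom tangents."""
    m1, m2 = (p2 - p0) / 2.0, (p3 - p1) / 2.0
    return ((2 * t**3 - 3 * t**2 + 1) * p1 + (t**3 - 2 * t**2 + t) * m1
            + (-2 * t**3 + 3 * t**2) * p2 + (t**3 - t**2) * m2)

# Nonnegative samples with a sharp peak, loosely like a pdf near its mode.
y = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
# Interpolate the flat segment just to the right of the peak (y[3] to y[4]).
vals = [catmull_rom(y[2], y[3], y[4], y[5], t / 100.0) for t in range(101)]
print(min(vals))  # about -0.074, even though every sample is >= 0
```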
Hi.
I checked your point 3. Indeed, most of the points in the grid are dropped. The reason for this is "historical": when you look at the integrated curve, there is a range over which its behaviour is very bad. The [10, 500] range is a hand-waved range within which the limit can be found and the curve is not yet noisy. However, np.isnan(np.log(z[mask])).all() is not True; why would it be? It is the log of an analytic function.
With respect to the negative values: I have checked the minimum of the pdf on the grid points (i.e., the ones computed by the integration). The lowest value found is -5.44e-8, at the grid point x=0.99068808, alpha=0.5, beta=-1.0. I guess you are right that the interpolation amplifies this, giving a larger number in magnitude (your -1.1e-3). I don't see any good solution to this, other than flooring the values to 0, since the negative values come from the integration, not the interpolation.
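The flooring mentioned here is a one-liner with numpy; a sketch, not a committed pylevy change:

```python
import numpy as np

# Grid pdf values with a tiny negative entry from integration noise,
# e.g. the -5.44e-8 found at x=0.99068808, alpha=0.5, beta=-1.0.
pdf_grid = np.array([1.2e-5, -5.44e-8, 3.0e-7])
pdf_floored = np.clip(pdf_grid, 0.0, None)  # floor negatives to exactly 0
```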
With respect to the larger-than-1 cdf numbers, I found the source of the error, which, as you pointed out, is in the _approximate function. It happens only for negative x, for that reason. I fixed it by applying different formulas for positive and negative x. It now works as expected; I will push a commit soon, since I have to rerun the limits calculation.
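A sign-aware left/right split could look like the following sketch. It uses the standard first-order stable tail asymptotics with the reflection F(x; alpha, beta) = 1 - F(-x; alpha, -beta); the constant follows textbook conventions and may need adapting to pylevy's parameterization, so treat this as an illustration, not the committed fix:

```python
import math

def tail_cdf(x, alpha, beta):
    """First-order tail asymptotic: P(X > x) ~ c * (1 + beta) * x**(-alpha)
    as x -> +inf, with c = sin(pi*alpha/2) * gamma(alpha) / pi, and the
    reflection (x -> -x, beta -> -beta) for the left tail."""
    c = math.sin(math.pi * alpha / 2.0) * math.gamma(alpha) / math.pi
    if x > 0:
        return 1.0 - c * (1.0 + beta) * x ** (-alpha)
    return c * (1.0 - beta) * (-x) ** (-alpha)

# Cauchy check: the exact left tail at x = -1000 is ~1/(1000*pi)
print(tail_cdf(-1000.0, 1.0, 0.0))  # ~3.183e-04, not ~1.0003
```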
Great, thanks for looking into this!
One more thing that I noticed very recently: I think your integrand for alpha = 1.0 uses a different parameterization than alpha != 1.0. See the figures below.
Indeed, I also noticed. I will check what's going on there
On Wed, 12 Aug 2020, 14:39 Ryan Gibson, notifications@github.com wrote:
> One more thing that I noticed very recently: I think your integrand for alpha = 1.0 uses a different parameterization than alpha != 1.0. See the figures below.
> [image: pylevy_alpha1_issue_pdf] https://user-images.githubusercontent.com/14023456/90015728-fe5d4b00-dc76-11ea-8b2d-acb04f2f715a.png
> [image: pylevy_alpha1_issue_cdf] https://user-images.githubusercontent.com/14023456/90015732-00270e80-dc77-11ea-9ff2-7cdba4e69ecd.png
Ah, I missed your comment about the all-NaN issue. If I add
print(f"all nan? {np.isnan(np.log(z[mask])).all()}")
into the function I get
>>> from levy import _get_closest_approx
>>> _get_closest_approx(alpha=1.0, beta=0.0, upper=False)
all nan? True
-500.0
But this was due to the incorrect approximation calculation. Presumably, you've already fixed this issue.
Also, it is certainly the interpolation, not the integration, that causes the "larger" negative values.
The integration is fairly well behaved on this region. See below for the example with alpha=0.5, beta=1.0.
However, I agree that there is not much that can be done without reworking the grid or interpolation here.
Sorry for opening up so many issues and for the length of this issue in particular. I was trying to pin down why pylevy's results have extreme discontinuities (the pdf should be infinitely smooth). See below for an example.
When computing your limit data files, you call _get_closest_approx, seemingly to find where the analytic tail approximation matches the numerical integration results most closely. However, this seems to compute the resulting lower_limit and upper_limit arrays incorrectly.
For example, if you call e.g. _get_closest_approx(alpha=1.0, beta=0.0, upper=False), you'll note that https://github.com/josemiotto/pylevy/blob/64c525f273d00d89cbbe531a6557b17b74d18f88/levy/__init__.py#L390-L402 fails because z[mask] is all NaN, so np.argmin returns 0.
1) Couldn't you compute this analytically? E.g. when x > 0, the pdf's asymptotic behavior is monotonically decreasing -- it must approach the true pdf from above.
2) Your minimization of (np.log(z[mask]) - np.log(y[mask])) ** 2.0 is essentially just finding where (z[mask] / y[mask]) is close to 1.0, because log is monotonic. That said, why is this minimizing the difference of the logs rather than the difference of the quantities themselves?
3) Your function computes n = 100000 approximations and integrations, but then immediately throws away 95% of these results to trim the domain to ~4900 points in [10, 500] or [-500, -10]. Why?
4) You seem to be assuming that your CDF tail approximation always approaches the true CDF from below in computing 1.0 - _approximate(x, alpha, beta, cdf=True), but this is often not the case. See the figure below.
This is where your all-NaN issue seems to come from (you'll be passing negative values into np.log). I think this error arises from https://github.com/josemiotto/pylevy/blob/64c525f273d00d89cbbe531a6557b17b74d18f88/levy/__init__.py#L339-L347, which I assume is meant to be computing the asymptotic behavior of the PDF, where you've factored out gamma(alpha+1) = alpha*gamma(alpha) in the else clause.
In the cdf=True branch, how was this approximation computed? Is this in the literature somewhere? At a glance, naive (asymptotic) integration of the PDF here seems to suggest a different formula.
It seems I misread the code here, but I still think you are missing a sign(x) or abs(x) in the CDF tail approximation.