josemiotto / pylevy

Levy distributions for Python
GNU General Public License v3.0

Modeling asymmetrical distributions when alpha approaches 1.0 and improving the tail approximation with the Pareto distribution #11

Closed Nitram-GGA closed 4 years ago

Nitram-GGA commented 4 years ago

Hi, I wrote my own stable distribution package because scipy's was not working (around 2017). However, it is written entirely in Python, which makes the integration process quite slow. I was very happy when I saw that someone had written faster code for Python, so thank you for making it available.

I want to start using it in my applications; however, my first test did not produce good results for certain ranges. Could you tell me whether I am doing something wrong (maybe something to do with the parameterizations), or whether this is an issue in your code? (Sorry, I did not check the code myself.)

Below I put the results of computing the pdf with X = [-1, 0, 1]. I include the results you would get using the S parameterization (option 1) in Nolan's stable.exe program (http://fs2.american.edu/jpnolan/www/stable/stable.html). The results for different values of alpha and beta are shown as explained below:

alpha beta
[' ', ' ', ' ']  # results of stable.exe
[' ', ' ', ' ']  # my code - stblpdf()
[' ', ' ', ' ']  # scipy's new code - levy_stable.pdf()
[' ', ' ', ' ']  # your code - levypdf()

First, I define the functions:

import numpy as np  
# Scipy's new function
from scipy.stats import levy_stable 
# my code
from stblpdf import stblpdf 
# your code
from levy import levy, Parameters 
def levypdf(XX, alpha, beta, gamma, delta):
    # convert Nolan's parameterization '1' to pylevy's parameterization '0'
    alpha2, beta2, delta2, gamma2 = Parameters.convert([alpha, beta, delta, gamma], '1', '0')
    return levy(XX, alpha2, beta2, mu=delta2, sigma=gamma2, cdf=False)

# Printing results function
def makepdfs(alpha):
    X = np.array([-1,0,1])
    for ai in [alpha]:
        for bi in [0,0.5,1]:
            print ('{0: 5.2f} {1: 5.2f} '.format(ai,bi))
            print (['{:8.6f} '.format(ix) for ix in stblpdf(X,ai,bi,1,0)])
            print (['{:8.6f} '.format(ix) for ix in levy_stable.pdf(X,ai,bi,loc=0,scale=1)])
            print (['{:8.6f} '.format(ix) for ix in levypdf(X, ai,bi,1,0)])
            print ('')
    return

Now, a first example with alpha = 1.1 to verify that everything works fine (note that I manually added the rows of stable.exe outputs):

makepdfs(1.1)
#  1.10  0.00 
# ['0.170890 ', '0.307141 ', '0.170890 '] # stable.exe
# ['0.170890 ', '0.307141 ', '0.170890 ']
# ['0.170890 ', '0.307141 ', '0.170890 ']
# ['0.170890 ', '0.307141 ', '0.170890 ']
# 
#  1.10  0.50 
# ['0.076629 ', '0.042328 ', '0.025782 '] # stable.exe  
# ['0.076629 ', '0.042328 ', '0.025782 ']
# ['0.076629 ', '0.042328 ', '0.025782 ']
# ['0.076629 ', '0.042328 ', '0.025783 '] # here a slight difference
# 
#  1.10  1.00 
# ['0.022380 ', '0.016023 ', '0.011925 '] # stable.exe  
# ['0.022380 ', '0.016023 ', '0.011925 ']
# ['0.022380 ', '0.016023 ', '0.011925 ']
# ['0.022380 ', '0.016024 ', '0.011925 ']

Now a value closer to 1.0, to see how the results start to differ:

makepdfs(1.02)
#  1.02  0.00 
# ['0.161626 ', '0.315721 ', '0.161626 '] # stable.exe  
# ['0.161626 ', '0.315721 ', '0.161626 ']
# ['0.161626 ', '0.315721 ', '0.161626 ']
# ['0.161626 ', '0.315721 ', '0.161626 ']
# 
#  1.02  0.50 
# ['0.002201 ', '0.001928 ', '0.001702 '] # stable.exe  
# ['0.002201 ', '0.001928 ', '0.001702 ']
# ['0.000000 ', '0.001928 ', '0.000000 '] # scipy fails for x!=0
# ['0.002051 ', '0.001799 ', '0.001591 '] # your code starts to throw smaller values
# 
#  1.02  1.00 
# ['0.000698 ', '0.000653 ', '0.000613 '] # stable.exe  
# ['0.000698 ', '0.000653 ', '0.000613 ']
# ['0.000000 ', '0.000653 ', '0.000000 '] # scipy fails for x!=0
# ['0.000631 ', '0.000591 ', '0.000556 '] # your code starts to throw smaller values

A graph for alpha=1.02 and beta=0.5

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
XXX = np.linspace(-7,7,num = 1001,dtype=np.float64)
fig, ax = plt.subplots()
ax.plot(XXX, stblpdf(XXX,1.02,0.5,1,0,False),'-',color='k')
ax.plot(XXX, levypdf(XXX,1.02,0.5,1,0),'-',color='r')
plt.show()

[figure: pdf for alpha=1.02, beta=0.5; stblpdf (black) vs levypdf (red)]

As you can see, around x=-6 your code has a sudden drop, while mine has problems with Nolan's integrals around 0. I know it's better to round alpha = 1.02 to 1.0 to avoid this behavior, but then this happened for alpha = 1.0:

makepdfs(1.0)
#  1.00  0.00 
# ['0.159155 ', '0.318310 ', '0.159155 '] # stable.exe  
# ['0.159155 ', '0.318310 ', '0.159155 ']
# ['0.159155 ', '0.318310 ', '0.159155 ']
# ['0.159155 ', '0.318310 ', '0.159155 ']
# 
#  1.00  0.50 
# ['0.179278 ', '0.292520 ', '0.159936 '] # stable.exe  
# ['0.179278 ', '0.292520 ', '0.159936 ']
# ['0.179278 ', '0.292520 ', '0.159936 ']
# ['0.000000 ', '0.000000 ', '0.000000 '] # your code throws all zeros in this range
# 
#  1.00  1.00 
# ['0.221762 ', '0.262240 ', '0.163531 '] # stable.exe  
# ['0.221762 ', '0.262240 ', '0.163531 '] 
# ['0.221762 ', '0.262240 ', '0.163531 ']
# ['0.000000 ', '0.000000 ', '0.000000 '] # your code throws all zeros in this range

In conclusion, I will use your code for applications with alpha>=1.05 because it is about 1000 times faster in some cases (for which I cannot thank you enough). For values 1.0<=alpha<1.05 and beta!=0, the discrepancies with the outputs of stable.exe become more noticeable, and such values cannot simply be rounded to alpha=1.0. Since I have not seen your code, I am not sure whether this is a parameterization problem or some other issue. It would be good if you could take a look at it.

Also, I want to ask: I read in the documentation that "It operates by interpolating values from a table, as direct computation of these distributions requires a lengthy numerical integration. This interpolation scheme allows fast fitting of data by Maximum Likelihood". Does this mean that it only works for a certain closed range of X?

I have not uploaded my code anywhere because it is just a language conversion, with minor adjustments, of the code written by Mark Veillette (http://math.bu.edu/people/mveillet/html/alphastablepub.html). However, I am attaching it in case you want to take a look. stblpdf.py.txt

josemiotto commented 4 years ago

Hi! You are welcome, and thanks for this analysis, since I never compared it with Nolan. The results you sent me are similar to what I got at the time when comparing with Mathematica.

In essence, the integration is unstable in a certain range of x. I avoid that because I don't integrate. First, the values are interpolated: if you want x=2.01, it will take the values at x=2 and x=2.05 (for example; it depends on the grid size) and interpolate. The same goes for alpha and beta. You can create a finer grid with the command _make_dist_data_file if you need to; just change the array size.
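
For illustration, a minimal 1-D sketch of the table-plus-interpolation idea (a stand-in only: pylevy actually tabulates over x, alpha and beta jointly, with x mapped through tan):

import numpy as np

grid_x = np.linspace(-10.0, 10.0, 401)           # hypothetical tabulation grid
grid_pdf = 1.0 / (np.pi * (1.0 + grid_x ** 2))   # stand-in values (Cauchy pdf)

x_query = 2.01
print(np.interp(x_query, grid_x, grid_pdf))      # linear interpolation between neighbours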

Second, I use the power-law approximation in the range x in (10, 500). This is analytical, and I look for the place where the difference between the integrated value and the approximation is minimal, which I call the limit; it is computed in _make_limit_data_file. That is the jump you see.
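
For reference, a sketch of the power-law (Pareto) tail form; this is the same expression that appears later in this thread, though the code's internal bookkeeping may differ:

import numpy as np
from scipy.special import gamma

def tail_pdf(x, alpha, beta):
    # Pareto tail of a stable pdf for large x > 0
    c_alpha = gamma(alpha) * np.sin(alpha * np.pi / 2) / np.pi
    return alpha * (1 + beta) * c_alpha / x ** (1 + alpha)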

So yes, these are conscious limitations, because the goal of the package is to speed up the maximum likelihood minimization, for which the jump has little consequence.

I think that to reduce the jump you would have to make the grid finer, which implies slowing down the algorithm. If you do that and measure the times, that would be great; it is something I haven't checked.

Let me know, and thanks for the msg, José


Nitram-GGA commented 4 years ago

Thank you for your quick reply. I still have not looked at the code, but I get the idea. I will try to time it with a finer grid whenever I find the time. P.S. If I use your code and write a paper, is there a reference for it? Thanks

josemiotto commented 4 years ago

If you want to cite it, you can do so as: J. M. Miotto, pylevy, http://dx.doi.org/10.5281/zenodo.53787 (2016).

For the new grid, this should work:

import pylevy

# (200, 76, 101) is the default, which you should change. The 200 is the number
# of iterations for x (actually in Fourier space), and the grid spacing for
# alpha and beta is 0.02 (alpha goes from 0.5 to 2, and beta from -1 to 1).
pylevy.size = (200, 76, 101)

pylevy._make_dist_data_file()   # creates the grid (takes long)
pylevy._make_limit_data_file()  # creates the limits between the interpolation and the approximation
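
If you do measure the times with the finer grid, a minimal timing sketch could look like this (assuming the module is importable as levy, as in the snippets above):

import timeit
import numpy as np
import levy

x = np.random.standard_cauchy(10000)  # heavy-tailed test data
t = timeit.timeit(lambda: levy.levy(x, 1.5, 0.0, cdf=False), number=20)
print('{:.2f} ms per call'.format(t / 20 * 1e3))
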
josemiotto commented 4 years ago

You can also experiment with the _get_closest_approx function, which is the one that calculates the limit. It has a couple of parameters that were chosen for convenience but not optimized. If you find a better way to decide the limit, please let me know.

Nitram-GGA commented 4 years ago

I finally read the code. Making the grid finer for x would make it more precise, but it does not solve the main issues, which are the ones that now appear in the title of this issue (sorry, I changed it). Also, working with tan(x) is quite clever and allows you to keep the grid as is and still get accurate enough results.

So, the first issue is the results you get when α = 1.0 and β ≠ 0. To illustrate it, I made the following graphs for different combinations of α and β. The dotted black lines are computed with my Python code (which, as I said earlier, is just an adjustment of Veillette's Matlab code, which in turn is based on Nolan's 1997 paper). The green lines are the pdf computed with pylevy's _calculate_levy() function, and the dashed red line is the absolute value of the difference. Note that the vertical axis is logarithmic. I could have just explained the issue, but the graphs may help document how well _calculate_levy() works.

For β = 0.0:

[figure: pdf comparison and absolute difference for β = 0.0]

For β = 0.5:

[figure: pdf comparison and absolute difference for β = 0.5]

For β = 1.0:

[figure: pdf comparison and absolute difference for β = 1.0]

For α ≠ 1.0, the dashed red line is almost always smaller than 1e-4. I did not check the exceptions (α = 0.3), but they happen at a few specific values of x and are always smaller than 0.01 (most likely a difference in the tolerance used in the computation of the integral).

From these results, it is evident that _calculate_levy() does not support α = 1.0 and β ≠ 0. It actually always returns the pdf of the Cauchy distribution (α = 1.0 and β = 0). On the bright side, note how well it performs when α is as small as 0.3. You could make your range of applicability wider.
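
(As a sanity check, the Cauchy pdf reproduces exactly the α = 1.0 rows in my first post, f(0) = 1/π ≈ 0.318310 and f(±1) = 1/(2π) ≈ 0.159155:)

import numpy as np

# standard Cauchy pdf, i.e. the stable pdf for alpha = 1, beta = 0
cauchy_pdf = lambda x: 1.0 / (np.pi * (1.0 + x ** 2))
print(cauchy_pdf(np.array([-1.0, 0.0, 1.0])))  # -> [0.15915494 0.31830989 0.15915494]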

To solve the first issue, I propose computing the pdf for α = 1.0 and β ≠ 0 by interpolation, making your grid of alphas finer. In other words, define alphas as:

# 0.295, 0.305, ..., 1.995 in steps of 0.01, plus 2.0; alpha = 1.0 is deliberately absent
alphas = np.linspace(0.3,2.0,num=171)-0.005
alphas = np.append(alphas,2.0)
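
Continuing from the snippet above, a quick check that α = 1.0 is absent and straddled by 0.995 and 1.005:

print(np.any(np.isclose(alphas, 1.0)))  # False: alpha = 1.0 is not on the grid
print(alphas[70], alphas[71])           # ~0.995 and ~1.005 straddle it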

This means that your grid would not have values for α = 1.0, so it remains to verify how well the interpolation between α = 0.995 and 1.005 approximates the actual values. Take a look at the short code below. To get the values of the pdf for α = 1.0, I am not using the cubic interpolation used in pylevy, just a simple linear interpolation between the pdfs of two alphas close to 1. Then I compare the result with Nolan's integrals, repeating the same exercise for several values of β.

    import numpy as np
    import matplotlib.pyplot as plt
    from levy import _calculate_levy
    from stblpdf import stblpdf

    betas = [0.0, 0.02, 0.5, 0.75, 1.0]
    XXX = np.arange(-15.05,15.05,0.1)
    fig, ax = plt.subplots(1,5,sharey=True,figsize=(12,2.5))
    for ib, beta in enumerate(betas):
        p1 = np.array([_calculate_levy(t,1.003,beta, False) for t in XXX])
        p2 = np.array([_calculate_levy(t,0.997,beta, False) for t in XXX])
        pInt = (p1+p2)/2
        p3 = stblpdf(XXX,1.0,beta,1.0,0.0,param=0)
        ax[ib].plot(XXX, abs(p3-pInt),'--',color='r',lw=1.2)
        ax[ib].set_yscale('log')
        ax[ib].set_ylim(1e-12,0.01)
        ax[ib].text(0.98,0.98,'beta = {:4.2f}'.format(beta),
                    ha='right',va='top',transform=ax[ib].transAxes)
    plt.show()

[figure: interpolation error for the α = 1.0 pdf, for each β]

Even though there is a peak in the graph of β = 0.02 at about x = -4, it is smaller than 0.001. The peak gets smaller as β approaches 1.0, which means it will get larger as β approaches 0.0. However, pylevy interpolates between β = 0.00 and 0.02, so the peak (i.e., the maximum approximation error) would not get larger.

I did the same analysis for the CDF, with similar results. Below is the graph of the approximation errors (always smaller than 1e-5).

[figure: interpolation error for the α = 1.0 cdf, for each β]

For the second issue (tail approximation), I will post something later, and maybe open a separate issue.

josemiotto commented 4 years ago

Sorry, I didn't reply to the part about alpha close to 1. The reality is that alpha=1 is a special case in the integration, and that is why it fails. This is a known bug of the code, which exists because of an error of mine: the original code had the grid slightly displaced from its actual location, allowing the interpolation you are suggesting, and I changed that without realizing. So we now have two options: one is to move the default grid slightly, which would be the easiest; the second is to calculate the integral as a special case (there is a different, simpler formula for alpha=1 that I'd need to implement).

josemiotto commented 4 years ago

By the way, the analysis is great! The ones I did were not so systematic.

Nitram-GGA commented 4 years ago

Exactly. My first suggestion was going to be to include the integrals for the special case α = 1. However, I wanted to verify how much accuracy you would risk by interpolating between 0.995 and 1.005 (by the way, interpolating between 0.99 and 1.01 yields errors greater than 0.01). Because the approximation error using the grid of alphas I suggested above is always smaller than 0.001, I do believe it's a quick workaround that fixes your bug.
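
A rough sketch of that workaround as a helper (hypothetical, not part of pylevy; it calls the package's internal _calculate_levy, used throughout this thread):

from levy import _calculate_levy

def pdf_near_alpha_one(x, beta, eps=0.005, cdf=False):
    # linear interpolation midway between alpha = 1 - eps and alpha = 1 + eps
    lo = _calculate_levy(x, 1.0 - eps, beta, cdf)
    hi = _calculate_levy(x, 1.0 + eps, beta, cdf)
    return 0.5 * (lo + hi)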

Now, for the sake of completeness of the _calculate_levy() function, it would probably be best to include the special case.

In any case, don't forget to check the variables convert_to_par0 and convert_from_par0. I haven't looked into them yet, but I think you need to add an if-statement in the functions _psi() and _phi().

Nitram-GGA commented 4 years ago

With regard to the jumps: there is a way to improve how you currently define the limits.

With pylevy.size = (200, 76, 101), I re-ran pylevy._make_dist_data_file() and pylevy._make_limit_data_file(). It always gives me limit = 10. I plotted the absolute error you get when comparing the _calculate_levy() function and the levy(cdf=False) main function. Below are some examples for several values of α and β = -0.5.

[figure: absolute error between _calculate_levy() and levy(cdf=False) for several α, β = -0.5]

I think the problem is the logarithms in

def _get_closest_approx(alpha, beta):
    x0, x1, n = -50.0, 10000.0 - 50.0, 100000
    dx = (x1 - x0) / n
    x = np.linspace(x0, x1, num=n, endpoint=True)
    y = 1.0 - _int_levy(x, alpha, beta, cdf=True)
    z = 1.0 - _approximate(x, alpha, beta, cdf=True)
    mask = (10.0 < x) & (x < 500.0)
    return 10.0 + dx * np.argmin((np.log(z[mask]) - np.log(y[mask])) ** 2.0)

You are evaluating the square of log(z) - log(y). z and y are smaller than 1, which means the log values are negative; the closer z and y get to zero, the more their logarithms tend to -inf. Even if z and y become closer in absolute terms, the square of the difference of the logarithms can get pretty big. This means that as you go farther into the tail, the difference can get larger instead of smaller, as you were hoping.

There is a paper from 1998 in which they analyze the point at which the ratio of the stable density and the Pareto density is within 0.9 and 1.1. These points (values of x) are shown in Figure 2 on page 5. I was able to reproduce the pdf graph using your _calculate_levy() function with the code below:

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gamma  # needed by calpha below
from levy import _calculate_levy

alphas = np.linspace(0.5,2.0,num=151)-0.005
betas = np.array([-0.9,-0.6,-0.3,0.0,0.3,0.6,0.9])
colors = ['#FF0000','#0000FF','#00FF00','k','#00FF00','#0000FF','#FF0000']
# Initialize the output variable
limits = np.zeros((len(betas),len(alphas)),dtype=np.float128)
# Initialize the figure
fig, ax = plt.subplots(figsize=(3,5))
for ib,beta in enumerate(betas):
    for ia,alpha in enumerate(alphas):
        maxX = 300 if alpha<1.0 else 100
        newX = np.arange(0.1,maxX,0.1)
        # Stable density
        p1 = np.array([_calculate_levy(t, alpha, beta, False) for t in newX],dtype=np.float128)
        # Pareto density
        calpha = lambda a: (1/np.pi)*gamma(a)*np.sin(a*np.pi/2)
        paretopdf = lambda x: alpha*(1+beta)*calpha(alpha)/x**(1+alpha)
        p2 = np.array([paretopdf(x) for x in newX],dtype=np.float128)
        # Find the x at which the ratio of both densities is between 0.9 and 1.1
        ratio = abs((p1/p2)-1)
        limits[ib,ia] = np.max(newX[np.where(ratio>0.1)[0]])
        print ('alpha = {0:5.3f} beta = {1:5.1f} limit = {2:6.1f}'.format(alpha,beta,limits[ib,ia]))
    lss = '--' if beta<0 else '-'
    ax.plot(alphas, limits[ib], ls = lss, color=colors[ib], lw=1.0)
ax.set_ylim(0,50)
plt.show()

[figure: reproduced limit curves as a function of α for several β, matching Figure 2 of the paper]

I cut the y-axis at 50 to compare with the graph in the paper, but FYI, for α=0.495 and β=-0.9 the limit is 184.1. The merit of this way of finding the "limit" is that it uses the ratio instead of a difference. The same paper suggests a second way that uses a normalized difference, but it ends up producing the same graph as the ratio. You could use the above code to generate your limits file. The colors I used are intentional: solid lines correspond to positive betas and dashed lines to negative betas. This graph (i.e., these limits) is valid for the right tail of the distribution (x>0). For the left tail (x<0) the same limits can be used, except that the solid lines correspond to negative betas and the dashed lines to positive betas. Consequently, all pairs of solid and dashed lines cross at α=1, with the solid lines above the dashed lines for α>1, and vice versa.
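
A minimal sketch of that left/right swap via the reflection property of stable densities, f(x; α, β) = f(-x; α, -β) (the helper below is hypothetical, not part of pylevy):

import numpy as np

# limits[ib, ia] holds the right-tail limit for betas[ib] and alphas[ia],
# as computed in the code above; betas is symmetric around 0
def left_tail_limit(limits, betas, ib, ia):
    jb = int(np.argmin(np.abs(betas + betas[ib])))  # index of -beta
    return limits[jb, ia]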

I can try opening a pull request with these changes (1: a finer grid of alphas and 2: a new limits function). However, it would be easier if you make the changes, since you know all the tweaks in your code. Let me know what you think.

josemiotto commented 4 years ago

OK, I will take a deep dive into the issue of the limits. They are usually 10 because I set the limit to be between 10 and 500; that was hand-waving, and I should revise it. At the time it was a solution, because the range where the integration works well actually changes with alpha and beta. However, I don't understand your proposal: you are looking at the ratio between the densities, and I'm looking at the difference of the logarithms of the densities, which are the same thing! How you define the limit is a bit different, though, so I will think about it. There is also an underlying reason for using the difference of the logs: the idea of this code is to be optimized for the maximum likelihood procedure, which uses the logarithm of the density, so that is the quantity that should be closest. By the way, in that paper, when you look at x=10 you are usually close to 0 relative error, so it is not so bad. If anything, my limit procedure should pick some higher x, which it doesn't. I will check that as well.
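
As a minimal sketch of that point (assuming levy() imported as in the snippets above), maximum likelihood works with the log of the density, so it is log f(x) that the approximation should match:

import numpy as np
from levy import levy

def neg_log_likelihood(data, alpha, beta):
    # ML fitting minimizes -sum(log pdf), so errors in log f(x) matter
    # directly, including deep in the tails
    return -np.sum(np.log(levy(data, alpha, beta, cdf=False)))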

Nitram-GGA commented 4 years ago

which are the same thing!

Yep, sorry about that. I need more coffee.

Then, let's look at why the function _get_closest_approx(alpha, beta) is not working. Let me copy your function below:

def _get_closest_approx(alpha, beta):
    x0, x1, n = -50.0, 10000.0 - 50.0, 100000
    dx = (x1 - x0) / n
    x = np.linspace(x0, x1, num=n, endpoint=True)
    y = 1.0 - _int_levy(x, alpha, beta, cdf=True)
    z = 1.0 - _approximate(x, alpha, beta, cdf=True)
    mask = (10.0 < x) & (x < 500.0)
    return 10.0 + dx * np.argmin((np.log(z[mask]) - np.log(y[mask])) ** 2.0)

Now let's plot y and z as defined above. Additionally, let's plot the results of calculating the cdf with your _calculate_levy() function, calling it y_true. Let's also plot the grid of x with which the cdf grid/file was created (vertical gray lines), the bounds x=10 and x=500 (vertical orange lines), and finally the limit (vertical green line).

import numpy as np
import matplotlib.pyplot as plt
from levy import _int_levy, _approximate, _calculate_levy

def test_approx(alpha = 1.5,beta = 0):

    x0, x1, n = -50.0, 1000.0 - 50.0, 10000
    dx = (x1 - x0) / n
    x = np.linspace(x0, x1, num=n, endpoint=True)
    # cdf interpolated from grid
    y = 1.0 - _int_levy(x, alpha, beta, cdf=True)
    # cdf calculated with integrals
    y_true = 1.0 - np.array([_calculate_levy(t, alpha, beta, True) for t in x])
    # cdf of the Pareto distribution
    z = 1.0 - _approximate(x, alpha, beta, cdf=True)
    # Calculate limit between 10 and 500
    mask = (10.0 < x) & (x < 500.0)
    limit = 10.0 + dx * np.argmin((np.log(z[mask]) - np.log(y[mask])) ** 2.0)
    # Create figure
    fig, ax = plt.subplots(figsize=(12,5),facecolor='white')
    ax.plot(x,y_true,'-',color='#85c1e9',lw=4.0,zorder=5,label='w/ _calculate_levy()')
    ax.plot(x,y,'--',color='r',lw=2.0,zorder=10,label='Interpolated w/ grid')
    ax.plot(x,z,'--',color='k',lw=1.3,zorder=10,label='Approximated w/ Pareto cdf')
    # Tune the axes
    ax.set_xscale('log')
    ax.set_xlim(0.1,1000)
    ax.set_yscale('log')
    ax.set_ylim(0.00001,1)
    # Plot the grid of x used to do the grids of cdfs
    xgrid = np.tan(np.linspace(-np.pi / 2 * 0.999,np.pi / 2 * 0.999,num=200))
    for xg in xgrid[xgrid>0]:
        ax.axvline(x=xg,ls = '-',color='k',lw=0.5,alpha=0.6,zorder=0)
    # Range in which the limit is evaluated
    ax.axvline(x=10,ls='-',color='#dc7633',lw=2.0,zorder = 0)
    ax.axvline(x=500,ls='-',color='#dc7633',lw=2.0,zorder = 0)
    # The limit
    ax.axvline(x=limit,ls='-',color='g',lw=3.0,zorder = 30,label='Limit={:5.1f}'.format(limit))
    ax.legend(loc=1,facecolor='white',edgecolor='k')
    plt.show()
    return
# Run
test_approx(alpha = 1.5,beta = 0.0)

[figure: test_approx(alpha=1.5, beta=0.0); the Pareto cdf is far from the other curves, limit = 10]

Having defined xgrid as

xgrid = np.tan(np.linspace(-np.pi / 2 * 0.999,np.pi / 2 * 0.999,num=200))

the last 8 values are 8.89, 10.36, 12.41, 15.44, 20.44, 30.19, 57.66 and 636.62. As expected, the interpolated and true cdfs are almost equal at all the points where they cross the gray lines. As the spacing between the gray lines grows, the interpolation causes slight differences in between them. The odd one out, however, is the Pareto cdf: it does not look anything like the other two lines, and that is why you are getting a limit equal to 10; that is where the difference of logs is smallest.

I think you have a mistake in your _approximate() function. The line

        return 1.0 - values

should be

        return 1.0 - values*x

Running again with this correction:

test_approx(alpha = 1.5,beta = 0.0)

[figure: test_approx(alpha=1.5, beta=0.0) after the fix; limit = 52.1]

Problem solved. I printed the numbers to check that this is the point where the difference of logs is minimal, because I was skeptical about why it would not happen sooner, but 52.1 is the point. Therefore, all x<52.1 will be computed with the interpolated grid and all x>52.1 with the Pareto cdf, as you were hoping in the first place. However, I saw that the difference of logs for 10<x<52.1 can be larger than at x=52.1; after all, that is how np.argmin works. Here you are comparing the interpolations (y) with the Pareto cdf (z); my proposal was to compare y_true and z. I don't fully understand why you are doing this, but if you say it helps with the maximum likelihood, then I guess it is OK.

Now let's try other, more challenging values of α and β:

test_approx(alpha = 0.9,beta = 0.6)

[figure: test_approx(alpha=0.9, beta=0.6); limit = 69.1]

The difference between y and y_true is more noticeable, but it's no problem, since the limit of 69.1 allows using the Pareto cdf for all x>69.1. So far, no problem.

What about negative betas?

test_approx(alpha = 0.9,beta = -0.6)

[figure: test_approx(alpha=0.9, beta=-0.6); the interpolated line is way off, limit = 238.3]

The limit is 238.3 and corresponds to the point where the difference of logs is smallest, but the interpolated line is way off: its values actually correspond to β=0.6. Therefore, I think there is another mistake, this time in your function _int_levy(). I changed the line

points[..., 2] = np.abs(beta)

to

points[..., 2] = beta

Run again

test_approx(alpha = 0.9,beta = -0.6)

[figure: test_approx(alpha=0.9, beta=-0.6) after the _int_levy() fix]

Apparently everything looks good now. However, as before, I am guessing that you used the absolute value of β for a reason. Please check that this change does not mess up anything else in the code.

I re-ran _make_limit_data_file() after these two changes and plotted again the graph comparing the results of the main function levy(cdf=False) and _calculate_levy().

[figure: error between levy(cdf=False) and _calculate_levy() after both fixes]

The jumps disappeared, yay! Bear in mind that the central graph, corresponding to α=1.00, is still plotting the Cauchy distribution (β=0); I think that is the only remaining issue. Have you decided which solution you will apply? (1: change the grid of alphas without risking accuracy, or 2: complete the _calculate_levy() function with the special-case α=1.0 equations)

josemiotto commented 4 years ago

Very, very impressive! Indeed, the error is in _approximate(). Now that I see it, it is obvious: the cdf should have an exponent one lower than the pdf, and it had the same. Multiplying by x solves the issue, thanks. I will check when that code was introduced.
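
For reference, the exponent bookkeeping behind the fix (a sketch; assuming the values term carries the x^-(α+1) Pareto pdf tail):

f(x) ≈ α C x^-(α+1)  =>  1 - F(x) = ∫_x^∞ f(t) dt = C x^-α

The survival function's exponent is one lower than the pdf's, so the cdf branch needs the extra factor of x.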

With regard to alpha=1, I will introduce the special case.

Thanks again


josemiotto commented 4 years ago

Also, you are right about using y_true in the limit calculation, but the interpolation should work such that if the point is on the grid, it gets y_true exactly. It's done like this because I was using the interpolation for x.


Nitram-GGA commented 4 years ago

Don't forget to change abs(β) to β in _int_levy().

I saw that you are using the same limit for both tails.

Another thing: in that paper they found that the cdf converges to the Pareto cdf faster than the pdf does. It would be worth checking that, in the cases where your limit comes out near 10, it holds for the left tail and for the pdf as well. Maybe it would be better to compute separate limits for the pdf and the cdf?

josemiotto commented 4 years ago

I will check!

josemiotto commented 4 years ago

When I switched to Python 3, I refactored the approximating function, and that created the error you reported. I never really used the package afterwards, so I didn't realize; you are the first user who did!