telmo-correa / all-of-statistics

Self-study on Larry Wasserman's "All of Statistics"

Issues with Exercise 9.6.2 #24

Open Pipe-Vash opened 2 years ago

Pipe-Vash commented 2 years ago

1- I don't think that the "fourth method" the exercise asks for is the parametric bootstrap, but rather the jackknife confidence interval that is explained in the appendix of the chapter.
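For reference, here is a sketch of that jackknife interval (function name and seed are my own; this assumes the usual leave-one-out jackknife variance estimate with a normal-based interval, as in the chapter's appendix):

```python
import numpy as np
from scipy.stats import norm

def skew(x):
    # Plug-in skewness: third central moment over std^3
    return np.mean((x - np.mean(x))**3) / np.std(x)**3

def jackknife_ci(x, stat=skew, alpha=0.05):
    """Normal-based CI using the jackknife (leave-one-out) standard error."""
    n = len(x)
    # Leave-one-out replications of the statistic
    theta_i = np.array([stat(np.delete(x, i)) for i in range(n)])
    # Jackknife variance estimate: (n-1)/n * sum((theta_i - mean)^2)
    se_jack = np.sqrt((n - 1) / n * np.sum((theta_i - theta_i.mean())**2))
    z = norm.ppf(1 - alpha / 2)
    theta_hat = stat(x)
    return theta_hat - z * se_jack, theta_hat + z * se_jack

rng = np.random.default_rng(0)
x = np.exp(rng.normal(size=50))
print(jackknife_ci(x))
```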

2- The second part of the exercise asks for "the true coverage", i.e. the percentage of simulations in which the confidence interval contains the true value of the statistic (in this case the skewness parameter). My way of solving this is the following (excuse me if my code is not very fluent):

import numpy as np
from scipy.stats import norm
from tqdm.notebook import tqdm
np.random.seed(1)

def skew(x):
    #Plug-in estimate of skewness: third central moment over std^3
    mean=np.mean(x)
    sd=np.std(x)
    return np.mean((x-mean)**3)/sd**3

B=10000
def T_boot(x):
    #Returns the B bootstrap replications of the skewness and their standard error
    t_boot=np.empty(B)
    n_x=len(x)
    for i in tqdm(range(B)):
        x_star=np.random.choice(x,n_x,replace=True)
        t_boot[i]=skew(x_star)
    return t_boot,np.std(t_boot)

#Creation of 1000 random vectors of n=50 datapoints
Y2=norm.rvs(loc=0, scale=1, size=(1000,50))
X2=np.exp(Y2)

#Run the bootstrap on each random vector to produce the confidence intervals
t_bootV=[T_boot(x) for x in X2]
boot_samples=[t for t,_ in t_bootV]             #bootstrap replications per vector
boot_ses=np.array([se for _,se in t_bootV])     #bootstrap standard errors
skews=np.array([skew(x) for x in X2])           #plug-in skewness estimates

#Normal 95% confidence intervals
z_95=norm.ppf(0.975)
NCI=np.column_stack((skews-z_95*boot_ses,skews+z_95*boot_ses))

#Percentile 95% confidence intervals
PerCI=np.array([(np.quantile(t,0.025),np.quantile(t,0.975)) for t in boot_samples])

#Pivotal 95% confidence intervals
PivCI=np.array([(2*th-np.quantile(t,0.975),2*th-np.quantile(t,0.025))
                for th,t in zip(skews,boot_samples)])

According to theory:

So the skewness parameter of $X=e^Y$ is $\theta=T(F)=\int (x-\mu)^{3}\,{\rm d}F(x)/\sigma_X^3=\frac{E\left[(X-E[X])^3\right]}{\left(E\left[(X-E[X])^2\right]\right)^{3/2}}=\frac{e^{3/2}(2-3e+e^3)}{e^{3/2}(e-1)^{3/2}}=\frac{2-3e+e^3}{(e-1)^{3/2}} \approx 6.18$. This is the value we expect to fall within our confidence intervals (not only for each method, but also for each simulated random vector). The "true coverage" estimation continues as follows:
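As a quick sanity check (a sketch; the variable names are my own), the closed form above agrees with the standard skewness formula for a $\text{LogNormal}(0,\sigma^2)$ distribution, $(e^{\sigma^2}+2)\sqrt{e^{\sigma^2}-1}$, evaluated at $\sigma^2=1$, since $e^3-3e+2=(e-1)^2(e+2)$:

```python
import numpy as np

e = np.e
# Closed form from the derivation above
theta = (2 - 3*e + e**3) / (e - 1)**1.5
# Standard lognormal skewness formula at sigma^2 = 1
theta_ref = (e + 2) * np.sqrt(e - 1)
print(theta, theta_ref)  # both ~6.1849
```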

#The true parameter:
skewness_f=(2-3*np.exp(1)+np.exp(1)**3)/(np.exp(1)-1)**(3/2)

#Coverage: fraction of intervals that contain the true parameter
#Comparison with the Normal 95% Confidence Intervals
cov_NCI=np.mean((NCI.T[0]<=skewness_f) & (skewness_f<=NCI.T[1]))

#Comparison with the Percentile 95% Confidence Intervals
cov_PerCI=np.mean((PerCI.T[0]<=skewness_f) & (skewness_f<=PerCI.T[1]))

#Comparison with the Pivotal 95% Confidence Intervals
cov_PivCI=np.mean((PivCI.T[0]<=skewness_f) & (skewness_f<=PivCI.T[1]))

print('The true coverage for the simulated Normal confidence intervals is: %.3f' % cov_NCI)
print('The true coverage for the simulated Percentile confidence intervals is: %.3f' % cov_PerCI)
print('The true coverage for the simulated Pivotal confidence intervals is: %.3f' % cov_PivCI)

The output of the simulation [attached image] shows that 50 datapoints are not enough for our bootstrap 95% confidence intervals to reach the expected coverage...

With $n=1000$ datapoints the coverage gets better [attached image].

With $n=100000$ datapoints the coverage gets closer to the expected 95% [attached image].

Assuming that my procedure is correct, I just need more data points (which is more computationally expensive, which is why I stopped reporting results here)...

I would appreciate some feedback to compare against.

GustavoMeza commented 1 year ago

I got roughly the same coverage. I came here thinking I had done something wrong.

gallego-posada commented 11 months ago

I agree with the comments above.

@Pipe-Vash If you want to verify that you get the correct asymptotic coverage without having to generate millions of samples ($N \approx 10^6$), you can sample $Y$ from a Gaussian with smaller variance $\sigma^2 < 1$. Then you check for coverage by testing whether the bootstrap intervals contain the true skewness of a general $\text{LogNormal}(\mu=0, \sigma^2)$.
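A small sketch of the true value needed for that check (the function name is my own; the formula is the standard skewness of $\text{LogNormal}(\mu=0,\sigma^2)$):

```python
import numpy as np

def lognormal_skewness(sigma2):
    """True skewness of LogNormal(mu=0, sigma^2): (e^{s2}+2) * sqrt(e^{s2}-1)."""
    es2 = np.exp(sigma2)
    return (es2 + 2) * np.sqrt(es2 - 1)

# sigma^2 = 1 recovers the value derived above (~6.18);
# a smaller sigma^2 gives a milder skew, so coverage should converge faster in n
print(lognormal_skewness(1.0))   # ~6.18
print(lognormal_skewness(0.25))
```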