paroussisc / stats

Repo to track the material I cover on Wednesdays
MIT License

Central Limit Theorem #5

Open paroussisc opened 5 years ago

paroussisc commented 5 years ago

An example of the CLT in R using Poisson random variables.

paroussisc commented 5 years ago

First, we look at the estimated means and confidence intervals for exponentially increasing sample sizes, using the following code:

library(ggplot2)

mu <- 24.32

# names of the files for the GIF
png(file = "clt%02d.png",
    width = 200,
    height = 200)

# create plots for exponentially increasing sample size
for (n in unique(as.integer(exp(seq(0, 8, 0.1)))))
{
    # simulate a Poisson sample of size n
    data <- data.frame(z = rpois(n, mu))
    sample_mu <- mean(data$z)
    sample_sd <- sd(data$z)
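
    # normal-approximation 95% confidence interval for the sample mean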
    ub <- sample_mu + qnorm(0.975) * sample_sd / sqrt(n)
    lb <- sample_mu + qnorm(0.025) * sample_sd / sqrt(n)

    # plot distribution of sample
    print(
        ggplot() + stat_density(data = data, aes(x = z), alpha = 0.4) +
            geom_vline(xintercept = sample_mu, color = "blue", aes(ymin = 0)) +
            geom_vline(xintercept = mu, color = "red") +
            geom_rect(
                aes(
                    xmin = lb,
                    xmax = ub,
                    ymin = 0,
                    ymax = Inf
                ),
                fill = "green",
                alpha = 0.2
            ) + annotate(
                "text",
                x = 15,
                y = 0.07,
                label = paste("N = ", n)
            ) + scale_x_continuous(limits = c(10, 40))
    )

}
dev.off()

system("convert -delay 50 *.png clt_1.gif")
file.remove(list.files(pattern = "\\.png$"))
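
Note that the system() call assumes ImageMagick's convert command is available on the path to stitch the PNG frames into the GIF.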
paroussisc commented 5 years ago

Which creates the following GIF:

[clt_1.gif]

The red vertical line is the true mean, the blue line is the sample mean, and the green band is the 95% confidence interval.

Of course, these are just single samples at each N, so the bounds can be quite sensitive to random variation at small sample sizes (the confidence intervals tend to narrow as N grows, but there are some outliers). Most of the intervals appear to contain the true mean, as one would expect, and the densities for the larger samples do look noticeably Gaussian.

It would be better to look at the distribution of the sample mean itself, so simulating many samples of size N is the next task.

paroussisc commented 5 years ago

In this example, each sample contains 1000 Poisson random variables, and we have 1000 such samples. We are interested in the distribution of the sample means and use the following code to investigate:

sample_groups <- 1000
n <- 1000
mu <- 24.32

data <- matrix(rpois(sample_groups * n, mu), sample_groups, n)
x_bar <- data.frame(x = rowMeans(data))

# mean of means should be close to mu for large n
mu_diff <- mean(x_bar$x) - mu

# the sd of the means should be sqrt(mu/n), since a Poisson's variance equals its mean
sd_diff <- sd(x_bar$x) - sqrt(mu/n)

# test for normality
shapiro.test(x_bar$x)
paroussisc commented 5 years ago

The mean and standard deviation of the sample means closely match what the CLT predicts. The Shapiro-Wilk test for normality does not lead us to believe the sample means are generated from a non-normal distribution. The test is sensitive to sample size, but we should be OK with 1000 sample means here.
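
For reference, a sketch of what "expected from the CLT" means here (the standard result, using the fact that a Poisson's variance equals its mean):

\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i, \quad X_i \overset{\text{iid}}{\sim} \operatorname{Poisson}(\mu)
\implies \bar{X}_n \mathrel{\dot\sim} \mathcal{N}\!\left(\mu, \tfrac{\mu}{n}\right) \text{ for large } n,
\text{ so } \operatorname{sd}(\bar{X}_n) \approx \sqrt{\mu/n}.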

paroussisc commented 5 years ago

We can simply loop over different sample sizes and output mu_diff and sd_diff from above, along with whether the normality test is significant; a sketch of such a loop is below. When this is run, the differences between the observed and expected means are small, especially for large sample sizes (as expected), and the same holds for the differences between the observed and expected standard deviations. The sample means almost always appear Gaussian, though the odd small sample can look non-Gaussian under the Shapiro-Wilk test, which isn't a cause for concern (not that we'd doubt the CLT anyway).
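
A minimal sketch of that loop, assuming a particular grid of sample sizes, a 5% threshold on the Shapiro-Wilk p-value, and the print format of the output below (none of which appear verbatim in the thread):

sample_groups <- 1000
mu <- 24.32

# loop over an exponentially increasing grid of sample sizes (assumed grid)
for (n in unique(as.integer(exp(seq(2, 8.9, 0.3))))) {
    # simulate sample_groups independent samples of n Poisson r.v.s each
    data <- matrix(rpois(sample_groups * n, mu), sample_groups, n)
    x_bar <- rowMeans(data)

    # compare the observed mean and sd of the sample means with the CLT prediction
    mu_diff <- mean(x_bar) - mu
    sd_diff <- sd(x_bar) - sqrt(mu / n)

    # Shapiro-Wilk test on the sample means, at an assumed 5% level
    is_normal <- shapiro.test(x_bar)$p.value > 0.05

    print(paste("For sample size:", n,
                "- mu diff:", round(mu_diff, 3),
                "- sd diff:", round(sd_diff, 3),
                "is Normal?", is_normal))
}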

paroussisc commented 5 years ago

As an example:

[1] "For sample size: 7 - mu diff: -0.066 - sd diff: 0.04 is Normal? TRUE"
[1] "For sample size: 9 - mu diff: -0.019 - sd diff: -0.004 is Normal? TRUE"
[1] "For sample size: 13 - mu diff: 0.049 - sd diff: 0.004 is Normal? TRUE"
[1] "For sample size: 18 - mu diff: 0.106 - sd diff: -0.044 is Normal? TRUE"
[1] "For sample size: 24 - mu diff: -0.008 - sd diff: -0.005 is Normal? TRUE"
[1] "For sample size: 33 - mu diff: 0.03 - sd diff: -0.01 is Normal? TRUE"
[1] "For sample size: 44 - mu diff: -0.032 - sd diff: 0.006 is Normal? TRUE"
[1] "For sample size: 60 - mu diff: -0.002 - sd diff: -0.002 is Normal? TRUE"
[1] "For sample size: 81 - mu diff: -0.016 - sd diff: 0.003 is Normal? TRUE"
[1] "For sample size: 109 - mu diff: 0.001 - sd diff: 0.004 is Normal? TRUE"
[1] "For sample size: 148 - mu diff: 0.011 - sd diff: -0.002 is Normal? TRUE"
[1] "For sample size: 200 - mu diff: -0.01 - sd diff: -0.008 is Normal? TRUE"
[1] "For sample size: 270 - mu diff: 0.005 - sd diff: 0.005 is Normal? TRUE"
[1] "For sample size: 365 - mu diff: 0.001 - sd diff: 0.001 is Normal? TRUE"
[1] "For sample size: 492 - mu diff: -0.006 - sd diff: -0.004 is Normal? TRUE"
[1] "For sample size: 665 - mu diff: -0.003 - sd diff: 0.007 is Normal? TRUE"
[1] "For sample size: 897 - mu diff: -0.001 - sd diff: -0.006 is Normal? TRUE"
[1] "For sample size: 1211 - mu diff: -0.004 - sd diff: -0.004 is Normal? TRUE"
[1] "For sample size: 1635 - mu diff: 0.009 - sd diff: -0.001 is Normal? TRUE"
[1] "For sample size: 2208 - mu diff: -0.001 - sd diff: 0.003 is Normal? TRUE"
[1] "For sample size: 2980 - mu diff: 0 - sd diff: -0.001 is Normal? TRUE"
[1] "For sample size: 4023 - mu diff: -0.001 - sd diff: -0.003 is Normal? TRUE"
[1] "For sample size: 5431 - mu diff: 0.001 - sd diff: 0.002 is Normal? TRUE"
[1] "For sample size: 7331 - mu diff: 0.001 - sd diff: -0.001 is Normal? TRUE"