solomonxie / blog-in-the-issues

A personalised tech-blog, notebook, diary, presentation and introduction.
https://solomonxie.github.io

Statistical Guessing 统计式瞎猜 #50

Open solomonxie opened 6 years ago

solomonxie commented 6 years ago

Statistics is all about PREDICTION: given some real information, predict what will happen next.

Study Resources

Tools

Khan academy AP Statistics

Machine Learning related topics

solomonxie commented 5 years ago

❖ Geometric Random Variables

Refer to wiki: Geometric distribution
Refer to Khan academy: Geometric random variables introduction

「Geometric Random Variable」 vs. 「Binomial Random Variable」

image

In both settings, the trials are independent and the probability of success remains the same on each trial.

The only difference between a Geometric R.V. and a Binomial R.V. is that the Geometric R.V. DOES NOT have a fixed number of trials.

Requirements of Geometric R.V.:

Understanding「Geometric Probability」

The geometric distribution gives the probability that the first occurrence of success requires k independent trials, each with success probability p.

image

Watch out for what the variable counts. If it's asking for the Number of Trials, then trials = failures + success = (n−1) + 1 = n, so P(X = n) = (1−p)ⁿ⁻¹·p. If it's asking for the Number of Failures before the first success, then trials = failures + 1 = n + 1, so P(Y = n) = (1−p)ⁿ·p. And that's why the two formulas are slightly different.

Assume p is the probability of success on each trial, and n is the number of trials or failures:
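The trial-counting formula above is easy to sketch in a couple of lines of Python (the function name is my own, nothing standard):

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(X = k): the first success lands exactly on trial k (k = 1, 2, 3, ...).

    The first k - 1 trials all fail, then trial k succeeds.
    """
    return (1 - p) ** (k - 1) * p
```

E.g. rolling a die until the first six: `geometric_pmf(3, 1/6)` is (5/6)² · (1/6).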

Example

image Solve: image

「Mean & Variance」 of Geometric Probability

image
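Those two facts (E[X] = 1/p, Var(X) = (1−p)/p²) are one-liners in Python (function names are mine):

```python
def geometric_mean(p: float) -> float:
    """Expected number of trials until the first success: E[X] = 1 / p."""
    return 1 / p

def geometric_variance(p: float) -> float:
    """Var(X) = (1 - p) / p**2."""
    return (1 - p) / p ** 2
```

So with p = 0.25 you expect the first success on trial 4, on average.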

「Cumulative Geometric Probability」

We know how to calculate the Geometric Probability at each value, but the Cumulative G.P. is a bit trickier.

image

The formula literally means: FAIL a TIMES IN A ROW.

This formula directly gives P(X > a), and with a bit of a twist you can derive the other cases from it, e.g. P(X ≤ a) = 1 − (1−p)ᵃ.
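In Python, the "fail a times in a row" idea and its complement look like this (a sketch; function names are mine):

```python
def geom_sf(a: int, p: float) -> float:
    """P(X > a): the first a trials all fail, i.e. (1 - p) ** a."""
    return (1 - p) ** a

def geom_cdf(a: int, p: float) -> float:
    """P(X <= a): the complement -- at least one success in the first a trials."""
    return 1 - (1 - p) ** a
```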

Example

image Solve:

Example

image Solve:

Example

image Solve: image

solomonxie commented 5 years ago

❖ Sampling Distribution

It's just taking a statistic (Mean/SD...) from many different samples of the SAME population, and making those statistics into a distribution of their own, e.g. the Distribution of means, the Distribution of standard deviations...

Refer to Khan academy: Introduction to sampling distributions

「Bias」 of Sample Statistic

Refer to Khan academy: Sample statistic bias worked example

Example: The dotplots below show an approximation to the sampling distribution for three different estimators of the same population parameter, and the actual value of the population parameter is 2. image

「Shape」 of Sample Distribution

Usually the Sampling Distribution is approximately Normally distributed, but only under these conditions:

It means that:

But under some extreme conditions it can also be skewed, e.g. when the expected number of successes in the sample is less than 10.

Example

image Solve:

solomonxie commented 5 years ago

❖ Sample Mean

Also called "Sampling distribution of Sample Mean".

「Mean & Variance」 of Sample Means

image
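The two facts in the formula (μ_x̄ = μ, σ_x̄ = σ/√n) fit in one tiny Python function (name is mine):

```python
import math

def sampling_dist_of_mean(mu: float, sigma: float, n: int) -> tuple:
    """Mean and SD of the sampling distribution of the sample mean:
    the mean stays mu, the SD shrinks to sigma / sqrt(n)."""
    return mu, sigma / math.sqrt(n)
```

E.g. IQ scores with μ = 100, σ = 15: samples of size 9 have means spread with SD 15/3 = 5.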

Example

image Solve: image

「Probability of sample mean」 exceeding a value

Refer to Khan academy: Example: Probability of sample mean exceeding a value

Example

image Solve:


solomonxie commented 5 years ago

❖ Sample Proportion

The full name is Sampling Distribution of the Sample Proportion, which is denoted by p-hat.

Refer to youtube: The Sampling Distribution of the Sample Proportion
Refer to article: The Sample Proportion
Refer to article: Sampling Distribution of the Sample Proportion, p-hat

Sample Proportion is the proportion of success in a sample.

The Sample Proportion (p-hat) is a random variable, built on a Binomial Random Variable (p̂ = X/n).

So let X denote the number of successes in the sample, which is a Binomial Random Variable with parameters n and p,

image

「Mean & Variance」 of Sample Proportions

Recall that the binomial random variable X:

Hence, we derived the Mean & Variance of Sample Proportion p-hat from X: image image

Why is that?

image

That's why we say: p-hat is an unbiased estimator for p of population.

And for Standard Deviation of Sample Proportion: image
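Those two results (μ_p̂ = p, σ_p̂ = √(p(1−p)/n)) can be sketched in Python like so (function name is mine):

```python
def sampling_dist_of_proportion(p: float, n: int) -> tuple:
    """Mean and SD of the sampling distribution of p-hat:
    mean = p (unbiased), SD = sqrt(p * (1 - p) / n)."""
    return p, (p * (1 - p) / n) ** 0.5
```

E.g. flipping a fair coin 100 times: p̂ centres on 0.5 with SD 0.05.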

Example

image Solve: image

「Probabilities」 with Sample Proportions

It's just to find out the probability area in the Normal Distribution. All you need is the Mean, Standard Deviation and the point you're to measure.

Example

image Solve:

solomonxie commented 5 years ago

❖ Standard Error (SE) [DRAFT]

Standard Error is really just short for the Standard Deviation of the Sampling Distribution.

Refer to wiki: Standard error

image

Why is it an 「Error」?

Assume the Population mean is 0, then the Standard Error is the error/distance of the Sample Mean away from the True population mean.

What if we don't know the 「Population variance」?

「Standard Error of Sample Mean」SEM

If we're to find the Standard Deviation of the Mean of a Sampling Distribution, we call it the Standard Error of the Mean (SEM).

Refer to Khan academy: Standard error of the Mean

Standard Error gives us how far the Sample Mean will deviate from the true mean.

image

「Standard Error of Sample Proportion」

solomonxie commented 5 years ago

❖ 「Central Limit Theorem」 and 「Law of Large Numbers」 [DRAFT]

Refer to article: Central Limit Theorem and Law of Large Numbers

The Central Limit Theorem is about the SHAPE of the distribution. The Law of Large Numbers tells us where the CENTRE (maximum point) of the bell is located.

「Central Limit Theorem」

One of the most fundamental & profound concepts in Statistics or even Mathematics.

Refer to youtube: Introduction to the Central Limit Theorem
Refer to Khan academy: Central limit theorem

The theorem is about the Sample Mean, saying:

The distribution of Sample Mean tends towards the Normal Distribution as the Sample Size increases, regardless of the shape of Population Distribution.

As a very rough guideline, the Sample Mean is approximately Normally distributed if the Sample Size is at least 30.
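You can watch the theorem happen with a quick stdlib simulation: start from a clearly skewed population, take many samples of size 30, and the sample means come out bell-shaped around the population mean (the numbers here are illustrative, nothing more):

```python
import random
import statistics

random.seed(42)
# A clearly skewed population (exponential), far from Normal.
population = [random.expovariate(1.0) for _ in range(100_000)]

# Take many samples of size 30 and record each sample's mean.
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(2_000)]

# The distribution of sample means centres on the population mean,
# and its spread shrinks to roughly sigma / sqrt(30).
print(statistics.mean(population), statistics.mean(sample_means))
print(statistics.pstdev(population) / 30 ** 0.5, statistics.stdev(sample_means))
```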

That being said:

How 「Normal」 is the distribution

Refer to Khan academy: Sampling distribution of the sample mean

「Skewness」

image

「Kurtosis」 Tailedness

image

『Standard Deviation』

The SD tends to be smaller and smaller as the Sample Size increases or the more times you take samples.

「Law of Large Numbers」 (LLN)

Refer to Khan academy: Law of large numbers
Refer to wiki: Law of large numbers

It is saying: The average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

image

Strong 「Law of large numbers」

Weak 「Law of large numbers」

solomonxie commented 5 years ago

❖ Confidence Interval (CI)

Since there will always be sampling error when estimating the true population, it's good practice to attach a confidence interval to any estimation made from samples.

Refer to youtube: Understanding Confidence Intervals: Statistics Help
Refer to article: Confidence Level & Margin of Error

A Confidence Interval is a "Tolerance Interval": statistically, an estimated RANGE of values that seem reasonable, which controls how accurate we want the estimation to be.

We've learnt how to estimate an exact value for Population parameters. But an exact estimate will almost never be exactly right. Hence a confidence interval gives a more reliable way to describe/guess the population.

image

「Inference」 & 「Inferential Statistics」

Inference means the conclusions we draw from the sample to describe the population.

Inferential Statistics or Statistical inference means going from describing data we already have to making inferences about data we don't have.

「Confidence Level」 Tolerance Level

The Confidence Level is a decision we make about how precise we want the guessing to be. 95% is the confidence level people most often use.

image

Width of Confidence Interval: The width of confidence interval will be affected by two things:

「Point Estimator」 Expected Value

You can literally call it the Best estimator, which is the best estimate of a population parameter.

A point estimate of a population parameter is a single value used to estimate the population parameter. For example, the sample mean x is a point estimate of the population mean μ.

Point Estimate uses sample data to calculate a single value which is to serve as a "best guess" or "best estimate" of an unknown population parameter (for example, the population mean). More formally, it is the application of a point estimator to the data to obtain a point estimate.

Point Estimate is often set to be the Sample Mean, as the Centre of Confidence Interval.

Why set the 「Point Estimator」 as 「Centre」?

Because the Point Estimate (Expected value) is our Best guess, and every value that differs from it is seen as an error. By stacking up all the errors around the "best guess" within our "Tolerance level" (Confidence level), we get a Confidence Interval.

「Sampling Error」

It's also called "Variation due to sampling".

Since the sample will NEVER perfectly represent the true population, there will always be Sampling error.

「Margin of Error」

You can literally call it Limit of confidence or Confidence Limit.

We make a decision on the confidence level we want, and we set the Sample Mean as the CENTRE of the range, which slices the interval in half:

image

image

The confidence limits (min/max) are given by this formula, which uses the Margin of Error:

image

- x = mean of the sample
- z = z-score representing the size of the confidence interval you have set, measured in units of standard deviations from the mean
- s = standard deviation of the sample
- n = number of entries in the sample

Formula for both 「Z-interval」 & 「T-interval」

image

What 「z-score」 should you use?

image

| Confidence | Z |
|---|---|
| 80% | 1.282 |
| 85% | 1.440 |
| 90% | 1.645 |
| 95% | 1.960 |
| 99% | 2.576 |
| 99.5% | 2.807 |
| 99.9% | 3.291 |
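Rather than memorizing the z-table, you can compute z* directly with the standard library (`statistics.NormalDist` exists in Python ≥ 3.8; the wrapper name is mine):

```python
from statistics import NormalDist

def z_star(confidence: float) -> float:
    """Critical z value for a two-sided interval: leaves (1 - confidence) / 2
    probability in each tail of the standard Normal."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)
```

E.g. `z_star(0.95)` gives roughly 1.960, matching the table row for 95%.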
solomonxie commented 5 years ago

~Calculating Confidence Interval~ [DEPRECATED]

Calculating the Confidence Interval basically just converts a confidence level, say 95%, into a real value range, e.g. (13kg, 28kg).

Refer to Head First Statistics: Chapter.12

There're a few ways to calculate:

We use the traditional Normal-based method more often.

Assume we've decided which population statistic we'll be estimating, and which level of confidence we need it to be, then there're 2 steps to calculate the Confidence Interval:

  1. Find the Sampling Distribution
  2. Find the Confidence Limits

Finding the Sampling Distribution

We know for constructing a Sampling Distribution we need the Mean & Variance: image

Since the population variance 𝜎² is UNKNOWN to us, we estimate it by the sample's "best estimator" (point estimator), which is the Sample Variance s² in this case. image

For the Sample Variance there are two types of formula. Here is the common one: image

here is the formula for sample proportion: image

We assume the Sampling Distribution is normally distributed.

Finding the Confidence Limits

Now we

Inverse Cumulative Normal Probability

Invert the given cumulative normal probability (Confidence Level) back to z-score.

We've learnt how to convert a percentile to z-score, and how about the Cumulative Normal Probability?

It's easy: in the graph we see that the Confidence Level is the middle part. If we cut the middle off, we get two tails, and either one tells us the percentile position.

image

solomonxie commented 5 years ago

❖ 「Z-Interval」 Z statistics

Z interval is the Confidence Interval constructed using Z-score.

▶︎ Jump back to previous note on: Z-score

Conditions for a valid 「Z Interval」

The conditions we need for inference on one proportion are:

Formula of 「Z-interval」

image

Understanding the formula for 「Margin of Error」

Remember the Standard Error is (X−μ)/Z, in which (X−μ) is the distance from the Sample to the Population, the so-called Margin of Error, which is the thing we're looking for.
So computing Z · (X−μ)/Z = (X−μ) is kind of reversing the normalization of the distance back into the real distance.

「One-sample」 Z Interval

Only take the sample once from the population.

▶︎ Practice at Khan academy: Calculating a z interval for a proportion

▶︎ Tool: Omni Online Confidence Interval Calculator

Refer to Khan academy: Critical value (z*) for a given confidence level

Here is the formula for a one-sample z interval for a sample proportion:

image

in which the margin of error is:

image
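Putting the formula together in Python: the interval is p̂ ± z·√(p̂(1−p̂)/n) (a sketch; the function name is mine):

```python
def z_interval_for_proportion(p_hat: float, n: int, z: float) -> tuple:
    """One-sample z interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    margin_of_error = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - margin_of_error, p_hat + margin_of_error
```

E.g. 50 successes out of n = 100 at 95% confidence (z = 1.96) gives roughly (0.402, 0.598).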

Example

image Solve: image

Example

image Solve: image

「Sample Size」 & 「Margin of Error」

Example

image Solve: image

Example

image Solve: image

Estimating 「Margin of Error」

Refer to Khan academy: Determining sample size based on confidence and margin of error

Example

image Solve:

「Z-table」

image

solomonxie commented 5 years ago

❖ T-Interval (t statistics)

T interval is good for situations where the sample size is small and population standard deviation is unknown.

When the sample size is very small (n ≤ 30), the Z-interval becomes a less reliable estimate of the confidence interval. That's where the T-interval comes into play.

Refer to Khan academy: Small sample size confidence intervals

「T-Distribution」

The full name is Student's t-distribution, which is a tweaked version of Normal Distribution.

Refer to wiki: Student's t-distribution

When the sample size is small, the Normal distribution is no longer a good fit for estimating the population. So we introduce a tweaked version of the Normal Distribution for small-sample data, which we call the T-distribution.

「T-distribution」 vs. 「Normal distribution」

They have the same centre: Sample Mean. But the tail of t-distribution is "fatter" than the Normal distribution.

image

Conditions for a valid 「T Interval」

The conditions we need for inference on one proportion are:

「T-score」

Refer to article: What is the T Score Formula?

A t score is one form of a Standardized Test Statistic (the other you'll come across in elementary statistics is the z-score). The t score formula enables you to take an individual score and transform it into a standardized form, one which helps you to compare scores. You'll want to use the t score formula when you don't know the population standard deviation and you have a small sample (under 30).

The t score formula is: image (x⁻ is the Sample Mean, μ₀ is mean from null hypothesis, sx is the Sample SD, n is Sample size)

Understanding the formula

The statistic − parameter gives the DISTANCE from the Sample mean to the Population mean. The Standard Error represents the DISTANCE from the Sample SD to the population SD. => Therefore, dividing the distance of the mean by the distance of the SD results in a Normalized Distance for the mean.

▶︎ Jump back to previous note on: Standard Error

Formula of 「T-interval」

The difference from the Z-interval's formula is that instead of the Z* value we use the T* value, and the calculation of the Standard Error is different too.

image

「One-sample」 T interval

image
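As a sketch in Python: the interval is x̄ ± t*·s/√n. The standard library has no t distribution, so here t* is looked up from a t-table (with n − 1 degrees of freedom) and passed in; the function name is mine:

```python
def t_interval_for_mean(x_bar: float, s: float, n: int, t_star: float) -> tuple:
    """One-sample t interval: x_bar +/- t_star * s / sqrt(n).
    t_star comes from a t-table with n - 1 degrees of freedom."""
    margin_of_error = t_star * s / n ** 0.5
    return x_bar - margin_of_error, x_bar + margin_of_error
```

E.g. x̄ = 10, s = 2, n = 16 with t* = 2.131 (df = 15, 95%) gives about (8.93, 11.07).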

Example

image Solve: image

T interval for 「paired data」

Refer to article on Khan academy: Making a t interval for paired data

「T-table」

image

solomonxie commented 5 years ago

❖ Hypothesis Testing

Hypothesis Testing means we make an assumption, or hypothesis, about something, then run a test and use the resulting statistic as evidence against the hypothesis.

We can NEVER prove the null hypothesis, because "INNOCENT UNTIL PROVEN GUILTY".

Refer to youtube: What is a Hypothesis Test and a P-Value?

image

「Null Hypothesis」 & 「Alternative Hypothesis」

Refer to youtube: Hypothesis Testing 2: null and alternative hypothesis (one sample t test)

Notations:

E.g., if the null hypothesis is "Jason's IQ is 130", then the alternative hypothesis is "his IQ is below 130".

The 「Null Hypothesis」 H₀

The null hypothesis should always contain a statement of equality. Another way of thinking of it is that the null hypothesis is a statement of "no difference." We can write the null hypothesis in the form:

The 「Alternative Hypothesis」 Ha

The alternative hypothesis could take one of three forms, depending on the context of the test:

Example

image Solve: image

「Test statistic」

Test statistic is the Normalized value for the evidence in hypothesis, which could be:

Once you get the Test statistic value in a Normal Distribution, you'll easily get the probability area, which you could compare with the threshold.

「Simple Hypothesis Testing」

Example

image Solve:

Example

image Solve:

Conditions for 「Inference on a proportion」

When we want to carry out inferences on one proportion (build a confidence interval or do a significance test), the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it's important to check whether these conditions have been met; otherwise the calculations and conclusions that follow aren't actually valid.

The conditions we need for inference on one proportion are:

solomonxie commented 5 years ago

❖ P-value (One-tailed)

p-value stands for "probability value", which is the most confusing concept in Hypothesis testing. So it's necessary to pin it down here before proceeding to Significance Testing.

Refer to youtube: Hypothesis Testing 5: p values (one sample t test)

p-value tells the MAXIMUM of the "truth" takes part in your story.

The smaller the true part (p-value) in the story, the greater the evidence against the story(null hypothesis).

For example, you said your IQ is 130. So we build a MODEL based on your claim. And then we ask you to take a real IQ test, which tells your IQ is 117. So the calculation tells that the REAL part takes at most 0.47% of your "story". Therefore, if there is less than 0.47% truth in a story, we can claim the story is a LIE! And 0.47% is the p-value.

image

Steps to calculate 「p-value」

image

p-value from 「Discrete Distribution」

Just count how many outcomes are at least as "far" from the mean as the sample data, and divide by the total number of outcomes.

image

Example

image Solve: image

Example

image Solve: image

p-value from 「Continuous Distribution」

p-value is a 「Conditional Probability」

▶︎ Jump back to previous note: Conditional Probability

image

image

solomonxie commented 5 years ago

❖ Significance Testing

Also called the Null Hypothesis Significance Testing.

We design a Significance Test to evaluate the strength of the evidence against some null hypothesis. The alternative hypothesis is the claim we are trying to find evidence in favor of.

The Significance Test involves with these concepts:

「Significance Level」 ⍺, alpha, threshold

To "judge" whether the hypothesis stands or fails, we need a standard or threshold, which we call the Significance Level, or the Cutoff, denoted ⍺ (alpha).

There are a few common sets on the significance level:

image

「Critical Values」 & 「Rejection Regions」

Refer to youtube: Hypothesis Testing 4: critical values and rejection regions (one sample t test)

image

「p-value」

The p-value tells the MAXIMUM part the "truth" takes in your story.

▶︎ Jump back to previous note: p-value

Steps of 「Significance Testing」

Refer to article on Khan academy: Using P-values to make conclusions

image

Use 「P-value」 to make conclusion

image

image

Use 「Confidence interval」 to make conclusion

Refer to Khan academy: Confidence interval for hypothesis test for difference in proportions

▶︎ Jump back to previous note on: Confidence Interval ▶︎ Jump back to previous note on: Significance Test

In a two-sided test, the null hypothesis says there is no difference between the two proportions. In other words, the null hypothesis says that the difference between the two proportions is 0.

We can use a confidence interval instead of a P-value for two-sided tests as long as the confidence level and significance level add up to 100%. image

For example, image That being said, if the Confidence Interval DOES NOT overlap with the Null Hypothesis Difference, 0 in this case, then the "true difference" falls into the Significance Level, and the null hypothesis should be rejected.

Since Confidence Level + Significance Level = 100%:

image

Example

image Solve:

solomonxie commented 5 years ago

❖ Testing Errors (Mistakes)

image

「Type 𝐈 Errors」 & 「Type 𝐈𝐈 Errors」

Type I & Type II Errors are conditional probabilities given the hypothesis is true or false.

Refer to Khan academy: Introduction to Type I and Type II errors

What will happen when we're to reject a hypothesis?

The good thing to do is to Reject the false and Not reject the truth. The bad thing (error) to do is to Reject the truth and Not reject the false, which are called the Type I Error and Type II Error.

「Truth World」 & 「Lies World」

Although both of them are mistakes made while trying to do the right thing, they happen under totally opposite conditions: one exists in the Truth World, the other in the Lies World.

Since they're "living in different worlds" and are Conditional Probabilities, the calculations are different too:

image

Example

Jump to Khan academy for practice: Type I vs Type II error

image

Solve: image

solomonxie commented 5 years ago

❖ Statistical Power

Power, in the context of Statistical Testing, stands for the conditional probability of REJECTING a FALSE null HYPOTHESIS.

"power -> power of justice -> ability to remove the bad one"

image

「Truth World」 & 「Lies World」

Note that, the Power is a Conditional Probability, on the condition of False null hypothesis.

That being said, the distribution is NOT built on the null hypothesis is true anymore, but on the null hypothesis is false.

image

How to increase 「Power」

Power is the likelihood that our sample result leads us to correctly reject a false null hypothesis.

The main purpose of studying the power is to get more chance to do the RIGHT thing.

There are two main settings that affect the power of a significance test:

image

Impact of 「Significance Level」 ⍺

The logic is:

image

Impact of 「Sample Size」

Larger sample sizes increase power.

Example

image Solve:

Example

image Solve:

solomonxie commented 5 years ago

❖ Z Test (z statistics)

Z Test is a test constructed using the Z-score

▶︎ Jump back to previous note on: Z-score ▶︎ Jump back to previous note on: Z-interval

Formula of 「Z Test Statistic for proportion」

▶︎ Jump back to previous note on: Sample Proportion

The test statistic gives us an idea of how far away our sample result is from our null hypothesis. For a one-sample z-test for a proportion, our test statistic is:

image (where p̂ is the Sample proportion, p₀ is the proportion from the null hypothesis, and n is the sample size)

Understanding the formula: The statistic − parameter gives the DISTANCE from the Sample proportion to the Population proportion. The Standard Deviation of the statistic represents the ~DISTANCE from Sample SD to population SD.~ Therefore, dividing the distance of the proportion by the SD results in a Normalized Distance for the proportion.
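In Python the test statistic and its one-tailed P-value look like this (a sketch using the stdlib `statistics.NormalDist`; function names are mine):

```python
from statistics import NormalDist

def z_test_for_proportion(p_hat: float, p0: float, n: int) -> float:
    """z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n).
    Note the standard error uses p0: the test assumes H0 is true."""
    se = (p0 * (1 - p0) / n) ** 0.5
    return (p_hat - p0) / se

def one_tailed_p_value(z: float) -> float:
    """P(Z >= z) for an upper-tailed test on the standard Normal."""
    return 1 - NormalDist().cdf(z)
```

E.g. 60 successes in n = 100 against H₀: p = 0.5 gives z = 2.0, then `one_tailed_p_value(2.0)` ≈ 0.023.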

Calculating 「Z Test」 about a proportion

Refer to Khan academy: Calculating a z statistic in a test about a proportion

Example

image Solve: image

Calculating a 「P-value」 given a z statistic

Refer to Khan academy: Calculating a P-value given a z statistic

Example

image Solve: image

Example

image Solve:

Example

image Solve:

Making conclusions in a 「z test」 for a proportion

Calculate Z-value -> Convert to P-value -> Compare with ⍺ level -> Make decision.

Example

image Solve: image

solomonxie commented 5 years ago

❖ T Test (t statistics)

Refer to article: Understanding t-Tests: t-values and t-distributions

T-tests are all based on t-values. T-values are an example of what statisticians call Test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. The procedure that calculates the test statistic compares your data to what is expected under the null hypothesis.

▶︎ Jump back to previous note on: T-score

▶︎ t-distribution online calculator

Formula of 「One-sample T test」 for Mean

The test statistic gives us an idea of how far away our sample result is from our null hypothesis.

For a one-sample t test for a mean, our test statistic is: image (x̄ is the Sample Mean, μ₀ is the mean from the null hypothesis, sx is the Sample SD, n is the Sample size)

Understanding the formula: The statistic − parameter gives the DISTANCE from the Sample mean to the Population mean. The Standard Error represents the DISTANCE from the Sample SD to the population SD. Therefore, dividing the distance of the mean by the distance of the SD results in a Normalized Distance for the mean.
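The t statistic itself is a one-liner (function name mine); the P-value then comes from a t-table or calculator with n − 1 degrees of freedom, since the stdlib has no t distribution:

```python
def t_statistic(x_bar: float, mu0: float, s: float, n: int) -> float:
    """t = (x_bar - mu0) / (s / sqrt(n)): distance of the sample mean
    from the null-hypothesis mean, in standard-error units."""
    return (x_bar - mu0) / (s / n ** 0.5)
```

E.g. x̄ = 10.5 against H₀: μ = 10 with s = 2, n = 16 gives t = 0.5 / 0.5 = 1.0.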

Calculating the 「test statistic」 in a t test for a mean

Example

image Solve: image

Calculating the 「P-value in a t test」 for a mean

Example

image Solve:

Example

image Solve: image

solomonxie commented 5 years ago

❖ 「Z Test」 vs. 「T Test」 (One-sample)

Refer to Khan academy: When to use z or t statistics in significance tests

▶ Jump back to previous note on: Z Test ▶ Jump back to previous note on: T Test

Formulas

Proportion -> Z-test
Mean -> T-test

Z test for proportion: image

T test for mean: image

「Z-statistics」 vs. 「T-statistics」

Refer to Khan academy: Z-statistics vs. T-statistics
Refer to Khan academy: Small sample hypothesis test
Refer to Khan academy: Large sample proportion hypothesis testing

Large sample size -> Z-test
Small sample size -> T-test

image

solomonxie commented 5 years ago

Two-tailed Test [DRAFT]

「One-tailed」 vs. 「Two-tailed」

Refer to Khan academy: One-tailed and two-tailed tests

solomonxie commented 5 years ago

Inference for comparison

Two-sample inference for the difference between groups

Conditions for 「Two-sample Z」

"same-o same-o".

The conditions we need for inference on two proportions are:

This interval doesn't require equal sample sizes from each population. The formulas we use allow for different sample sizes.

image

solomonxie commented 5 years ago

「Confidence Interval」 for Difference

Refer to Khan academy: Confidence intervals for the difference between two proportions

image

CI for 「Difference on Proportions」

▶ Jump back to previous note on: Z-intervals

Formula

We can make a confidence interval for the difference of two proportions like this:

image

Example

image Solve:

Example

image Solve: image

CI for 「Difference on Means」

▶ Jump back to previous note on: T-intervals

Refer to Khan academy: Constructing t interval for difference of means

Formula

We can make a confidence interval for the difference of two means like this: image

Example

image Solve:

solomonxie commented 5 years ago

❖ Significant Difference Test (Proportions)

This is the "Hypothesis Test for a difference", though "significant difference test" sounds more intuitive, because it's testing for a significant difference.

Refer to Khan academy: Hypothesis test for difference in proportions

▶ Jump back to previous note on: One-sample Z Test

Reminder: One-sample Z Test image

Formula for 「Two-Sample Z Test」

Combining the proportion of successes

In this type of test, it's useful to first calculate the pooled (or combined) proportion of successes in both samples:

image

We do significance tests assuming that the null hypothesis is true. In this test, our null hypothesis is that the two population proportions are equal, but we don't have a hypothesized value for their common proportion. Our best estimate for this value is Ṕc. We'll use this pooled (or combined) value in the standard error formula where we'd ideally use each population proportion.

image

The Hypothesis difference is 0 when H₀: p1 - p2 = 0.
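The pooled-proportion recipe above can be sketched in Python like this (function name is mine):

```python
def two_sample_z(successes1: int, n1: int, successes2: int, n2: int) -> float:
    """z statistic for H0: p1 = p2, using the pooled proportion in the SE."""
    p1, p2 = successes1 / n1, successes2 / n2
    pc = (successes1 + successes2) / (n1 + n2)      # pooled proportion
    se = (pc * (1 - pc) * (1 / n1 + 1 / n2)) ** 0.5
    return (p1 - p2) / se
```

E.g. 60/100 vs. 40/100 pools to p̂c = 0.5 and gives z ≈ 2.83.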

Z-value in 「Two-sample Z Test」

Example

image Solve:

P-value in a 「Two-sample Z Test」

Example

image Solve:

solomonxie commented 5 years ago

❖ 「Confidence Interval」 for 「Hypothesis Test」

Refer to Khan academy: Confidence interval for hypothesis test for difference in proportions

▶︎ Jump back to previous note on: Confidence Interval ▶︎ Jump back to previous note on: Significance Test

In a two-sided test, the null hypothesis says there is no difference between the two proportions. In other words, the null hypothesis says that the difference between the two proportions is 0.

We can use a confidence interval instead of a P-value for two-sided tests as long as the confidence level and significance level add up to 100%. image

For example, image That being said, if the Confidence Interval DOES NOT overlap with the Null Hypothesis Difference, 0 in this case, then the "true difference" falls into the Significance Level, and the null hypothesis should be rejected.

image

Example

image Solve:

solomonxie commented 5 years ago

「Paired T Test」vs. 「Two-sample T Test」

Refer to Khan academy: Example of hypotheses for paired and two-sample t tests

Consider the design of the study:

image

Example

image Solve:

Example

image Solve:

Example

image Solve:

solomonxie commented 5 years ago

❖ Significant Difference Test (Means)

▶ Jump back to previous note on: One-sample T test

Reminder: One-sample T Test image

Formula for 「Two-sample T Test」

image

The difference μ1 − μ2 comes from the null hypothesis. In this type of test, we assume the population means are equal, μ1 = μ2, which results in μ1 − μ2 = 0.
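Under that null hypothesis the two-sample t statistic can be sketched as (function name mine; the P-value then needs a t-table with the conservative degrees of freedom):

```python
def two_sample_t(x1: float, s1: float, n1: int,
                 x2: float, s2: float, n2: int) -> float:
    """t = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2),
    assuming H0: mu1 - mu2 = 0."""
    se = (s1 ** 2 / n1 + s2 ** 2 / n2) ** 0.5
    return (x1 - x2) / se
```

E.g. means 10 vs. 8, both with s = 2 and n = 16, give t = 2 / √0.5 ≈ 2.83.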

「T-value」 for Two-sample Test

Example

image Solve: image

「P-value」 for Two-sample Test

For the Degrees of freedom in the Two-sample Test, we use the SMALLER sample size (the conservative choice: df = smaller n − 1).

Example

image Solve:

Use 「CI」 to make conclusions about the 「difference of means」

▶︎ Jump back to previous note: Significance Testing

Normally, we can make a conclusion simply by comparing the P-value with the Significance level.

But some problems ask us to make the conclusion by comparing the Confidence level with the Significance level. In that case, we can judge it simply by examining whether the Confidence interval covers 0 or not.

Since Confidence Level + Significance Level = 100%:

Example

image Solve:

Example

image Solve: image

solomonxie commented 5 years ago

❖ 「Z Statistic」 vs. 「T Statistic」

▶ Jump back to previous note on: Z Test vs. T Test (One-sample)

image

「One-sample」

Refer to Khan academy: When to use z or t statistics in significance tests

▶ Jump back to previous note on: Z Test ▶ Jump back to previous note on: T Test

Formulas

Proportion -> Z-test
Mean -> T-test

One-sample Z test for proportion: image

One-sample T test for mean: image

「Z-statistics」 vs. 「T-statistics」

Large sample size -> Z-test
Small sample size -> T-test

「Two-sample」

▶ Jump back to previous note on: Significant Different Test (Proportions) ▶ Jump back to previous note on: Significant Different Test (Means)

Formulas

Proportion -> Z-test
Mean -> T-test

Two-sample Z test for proportion: image

Two-sample T test for mean: image

Z & T 「Confidence Intervals」

Formulas

Z-score for CI: image

T-score for CI: image

solomonxie commented 5 years ago

❖ Chi-Squared Testing [DRAFT]

Chi is written as the Greek letter 𝐗, which looks like x and reads "kai". Chi-squared is written as 𝐗².

Like Z-score in a Normal Distribution for a Z-test, T-score in a T-Distribution for a T-test, 𝐗² is the "Test statistic" in Chi-square Test, which converts sample data to a standardized value in a Chi-square Distribution.

Chi-square Test is a hypothesis test for categorical data.

Refer to wiki: Chi-squared test
Refer to Khan academy: Chi-square statistic for hypothesis testing
Refer to Crash course: Chi-Square Tests: Crash Course Statistics #29

Conditions for a goodness-of-fit test

Example

image Solve:

Formula of 「Chi-squared Test statistic」

image

How to understand the formula? It's not hard to see it's a way to standardize the data:
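The standardization can be sketched in a few lines of Python (function name mine): for each category, square the gap between observed and expected counts, scale it by the expected count, and sum.

```python
def chi_square_statistic(observed: list, expected: list) -> float:
    """X^2 = sum over categories of (observed - expected)**2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

E.g. observed counts [10, 20, 30] against expected [20, 20, 20] give 𝐗² = (100 + 0 + 100) / 20 = 10.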

「Chi-squared Distribution」

image

「P-value」 for Chi-squared Test

image

「Type 1」: Chi-square Goodness-of-Fit Test

Tests how well certain proportions fit our sample, which only has ONE variable (row).

「Type 2」: Tests of Independence

Looks at whether being a member of ONE CATEGORY is independent of being a member of THE OTHER; the table has TWO variables (rows).

「Type 3」: Tests of Homogeneity

It's looking at whether it's likely that Different samples come from the Same population.

solomonxie commented 5 years ago

❖ Chi-square 「Goodness-of-fit Test」

Goodness-of-fit Test is good for testing a One-row Frequency Table. The test shows how well certain proportions fit our sample, which only has ONE variable (row).

Steps:

「Expected Frequencies」

Counting the Expected Frequency of the data is the very first and fundamental step of doing a Chi-square Test.

The expected frequencies can come either from a PRESET distribution or from the PROBABILITY of the data. The expected frequencies are set by the Null Hypothesis, and the Observed frequencies are tested against them.

"For a χ² goodness-of-fit test, the null hypothesis is that the population distribution of the categorical variable in question matches some hypothesized distribution. We use that hypothesized distribution to calculate the expected counts for each value of the variable."
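So the expected count for each category is just the sample size times the hypothesized proportion. A minimal sketch (sample size and proportions invented):

```python
def expected_counts(n, hypothesized_props):
    # Expected count per category = sample size * hypothesized proportion
    return [n * p for p in hypothesized_props]

print(expected_counts(200, [0.5, 0.3, 0.2]))  # [100.0, 60.0, 40.0]
```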

Example

image Solve:

Test statistic 「𝐗²」

Chi-squared Test statistic Formula: image

To calculate 𝐗², we need to COMPLETE the Frequency Table, with both Expected and Observed values: image

Or you can see it as: image

「P-value」

To calculate P-value we need the 𝐗² and DF: image

For instance, suppose we observe the prices of 3 fruits: apple, orange, and banana. Then there are 3 categories. Therefore the DF (degrees of freedom) is (3-1)=2.

Open an online chi-squared calculator and input the test statistic 𝐗² and the DF, and we'll get its P-value, like this:

image
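If a calculator isn't handy, the upper-tail area has a closed form when the DF is even; the sketch below implements only that special case (for odd DF, or in general, use a table or `scipy.stats.chi2.sf`):

```python
import math

def chi2_sf_even_df(x, df):
    # P(X^2 > x) for EVEN df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
    k = df // 2
    series = sum((x / 2) ** i / math.factorial(i) for i in range(k))
    return math.exp(-x / 2) * series

# e.g. test statistic X^2 = 2.0 with df = 2 (three categories)
print(round(chi2_sf_even_df(2.0, 2), 3))  # 0.368
```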

Example

image Solve:

Making conclusions in a 「goodness-of-fit Test」

To overturn the null hypothesis, we just compare the P-value with the Significance Level.

But there's another type of conclusion we can make: which component contributes the most to the test statistic. The way to do it is simply to look at each component's value: the bigger the component, the more it contributes.
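A quick sketch of that comparison (category labels and counts invented): compute each category's component of 𝐗² and pick the largest.

```python
def largest_component(labels, observed, expected):
    # Each category contributes (O - E)^2 / E to X^2;
    # the largest contribution drives the test statistic the most.
    parts = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    i = max(range(len(parts)), key=parts.__getitem__)
    return labels[i], parts[i]

print(largest_component(["A", "B", "C"], [20, 45, 35], [30, 30, 40]))  # ('B', 7.5)
```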

Example

image Solve: District B has the largest component because its observed count was farthest away from its expected count (relative to the expected count). So we can say that District B contributed the most to the 𝐗² test-statistic.

solomonxie commented 5 years ago

❖ Chi-squared Homogeneity Test

image

For the Chi-square homogeneity test we're gonna use this online calculator instead: ▶ Chi-Square Calculator

「Expected counts」

The Ratio can be either RowTotal / TableTotal or ColumnTotal / TableTotal; multiplying it by the other total gives the expected count of each cell.
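Either way, each cell's expected count works out to RowTotal × ColumnTotal / TableTotal. A small sketch with an invented 2×2 table:

```python
def expected_table(observed):
    # Expected count for cell (r, c) = row_total * column_total / grand_total
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[rt * ct / grand for ct in col_totals] for rt in row_totals]

print(expected_table([[30, 20], [20, 30]]))  # [[25.0, 25.0], [25.0, 25.0]]
```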

Example

image Solve:

Test statistic 「𝐗²」

image

「P-value」

Degree of Freedom (DF) in the Chi-square Homogeneity Test would be: image
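For a two-way table the DF works out to (rows − 1) × (columns − 1); a tiny sketch:

```python
def df_two_way(rows, cols):
    # Degrees of freedom for a two-way table
    return (rows - 1) * (cols - 1)

print(df_two_way(3, 4))  # 6
```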

Example

image Solve: image

Making conclusions in chi-square tests for 「two-way tables」

Refer to the "hint" of each practice problem: Making conclusions in chi-square tests for two-way tables

Selecting appropriate hypotheses

The chi-square statistic is such a versatile tool that we can use the exact same calculations to answer very different questions with it, depending on whether we draw our data from one sample or from independent samples or groups.

Ⓐ Multiple independent Sample groups

A chi-square test can help us when we want to know whether different populations or groups are alike with regards to the distribution of a variable. Our hypotheses would look something like this:

We call this the chi-square test for Homogeneity.

etc., image

Ⓑ One Sample group

A chi-square test can help us see whether individuals from a sample who belong to a certain category are more likely than others in the sample to also belong to another category. Our hypotheses would look something like this:

We call this the chi-square test of association/independence.

etc., image

Example

image Solve:

solomonxie commented 5 years ago
solomonxie commented 5 years ago
solomonxie commented 5 years ago

Orthogonal Least Squares

image

image

solomonxie commented 5 years ago

❖ 「Inference」 on Linear Regression

Conditions for 「inference on slope」 L-I-N-E-R

The conditions can be summarized by the acronym L-I-N-E-R: Linear relationship, Independent observations, Normal residuals, Equal variance, Random sample.

「Confidence interval」 for slope

Here's the formula for estimating the slope:

image

Notice:
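Numerically, the interval is just b ± t* · SE_b, with t* read from a t-table at df = n − 2. A sketch with invented regression output (b = 2.5, SE_b = 0.6, and t* ≈ 2.048 for 95% confidence at df = 28):

```python
def slope_ci(b, se_b, t_star):
    # CI for the slope: b +/- t* * SE_b
    return (b - t_star * se_b, b + t_star * se_b)

lo, hi = slope_ci(2.5, 0.6, 2.048)
print(round(lo, 3), round(hi, 3))  # 1.271 3.729
```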

Interpreting the output of 「Inference of Slope」

image

Example

image Solve:

「T statistic」 for Slope

Here is the formula for T statistic for slope: image
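That is, t = (b − β₀) / SE_b with df = n − 2, where the null value β₀ is usually 0. A sketch (numbers invented):

```python
def t_stat_slope(b, se_b, beta0=0.0):
    # t = (b - beta0) / SE_b for H0: slope = beta0 (usually 0)
    return (b - beta0) / se_b

print(round(t_stat_slope(2.5, 0.6), 2))  # 4.17
```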

Example

image Solve: image

Use 「CI」 to make conclusions about 「slope」

▶︎ Jump back to previous note: Significance Testing

Normally, we can make a conclusion simply by comparing the P-value with the Significance level.

But some questions ask us to make a conclusion by comparing the Confidence level with the Significance level. In that case, we can judge it by simply examining whether the Confidence interval covers 0 or not.

Since Confidence Level + Significance Level = 100%:

Example

image Solve: image

solomonxie commented 5 years ago

「Z」 or 「T」? [DRAFT]

Notations

There are two ways to get the statistic (z or t):

solomonxie commented 5 years ago
solomonxie commented 5 years ago
solomonxie commented 5 years ago

Non-Linear Transformation

The goal is to transform a Non-linear relationship into a Linear relationship, which is much easier to calculate and predict.

Refer to Khan academy: Transforming nonlinear data Refer to article: Non-Linear Transformation

There are several non-linear curves that can be transformed into linear curves.

image
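For example, exponential data y = a·bˣ becomes linear after taking logs, since log y = log a + x·log b. The sketch below checks that on exact invented data (y = 3·2ˣ): consecutive differences of log y are all the same constant, log 2.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [3 * 2 ** x for x in xs]       # exact exponential data
log_ys = [math.log(y) for y in ys]  # transform: log y = log 3 + x * log 2

# Constant first differences in log y mean the transformed data is linear in x
diffs = [round(log_ys[i + 1] - log_ys[i], 6) for i in range(len(log_ys) - 1)]
print(diffs)  # [0.693147, 0.693147, 0.693147, 0.693147] (each = log 2)
```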

Example

image Solve: image

Example

image Solve: image