solomonxie / blog-in-the-issues

A personalised tech-blog, notebook, diary, presentation and introduction.
https://solomonxie.github.io

Statistical Guessing 统计式瞎猜 #50

Open solomonxie opened 6 years ago

solomonxie commented 6 years ago

Statistics is all about PREDICTION: given some real information, predict what will happen next.

Study Resources

Tools

Khan academy AP Statistics

Machine Learning related topics

solomonxie commented 5 years ago

❖ Geometric Random Variables

Refer to wiki: Geometric distribution
Refer to Khan academy: Geometric random variables introduction

「Geometric Random Variable」 vs. 「Binomial Random Variable」

image

In both settings, the trials are independent and the probability of success remains the same on each trial.

The only difference between a Geometric R.V. and a Binomial R.V. is that the Geometric R.V. DOES NOT have a fixed number of trials.

Requirements of Geometric R.V.:

Understanding「Geometric Probability」

The geometric distribution gives the probability that the first occurrence of success requires k independent trials, each with success probability p.

image

Watch out for what the variable counts. If it's asking for the Number of Trials, then trials = failures + success = (n−1) + 1 = n, so P(X = n) = (1−p)ⁿ⁻¹·p. If it's asking for the Number of Failures before the first success, then trials = failures + 1 = n + 1, so P(Y = n) = (1−p)ⁿ·p. And that's why the two formulas are slightly different.

Assume p is the probability of success on each trial, and n is the number of trials or failures:
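The trial-counting formula above is easy to sketch in a couple of lines of Python (the function name is my own, nothing standard):

```python
def geometric_pmf(k: int, p: float) -> float:
    """P(X = k): the first success lands exactly on trial k (k = 1, 2, 3, ...).

    The first k - 1 trials all fail, then trial k succeeds.
    """
    return (1 - p) ** (k - 1) * p
```

E.g. rolling a die until the first six: `geometric_pmf(3, 1/6)` is (5/6)² · (1/6).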

Example

image Solve: image

「Mean & Variance」 of Geometric Probability

image
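Those two facts (E[X] = 1/p, Var(X) = (1−p)/p²) are one-liners in Python (function names are mine):

```python
def geometric_mean(p: float) -> float:
    """Expected number of trials until the first success: E[X] = 1 / p."""
    return 1 / p

def geometric_variance(p: float) -> float:
    """Var(X) = (1 - p) / p**2."""
    return (1 - p) / p ** 2
```

So with p = 0.25 you expect the first success on trial 4, on average.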

「Cumulative Geometric Probability」

We know how to calculate the Geometric Probability at each value, but the Cumulative G.P. is a bit trickier.

image

The formula literally means: FAIL a TIMES IN A ROW.

This formula directly gives P(X > a), and with a bit of a twist you can derive the other cases from it, e.g. P(X ≤ a) = 1 − (1−p)ᵃ.
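In Python, the "fail a times in a row" idea and its complement look like this (a sketch; function names are mine):

```python
def geom_sf(a: int, p: float) -> float:
    """P(X > a): the first a trials all fail, i.e. (1 - p) ** a."""
    return (1 - p) ** a

def geom_cdf(a: int, p: float) -> float:
    """P(X <= a): the complement -- at least one success in the first a trials."""
    return 1 - (1 - p) ** a
```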

Example

image Solve:

Example

image Solve:

Example

image Solve: image

solomonxie commented 5 years ago

❖ Sampling Distribution

It's just taking a statistic (Mean/SD...) from many different samples of the SAME population, and making those statistics into a distribution of their own, e.g. the Distribution of means, the Distribution of standard deviations...

Refer to Khan academy: Introduction to sampling distributions

「Bias」 of Sample Statistic

Refer to Khan academy: Sample statistic bias worked example

Example: The dotplots below show an approximation to the sampling distribution for three different estimators of the same population parameter, and the actual value of the population parameter is 2. image

「Shape」 of Sample Distribution

Usually the Sampling Distribution is approximately Normally distributed, but only under these conditions:

It means that:

But under some extreme conditions it can also be skewed, e.g. when the expected number of successes in the sample is less than 10.

Example

image Solve:

solomonxie commented 5 years ago

❖ Sample Mean

Also called "Sampling distribution of Sample Mean".

「Mean & Variance」 of Sample Means

image
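The two facts in the formula (μ_x̄ = μ, σ_x̄ = σ/√n) fit in one tiny Python function (name is mine):

```python
import math

def sampling_dist_of_mean(mu: float, sigma: float, n: int) -> tuple:
    """Mean and SD of the sampling distribution of the sample mean:
    the mean stays mu, the SD shrinks to sigma / sqrt(n)."""
    return mu, sigma / math.sqrt(n)
```

E.g. IQ scores with μ = 100, σ = 15: samples of size 9 have means spread with SD 15/3 = 5.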

Example

image Solve: image

「Probability of sample mean」 exceeding a value

Refer to Khan academy: Example: Probability of sample mean exceeding a value

Example

image Solve:


solomonxie commented 5 years ago

❖ Sample Proportion

The full name is Sampling Distribution of the Sample Proportion, which is denoted by p-hat.

Refer to youtube: The Sampling Distribution of the Sample Proportion
Refer to article: The Sample Proportion
Refer to article: Sampling Distribution of the Sample Proportion, p-hat

Sample Proportion is the proportion of success in a sample.

The Sample Proportion (p-hat) is a random variable, built on a Binomial Random Variable (p̂ = X/n).

So let X denote the number of successes in the sample, which is a Binomial Random Variable with parameters n and p,

image

「Mean & Variance」 of Sample Proportions

Recall that the binomial random variable X:

Hence, we derived the Mean & Variance of Sample Proportion p-hat from X: image image

Why is that?

image

That's why we say: p-hat is an unbiased estimator for p of population.

And for Standard Deviation of Sample Proportion: image
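Those two results (μ_p̂ = p, σ_p̂ = √(p(1−p)/n)) can be sketched in Python like so (function name is mine):

```python
def sampling_dist_of_proportion(p: float, n: int) -> tuple:
    """Mean and SD of the sampling distribution of p-hat:
    mean = p (unbiased), SD = sqrt(p * (1 - p) / n)."""
    return p, (p * (1 - p) / n) ** 0.5
```

E.g. flipping a fair coin 100 times: p̂ centres on 0.5 with SD 0.05.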

Example

image Solve: image

「Probabilities」 with Sample Proportions

It's just to find out the probability area in the Normal Distribution. All you need is the Mean, Standard Deviation and the point you're to measure.

Example

image Solve:

solomonxie commented 5 years ago

❖ Standard Error (SE) [DRAFT]

Standard Error is really just short for the Standard Deviation of the Sampling Distribution.

Refer to wiki: Standard error

image

Why is it an 「Error」?

Assume the Population mean is 0, then the Standard Error is the error/distance of the Sample Mean away from the True population mean.

What if we don't know the 「Population variance」?

「Standard Error of Sample Mean」SEM

If we're to find the Standard Deviation of the Mean of a Sampling Distribution, we call it the Standard Error of the Mean (SEM).

Refer to Khan academy: Standard error of the Mean

Standard Error gives us how far the Sample Mean will deviate from the true mean.

image

「Standard Error of Sample Proportion」

solomonxie commented 5 years ago

❖ 「Central Limit Theorem」 and 「Law of Large Numbers」 [DRAFT]

Refer to article: Central Limit Theorem and Law of Large Numbers

The Central Limit Theorem is about the SHAPE of the distribution. The Law of Large Numbers tells us where the CENTRE (maximum point) of the bell is located.

「Central Limit Theorem」

One of the most fundamental & profound concepts in Statistics or even Mathematics.

Refer to youtube: Introduction to the Central Limit Theorem
Refer to Khan academy: Central limit theorem

The theorem is about the Sample Mean, saying:

The distribution of Sample Mean tends towards the Normal Distribution as the Sample Size increases, regardless of the shape of Population Distribution.

As a very rough guideline, the Sample Mean is approximately Normally distributed if the Sample Size is at least 30.
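You can watch the theorem happen with a quick stdlib simulation: start from a clearly skewed population, take many samples of size 30, and the sample means come out bell-shaped around the population mean (the numbers here are illustrative, nothing more):

```python
import random
import statistics

random.seed(42)
# A clearly skewed population (exponential), far from Normal.
population = [random.expovariate(1.0) for _ in range(100_000)]

# Take many samples of size 30 and record each sample's mean.
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(2_000)]

# The distribution of sample means centres on the population mean,
# and its spread shrinks to roughly sigma / sqrt(30).
print(statistics.mean(population), statistics.mean(sample_means))
print(statistics.pstdev(population) / 30 ** 0.5, statistics.stdev(sample_means))
```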

That being said:

How 「Normal」 is the distribution

Refer to Khan academy: Sampling distribution of the sample mean

「Skewness」

image

「Kurtosis」 Tailedness

image

『Standard Deviation』

The SD tends to be smaller and smaller as the Sample Size increases or the more times you take samples.

「Law of Large Numbers」 (LLN)

Refer to Khan academy: Law of large numbers
Refer to wiki: Law of large numbers

It is saying: The average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

image

Strong 「Law of large numbers」

Weak 「Law of large numbers」

solomonxie commented 5 years ago

❖ Confidence Interval (CI)

Since there will always be sampling error when estimating the true population, it's good practice to attach a confidence interval to any estimation made from samples.

Refer to youtube: Understanding Confidence Intervals: Statistics Help
Refer to article: Confidence Level & Margin of Error

A Confidence Interval is a "Tolerance Interval": statistically, an estimated RANGE of values that seem reasonable, which controls how accurate we want the estimation to be.

We've learnt how to estimate an exact value for Population parameters. But an exact estimate will almost never be exactly right. Hence a confidence interval gives a more reliable way to describe/guess the population.

image

「Inference」 & 「Inferential Statistics」

Inference means the conclusions we draw from the sample to describe the population.

Inferential Statistics or Statistical inference means going from describing data we already have to making inferences about data we don't have.

「Confidence Level」 Tolerance Level

The Confidence Level is a decision we make about how precise we want the guessing to be. 95% is the confidence level people most often use.

image

Width of Confidence Interval: The width of confidence interval will be affected by two things:

「Point Estimator」 Expected Value

You can literally call it the Best estimator, which is the best estimate of a population parameter.

A point estimate of a population parameter is a single value used to estimate the population parameter. For example, the sample mean x is a point estimate of the population mean μ.

Point Estimate uses sample data to calculate a single value which is to serve as a "best guess" or "best estimate" of an unknown population parameter (for example, the population mean). More formally, it is the application of a point estimator to the data to obtain a point estimate.

Point Estimate is often set to be the Sample Mean, as the Centre of Confidence Interval.

Why set the 「Point Estimator」 as 「Centre」?

Because the Point Estimate (Expected value) is our Best guess, and every value that differs from it is seen as an error. By stacking up all the errors around the "best guess" within our "Tolerance level" (Confidence level), we get a Confidence Interval.

「Sampling Error」

It's also called "Variation due to sampling".

Since the sample will NEVER perfectly represent the true population, there will always be Sampling error.

「Margin of Error」

You can literally call it Limit of confidence or Confidence Limit.

We make a decision on the confidence level we want, and we set the Sample Mean as the CENTRE of the range, which slices the interval in half:

image

image

The confidence limits (min/max) are given by this formula, which uses the Margin of Error:

image

- x = mean of the sample
- z = z-score representing the size of the confidence interval you have set, measured in units of standard deviations from the mean
- s = standard deviation of the sample
- n = number of entries in the sample

Formula for both 「Z-interval」 & 「T-interval」

image

What 「z-score」 should you use?

image

| Confidence | Z |
|---|---|
| 80% | 1.282 |
| 85% | 1.440 |
| 90% | 1.645 |
| 95% | 1.960 |
| 99% | 2.576 |
| 99.5% | 2.807 |
| 99.9% | 3.291 |
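Rather than memorizing the z-table, you can compute z* directly with the standard library (`statistics.NormalDist` exists in Python ≥ 3.8; the wrapper name is mine):

```python
from statistics import NormalDist

def z_star(confidence: float) -> float:
    """Critical z value for a two-sided interval: leaves (1 - confidence) / 2
    probability in each tail of the standard Normal."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)
```

E.g. `z_star(0.95)` gives roughly 1.960, matching the table row for 95%.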
solomonxie commented 5 years ago

~Calculating Confidence Interval~ [DEPRECATED]

Calculating the Confidence Interval basically just converts a confidence level, say 95%, into a real value range, e.g. (13kg, 28kg).

Refer to Head First Statistics: Chapter.12

There're a few ways to calculate:

We use the traditional Normal-based method more often.

Assume we've decided which population statistic we'll be estimating, and which level of confidence we need it to be, then there're 2 steps to calculate the Confidence Interval:

  1. Find the Sampling Distribution
  2. Find the Confidence Limits

Finding the Sampling Distribution

We know for constructing a Sampling Distribution we need the Mean & Variance: image

Since the population variance 𝜎² is UNKNOWN to us, we estimate it by the sample's "best estimator" (point estimator), which is the Sample Variance s² in this case. image

For the Sample Variance there are two types of formula. Here is the common one: image

here is the formula for sample proportion: image

We assume the Sampling Distribution is normally distributed.

Finding the Confidence Limits

Now we

Inverse Cumulative Normal Probability

Invert the given cumulative normal probability (Confidence Level) back to z-score.

We've learnt how to convert a percentile to z-score, and how about the Cumulative Normal Probability?

It's easy: in the graph we see that the Confidence Level is the middle part. If we cut the middle off, we get two tails, and either one tells us the percentile position.

image

solomonxie commented 5 years ago

❖ 「Z-Interval」 Z statistics

Z interval is the Confidence Interval constructed using Z-score.

▶︎ Jump back to previous note on: Z-score

Conditions for a valid 「Z Interval」

The conditions we need for inference on one proportion are:

Formula of 「Z-interval」

image

Understanding the formula for 「Margin of Error」

Remember the Standard Error is (X−μ)/Z, in which (X−μ) is the distance from the Sample to the Population, the so-called Margin of Error, which is the thing we're looking for.
So computing Z · (X−μ)/Z = (X−μ) is kind of reversing the normalization of the distance back into the real distance.

「One-sample」 Z Interval

Only take the sample once from the population.

▶︎ Practice at Khan academy: Calculating a z interval for a proportion

▶︎ Tool: Omni Online Confidence Interval Calculator

Refer to Khan academy: Critical value (z*) for a given confidence level

Here is the formula for a one-sample z interval for a sample proportion:

image

in which the margin of error is:

image
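Putting the formula together in Python: the interval is p̂ ± z·√(p̂(1−p̂)/n) (a sketch; the function name is mine):

```python
def z_interval_for_proportion(p_hat: float, n: int, z: float) -> tuple:
    """One-sample z interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    margin_of_error = z * (p_hat * (1 - p_hat) / n) ** 0.5
    return p_hat - margin_of_error, p_hat + margin_of_error
```

E.g. 50 successes out of n = 100 at 95% confidence (z = 1.96) gives roughly (0.402, 0.598).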

Example

image Solve: image

Example

image Solve: image

「Sample Size」 & 「Margin of Error」

Example

image Solve: image

Example

image Solve: image

Estimating 「Margin of Error」

Refer to Khan academy: Determining sample size based on confidence and margin of error

Example

image Solve:

「Z-table」

image

solomonxie commented 5 years ago

❖ T-Interval (t statistics)

T interval is good for situations where the sample size is small and population standard deviation is unknown.

When the sample size is very small (n ≤ 30), the Z-interval becomes a less reliable estimate of the confidence interval. That's where the T-interval comes into play.

Refer to Khan academy: Small sample size confidence intervals

「T-Distribution」

The full name is Student's t-distribution, which is a tweaked version of Normal Distribution.

Refer to wiki: Student's t-distribution

When the sample size is small, the Normal distribution is no longer a good fit for estimating the population. So we introduce a tweaked version of the Normal Distribution for small-sample data, which we call the T-distribution.

「T-distribution」 vs. 「Normal distribution」

They have the same centre: Sample Mean. But the tail of t-distribution is "fatter" than the Normal distribution.

image

Conditions for a valid 「T Interval」

The conditions we need for inference on one proportion are:

「T-score」

Refer to article: What is the T Score Formula?

A t score is one form of a Standardized Test Statistic (the other you'll come across in elementary statistics is the z-score). The t score formula enables you to take an individual score and transform it into a standardized form, one which helps you to compare scores. You'll want to use the t score formula when you don't know the population standard deviation and you have a small sample (under 30).

The t score formula is: image (x⁻ is the Sample Mean, μ₀ is mean from null hypothesis, sx is the Sample SD, n is Sample size)

Understanding the formula

The statistic − parameter gives the DISTANCE from the Sample mean to the Population mean. The Standard Error represents the DISTANCE from the Sample SD to the population SD. => Therefore, dividing the distance of the mean by the distance of the SD results in a Normalized Distance for the mean.

▶︎ Jump back to previous note on: Standard Error

Formula of 「T-interval」

The difference from the Z-interval's formula is that instead of the Z* value we use the T* value, and the calculation of the Standard Error is different too.

image

「One-sample」 T interval

image
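As a sketch in Python: the interval is x̄ ± t*·s/√n. The standard library has no t distribution, so here t* is looked up from a t-table (with n − 1 degrees of freedom) and passed in; the function name is mine:

```python
def t_interval_for_mean(x_bar: float, s: float, n: int, t_star: float) -> tuple:
    """One-sample t interval: x_bar +/- t_star * s / sqrt(n).
    t_star comes from a t-table with n - 1 degrees of freedom."""
    margin_of_error = t_star * s / n ** 0.5
    return x_bar - margin_of_error, x_bar + margin_of_error
```

E.g. x̄ = 10, s = 2, n = 16 with t* = 2.131 (df = 15, 95%) gives about (8.93, 11.07).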

Example

image Solve: image

T interval for 「paired data」

Refer to article on Khan academy: Making a t interval for paired data

「T-table」

image

solomonxie commented 5 years ago

❖ Hypothesis Testing

Hypothesis Testing means we make an assumption, or hypothesis, about something, then run a test and use the resulting statistic as evidence against the hypothesis.

We can NEVER prove the null hypothesis, because "INNOCENT UNTIL PROVEN GUILTY".

Refer to youtube: What is a Hypothesis Test and a P-Value?

image

「Null Hypothesis」 & 「Alternative Hypothesis」

Refer to youtube: Hypothesis Testing 2: null and alternative hypothesis (one sample t test)

Notations:

E.g., if the null hypothesis is "Jason's IQ is 130", then the alternative hypothesis is "his IQ is below 130".

The 「Null Hypothesis」 H₀

The null hypothesis should always contain a statement of equality. Another way of thinking of it is that the null hypothesis is a statement of "no difference." We can write the null hypothesis in the form:

The 「Alternative Hypothesis」 Ha

The alternative hypothesis could take one of three forms, depending on the context of the test:

Example

image Solve: image

「Test statistic」

Test statistic is the Normalized value for the evidence in hypothesis, which could be:

Once you get the Test statistic value in a Normal Distribution, you'll easily get the probability area, which you could compare with the threshold.

「Simple Hypothesis Testing」

Example

image Solve:

Example

image Solve:

Conditions for 「Inference on a proportion」

When we want to carry out inferences on one proportion (build a confidence interval or do a significance test), the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it's important to check whether these conditions have been met; otherwise the calculations and conclusions that follow aren't actually valid.

The conditions we need for inference on one proportion are:

solomonxie commented 5 years ago

❖ P-value (One-tailed)

p-value stands for "probability value", which is the most confusing concept in Hypothesis testing. So it's necessary to pin it down here before proceeding to Significance Testing.

Refer to youtube: Hypothesis Testing 5: p values (one sample t test)

p-value tells the MAXIMUM of the "truth" takes part in your story.

The smaller the true part (p-value) in the story, the greater the evidence against the story(null hypothesis).

For example, you said your IQ is 130. So we build a MODEL based on your claim. And then we ask you to take a real IQ test, which tells your IQ is 117. So the calculation tells that the REAL part takes at most 0.47% of your "story". Therefore, if there is less than 0.47% truth in a story, we can claim the story is a LIE! And 0.47% is the p-value.

image

Steps to calculate 「p-value」

image

p-value from 「Discrete Distribution」

Just count how many outcomes are at least as "far" from the mean as the sample data, and divide by the total number of outcomes.

image

Example

image Solve: image

Example

image Solve: image

p-value from 「Continuous Distribution」

p-value is a 「Conditional Probability」

▶︎ Jump back to previous note: Conditional Probability

image

image

solomonxie commented 5 years ago

❖ Significance Testing

Also called the Null Hypothesis Significance Testing.

We design a Significance Test to evaluate the strength of the evidence against some null hypothesis. The alternative hypothesis is the claim we are trying to find evidence in favor of.

The Significance Test involves with these concepts:

「Significance Level」 ⍺, alpha, threshold

To "judge" whether the hypothesis stands or fails, we need a standard or threshold, which we call the Significance Level, or the Cutoff, denoted ⍺ (alpha).

There are a few common sets on the significance level:

image

「Critical Values」 & 「Rejection Regions」

Refer to youtube: Hypothesis Testing 4: critical values and rejection regions (one sample t test)

image

「p-value」

The p-value tells the MAXIMUM part the "truth" takes in your story.

▶︎ Jump back to previous note: p-value

Steps of 「Significance Testing」

Refer to article on Khan academy: Using P-values to make conclusions

image

Use 「P-value」 to make conclusion

image

image

Use 「Confidence interval」 to make conclusion

Refer to Khan academy: Confidence interval for hypothesis test for difference in proportions

▶︎ Jump back to previous note on: Confidence Interval ▶︎ Jump back to previous note on: Significance Test

In a two-sided test, the null hypothesis says there is no difference between the two proportions. In other words, the null hypothesis says that the difference between the two proportions is 0.

We can use a confidence interval instead of a P-value for two-sided tests as long as the confidence level and significance level add up to 100%. image

For example, image That being said, if the Confidence Interval DOES NOT overlap with the Null Hypothesis Difference, 0 in this case, then the "true difference" falls into the Significance Level, and the null hypothesis should be rejected.

Since Confidence Level + Significance Level = 100%:

image

Example

image Solve:

solomonxie commented 5 years ago

❖ Testing Errors (Mistakes)

image

「Type 𝐈 Errors」 & 「Type 𝐈𝐈 Errors」

Type I & Type II Errors are conditional probabilities given the hypothesis is true or false.

Refer to Khan academy: Introduction to Type I and Type II errors

What will happen when we're to reject a hypothesis?

The good thing to do is to Reject the false and Not reject the truth. The bad thing (error) to do is to Reject the truth and Not reject the false, which are called the Type I Error and Type II Error.

「Truth World」 & 「Lies World」

Although both of them are mistakes made while trying to do the right thing, they happen under totally opposite conditions: one exists in the Truth World, the other in the Lies World.

Since they're "living in different worlds" and are Conditional Probabilities, the calculations are different too:

image

Example

Jump to Khan academy for practice: Type I vs Type II error

image

Solve: image

solomonxie commented 5 years ago

❖ Statistical Power

Power, in the context of Statistical Testing, stands for the conditional probability of REJECTING a FALSE null HYPOTHESIS.

"power -> power of justice -> ability to remove the bad one"

image

「Truth World」 & 「Lies World」

Note that, the Power is a Conditional Probability, on the condition of False null hypothesis.

That being said, the distribution is NOT built on the null hypothesis is true anymore, but on the null hypothesis is false.

image

How to increase 「Power」

Power is the likelihood that our sample result leads us to correctly reject a false null hypothesis.

The main purpose of studying the power is to get more chance to do the RIGHT thing.

There are two main settings that affect the power of a significance test:

image

Impact of 「Significance Level」 ⍺

The logic is:

image

Impact of 「Sample Size」

Larger sample sizes increase power.

Example

image Solve:

Example

image Solve:

solomonxie commented 5 years ago

❖ Z Test (z statistics)

Z Test is a test constructed using the Z-score

▶︎ Jump back to previous note on: Z-score ▶︎ Jump back to previous note on: Z-interval

Formula of 「Z Test Statistic for proportion」

▶︎ Jump back to previous note on: Sample Proportion

The test statistic gives us an idea of how far away our sample result is from our null hypothesis. For a one-sample z-test for a proportion, our test statistic is:

image (where p̂ is the Sample proportion, p₀ is the proportion from the null hypothesis, and n is the sample size)

Understanding the formula: The statistic − parameter gives the DISTANCE from the Sample proportion to the Population proportion. The Standard Deviation of the statistic represents the ~DISTANCE from Sample SD to population SD.~ Therefore, dividing the distance of the proportion by the SD results in a Normalized Distance for the proportion.
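In Python the test statistic and its one-tailed P-value look like this (a sketch using the stdlib `statistics.NormalDist`; function names are mine):

```python
from statistics import NormalDist

def z_test_for_proportion(p_hat: float, p0: float, n: int) -> float:
    """z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n).
    Note the standard error uses p0: the test assumes H0 is true."""
    se = (p0 * (1 - p0) / n) ** 0.5
    return (p_hat - p0) / se

def one_tailed_p_value(z: float) -> float:
    """P(Z >= z) for an upper-tailed test on the standard Normal."""
    return 1 - NormalDist().cdf(z)
```

E.g. 60 successes in n = 100 against H₀: p = 0.5 gives z = 2.0, then `one_tailed_p_value(2.0)` ≈ 0.023.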

Calculating 「Z Test」 about a proportion

Refer to Khan academy: Calculating a z statistic in a test about a proportion

Example

image Solve: image

Calculating a 「P-value」 given a z statistic

Refer to Khan academy: Calculating a P-value given a z statistic

Example

image Solve: image

Example

image Solve:

Example

image Solve:

Making conclusions in a 「z test」 for a proportion

Calculate Z-value -> Convert to P-value -> Compare with ⍺ level -> Make decision.

Example

image Solve: image

solomonxie commented 5 years ago

❖ T Test (t statistics)

Refer to article: Understanding t-Tests: t-values and t-distributions

T-tests are all based on t-values. T-values are an example of what statisticians call Test statistics. A test statistic is a standardized value that is calculated from sample data during a hypothesis test. The procedure that calculates the test statistic compares your data to what is expected under the null hypothesis.

▶︎ Jump back to previous note on: T-score

▶︎ t-distribution online calculator

Formula of 「One-sample T test」 for Mean

The test statistic gives us an idea of how far away our sample result is from our null hypothesis.

For a one-sample t test for a mean, our test statistic is: image (x̄ is the Sample Mean, μ₀ is the mean from the null hypothesis, sx is the Sample SD, n is the Sample size)

Understanding the formula: The statistic − parameter gives the DISTANCE from the Sample mean to the Population mean. The Standard Error represents the DISTANCE from the Sample SD to the population SD. Therefore, dividing the distance of the mean by the distance of the SD results in a Normalized Distance for the mean.
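The t statistic itself is a one-liner (function name mine); the P-value then comes from a t-table or calculator with n − 1 degrees of freedom, since the stdlib has no t distribution:

```python
def t_statistic(x_bar: float, mu0: float, s: float, n: int) -> float:
    """t = (x_bar - mu0) / (s / sqrt(n)): distance of the sample mean
    from the null-hypothesis mean, in standard-error units."""
    return (x_bar - mu0) / (s / n ** 0.5)
```

E.g. x̄ = 10.5 against H₀: μ = 10 with s = 2, n = 16 gives t = 0.5 / 0.5 = 1.0.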

Calculating the 「test statistic」 in a t test for a mean

Example

image Solve: image

Calculating the 「P-value in a t test」 for a mean

Example

image Solve:

Example

image Solve: image

solomonxie commented 5 years ago

❖ 「Z Test」 vs. 「T Test」 (One-sample)

Refer to Khan academy: When to use z or t statistics in significance tests

▶ Jump back to previous note on: Z Test ▶ Jump back to previous note on: T Test

Formulas

Proportion -> Z-test
Mean -> T-test

Z test for proportion: image

T test for mean: image

「Z-statistics」 vs. 「T-statistics」

Refer to Khan academy: Z-statistics vs. T-statistics
Refer to Khan academy: Small sample hypothesis test
Refer to Khan academy: Large sample proportion hypothesis testing

Large sample size -> Z-test
Small sample size -> T-test

image

solomonxie commented 5 years ago

Two-tailed Test [DRAFT]

「One-tailed」 vs. 「Two-tailed」

Refer to Khan academy: One-tailed and two-tailed tests

solomonxie commented 5 years ago

Inference for comparison

Two-sample inference for the difference between groups

Conditions for 「Two-sample Z」

"same-o same-o".

The conditions we need for inference on two proportions are:

This interval doesn't require equal sample sizes from each population. The formulas we use allow for different sample sizes.

image

solomonxie commented 5 years ago

「Confidence Interval」 for Difference

Refer to Khan academy: Confidence intervals for the difference between two proportions

image

CI for 「Difference on Proportions」

▶ Jump back to previous note on: Z-intervals

Formula

We can make a confidence interval for the difference of two proportions like this:

image

Example

image Solve:

Example

image Solve: image

CI for 「Difference on Means」

▶ Jump back to previous note on: T-intervals

Refer to Khan academy: Constructing t interval for difference of means

Formula

We can make a confidence interval for the difference of two means like this: image

Example

image Solve:

solomonxie commented 5 years ago

❖ Significant Difference Test (Proportions)

This is the "Hypothesis Test for a difference", though "significant difference test" sounds more intuitive, because it's testing for a significant difference.

Refer to Khan academy: Hypothesis test for difference in proportions

▶ Jump back to previous note on: One-sample Z Test

Reminder: One-sample Z Test image

Formula for 「Two-Sample Z Test」

Combining the proportion of successes

In this type of test, it's useful to first calculate the pooled (or combined) proportion of successes in both samples:

image

We do significance tests assuming that the null hypothesis is true. In this test, our null hypothesis is that the two population proportions are equal, but we don't have a hypothesized value for their common proportion. Our best estimate for this value is Ṕc. We'll use this pooled (or combined) value in the standard error formula where we'd ideally use each population proportion.

image

The Hypothesis difference is 0 when H₀: p1 - p2 = 0.
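The pooled-proportion recipe above can be sketched in Python like this (function name is mine):

```python
def two_sample_z(successes1: int, n1: int, successes2: int, n2: int) -> float:
    """z statistic for H0: p1 = p2, using the pooled proportion in the SE."""
    p1, p2 = successes1 / n1, successes2 / n2
    pc = (successes1 + successes2) / (n1 + n2)      # pooled proportion
    se = (pc * (1 - pc) * (1 / n1 + 1 / n2)) ** 0.5
    return (p1 - p2) / se
```

E.g. 60/100 vs. 40/100 pools to p̂c = 0.5 and gives z ≈ 2.83.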

Z-value in 「Two-sample Z Test」

Example

image Solve:

P-value in a 「Two-sample Z Test」

Example

image Solve:

solomonxie commented 5 years ago

❖ 「Confidence Interval」 for 「Hypothesis Test」

Refer to Khan academy: Confidence interval for hypothesis test for difference in proportions

▶︎ Jump back to previous note on: Confidence Interval ▶︎ Jump back to previous note on: Significance Test

In a two-sided test, the null hypothesis says there is no difference between the two proportions. In other words, the null hypothesis says that the difference between the two proportions is 0.

We can use a confidence interval instead of a P-value for two-sided tests as long as the confidence level and significance level add up to 100%. image

For example, image That being said, if the Confidence Interval DOES NOT overlap with the Null Hypothesis Difference, 0 in this case, then the "true difference" falls into the Significance Level, and the null hypothesis should be rejected.

image

Example

image Solve:

solomonxie commented 5 years ago

「Paired T Test」vs. 「Two-sample T Test」

Refer to Khan academy: Example of hypotheses for paired and two-sample t tests

Consider the design of the study:

image

Example

image Solve:

Example

image Solve:

Example

image Solve:

solomonxie commented 5 years ago

❖ Significant Difference Test (Means)

▶ Jump back to previous note on: One-sample T test

Reminder: One-sample T Test image

Formula for 「Two-sample T Test」

image

The difference μ1 − μ2 comes from the null hypothesis. In this type of test, we assume the population means are equal, μ1 = μ2, which results in μ1 − μ2 = 0.
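Under that null hypothesis the two-sample t statistic can be sketched as (function name mine; the P-value then needs a t-table with the conservative degrees of freedom):

```python
def two_sample_t(x1: float, s1: float, n1: int,
                 x2: float, s2: float, n2: int) -> float:
    """t = (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2),
    assuming H0: mu1 - mu2 = 0."""
    se = (s1 ** 2 / n1 + s2 ** 2 / n2) ** 0.5
    return (x1 - x2) / se
```

E.g. means 10 vs. 8, both with s = 2 and n = 16, give t = 2 / √0.5 ≈ 2.83.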

「T-value」 for Two-sample Test

Example

image Solve: image

「P-value」 for Two-sample Test

For the Degrees of freedom in the Two-sample Test, we use the SMALLER sample size (the conservative choice: df = smaller n − 1).

Example

image Solve:

Use 「CI」 to make conclusions about the 「difference of means」

▶︎ Jump back to previous note: Significance Testing

Normally, we can make a conclusion simply by comparing the P-value with the Significance level.

But some problems ask us to make the conclusion by comparing the Confidence level with the Significance level. In that case, we can judge it simply by examining whether the Confidence interval covers 0 or not.

Since Confidence Level + Significance Level = 100%:

Example

image Solve:

Example

image Solve: image

solomonxie commented 5 years ago

❖ 「Z Statistic」 vs. 「T Statistic」

▶ Jump back to previous note on: Z Test vs. T Test (One-sample)

image

「One-sample」

Refer to Khan academy: When to use z or t statistics in significance tests

▶ Jump back to previous note on: Z Test ▶ Jump back to previous note on: T Test

Formulas

Proportion -> Z-test
Mean -> T-test

One-sample Z test for proportion: image

One-sample T test for mean: image

「Z-statistics」 vs. 「T-statistics」

Large sample size -> Z-test
Small sample size -> T-test

「Two-sample」

▶ Jump back to previous note on: Significant Different Test (Proportions) ▶ Jump back to previous note on: Significant Different Test (Means)

Formulas

Proportion -> Z-test
Mean -> T-test

Two-sample Z test for proportion: image

Two-sample T test for mean: image

Z & T 「Confidence Intervals」

Formulas

Z-score for CI: image

T-score for CI: image

solomonxie commented 5 years ago

❖ Chi-Squared Testing [DRAFT]

Chi is written as the Greek letter 𝐗, which looks like x and reads "kai". Chi-squared is written as 𝐗².

Like Z-score in a Normal Distribution for a Z-test, T-score in a T-Distribution for a T-test, 𝐗² is the "Test statistic" in Chi-square Test, which converts sample data to a standardized value in a Chi-square Distribution.

Chi-square Test is a hypothesis test for categorical data.

Refer to wiki: Chi-squared test
Refer to Khan academy: Chi-square statistic for hypothesis testing
Refer to Crash course: Chi-Square Tests: Crash Course Statistics #29

Conditions for a goodness-of-fit test

Example

image Solve:

Formula of 「Chi-squared Test statistic」

image

How to understand the formula? It's not hard to see it's a way to standardize the data:
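The standardization can be sketched in a few lines of Python (function name mine): for each category, square the gap between observed and expected counts, scale it by the expected count, and sum.

```python
def chi_square_statistic(observed: list, expected: list) -> float:
    """X^2 = sum over categories of (observed - expected)**2 / expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

E.g. observed counts [10, 20, 30] against expected [20, 20, 20] give 𝐗² = (100 + 0 + 100) / 20 = 10.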

「Chi-squared Distribution」

image

「P-value」 for Chi-squared Test

image

「Type 1」: Chi-square Goodness-of-Fit Test

Tests how well certain proportions fit our sample, which only has ONE variable (row).

「Type 2」: Tests of Independence

Looks at whether being a member of ONE CATEGORY is independent of being a member of THE OTHER; the table has TWO variables (rows).

「Type 3」: Tests of Homogeneity

It's looking at whether it's likely that Different samples come from the Same population.

solomonxie commented 5 years ago

❖ Chi-square 「Goodness-of-fit Test」

Goodness-of-fit Test is good for testing a One-row Frequency Table. The test shows how well certain proportions fit our sample, which only has ONE variable (row).

Steps:

「Expected Frequencies」

Counting the Expected Frequency of the data is the very first and fundamental step of doing a Chi-square Test.

The expected frequencies can come either from a PRESET distribution or from the PROBABILITY of the data. The expected frequencies are set by the Null Hypothesis, and the Observed frequencies are tested against them.

"For a χ² goodness-of-fit test, the null hypothesis is that the population distribution of the categorical variable in question matches some hypothesized distribution. We use that hypothesized distribution to calculate the expected counts for each value of the variable."
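So the expected count for each category is just the sample size times the hypothesized proportion. A minimal sketch (sample size and proportions invented):

```python
def expected_counts(n, hypothesized_props):
    # Expected count per category = sample size * hypothesized proportion
    return [n * p for p in hypothesized_props]

print(expected_counts(200, [0.5, 0.3, 0.2]))  # [100.0, 60.0, 40.0]
```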

Example

image Solve:

Test statistic 「𝐗²」

Chi-squared Test statistic Formula: image

To calculate 𝐗², we need to COMPLETE the Frequency Table, with both Expected and Observed values: image

Or you can see it as: image

「P-value」

To calculate P-value we need the 𝐗² and DF: image

For instance, suppose we observe the prices of 3 fruits: apple, orange, and banana. Then there are 3 categories. Therefore the DF (degrees of freedom) is (3-1)=2.

Open an online chi-squared calculator and input the test statistic 𝐗² and the DF, and we'll get its P-value, like this:

image
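If a calculator isn't handy, the upper-tail area has a closed form when the DF is even; the sketch below implements only that special case (for odd DF, or in general, use a table or `scipy.stats.chi2.sf`):

```python
import math

def chi2_sf_even_df(x, df):
    # P(X^2 > x) for EVEN df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
    k = df // 2
    series = sum((x / 2) ** i / math.factorial(i) for i in range(k))
    return math.exp(-x / 2) * series

# e.g. test statistic X^2 = 2.0 with df = 2 (three categories)
print(round(chi2_sf_even_df(2.0, 2), 3))  # 0.368
```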

Example

image Solve:

Making conclusions in a 「goodness-of-fit Test」

To overturn the null hypothesis, we just compare the P-value with the Significance Level.

But there's another type of conclusion we can make: which component contributes the most to the test statistic. The way to do it is simply to look at each component's value: the bigger the component, the more it contributes.
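A quick sketch of that comparison (category labels and counts invented): compute each category's component of 𝐗² and pick the largest.

```python
def largest_component(labels, observed, expected):
    # Each category contributes (O - E)^2 / E to X^2;
    # the largest contribution drives the test statistic the most.
    parts = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    i = max(range(len(parts)), key=parts.__getitem__)
    return labels[i], parts[i]

print(largest_component(["A", "B", "C"], [20, 45, 35], [30, 30, 40]))  # ('B', 7.5)
```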

Example

image Solve: District B has the largest component because its observed count was farthest away from its expected count (relative to the expected count). So we can say that District B contributed the most to the 𝐗² test-statistic.

solomonxie commented 5 years ago

❖ Chi-squared Homogeneity Test

image

For the Chi-square homogeneity test we're gonna use this online calculator instead: ▶ Chi-Square Calculator

「Expected counts」

The Ratio can be either RowTotal / TableTotal or ColumnTotal / TableTotal; multiplying it by the other total gives the expected count of each cell.
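Either way, each cell's expected count works out to RowTotal × ColumnTotal / TableTotal. A small sketch with an invented 2×2 table:

```python
def expected_table(observed):
    # Expected count for cell (r, c) = row_total * column_total / grand_total
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    return [[rt * ct / grand for ct in col_totals] for rt in row_totals]

print(expected_table([[30, 20], [20, 30]]))  # [[25.0, 25.0], [25.0, 25.0]]
```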

Example

image Solve:

Test statistic 「𝐗²」

image

「P-value」

Degree of Freedom (DF) in the Chi-square Homogeneity Test would be: image
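For a two-way table the DF works out to (rows − 1) × (columns − 1); a tiny sketch:

```python
def df_two_way(rows, cols):
    # Degrees of freedom for a two-way table
    return (rows - 1) * (cols - 1)

print(df_two_way(3, 4))  # 6
```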

Example

image Solve: image

Making conclusions in chi-square tests for 「two-way tables」

Refer to the "hint" of each practice problem: Making conclusions in chi-square tests for two-way tables

Selecting appropriate hypotheses

The chi-square statistic is such a versatile tool that we can use the exact same calculations to answer very different questions with it, depending on whether we draw our data from one sample or from independent samples or groups.

Ⓐ Multiple independent Sample groups

A chi-square test can help us when we want to know whether different populations or groups are alike with regards to the distribution of a variable. Our hypotheses would look something like this:

We call this the chi-square test for Homogeneity.

etc., image

Ⓑ One Sample group

A chi-square test can help us see whether individuals from a sample who belong to a certain category are more likely than others in the sample to also belong to another category. Our hypotheses would look something like this:

We call this the chi-square test of association/independence.

etc., image

Example

image Solve:

solomonxie commented 5 years ago
solomonxie commented 5 years ago
solomonxie commented 5 years ago

Orthogonal Least Squares

image

image

solomonxie commented 5 years ago

❖ 「Inference」 on Linear Regression

Conditions for 「inference on slope」 L-I-N-E-R

The conditions can be summarized by the acronym L-I-N-E-R: Linear relationship, Independent observations, Normal residuals, Equal variance, Random sample.

「Confidence interval」 for slope

Here's the formula for estimating the slope:

image

Notice:
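Numerically, the interval is just b ± t* · SE_b, with t* read from a t-table at df = n − 2. A sketch with invented regression output (b = 2.5, SE_b = 0.6, and t* ≈ 2.048 for 95% confidence at df = 28):

```python
def slope_ci(b, se_b, t_star):
    # CI for the slope: b +/- t* * SE_b
    return (b - t_star * se_b, b + t_star * se_b)

lo, hi = slope_ci(2.5, 0.6, 2.048)
print(round(lo, 3), round(hi, 3))  # 1.271 3.729
```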

Interpreting the output of 「Inference of Slope」

image

Example

image Solve:

「T statistic」 for Slope

Here is the formula for T statistic for slope: image
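That is, t = (b − β₀) / SE_b with df = n − 2, where the null value β₀ is usually 0. A sketch (numbers invented):

```python
def t_stat_slope(b, se_b, beta0=0.0):
    # t = (b - beta0) / SE_b for H0: slope = beta0 (usually 0)
    return (b - beta0) / se_b

print(round(t_stat_slope(2.5, 0.6), 2))  # 4.17
```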

Example

image Solve: image

Use 「CI」 to make conclusions about 「slope」

▶︎ Jump back to previous note: Significance Testing

Normally, we can make a conclusion simply by comparing the P-value with the Significance level.

But some questions ask us to make a conclusion by comparing the Confidence level with the Significance level. In that case, we can judge it by simply examining whether the Confidence interval covers 0 or not.

Since Confidence Level + Significance Level = 100%:

Example

image Solve: image

solomonxie commented 5 years ago

「Z」 or 「T」? [DRAFT]

Notations

There are two ways to get the statistic (z or t):

solomonxie commented 5 years ago
solomonxie commented 5 years ago
solomonxie commented 5 years ago

Non-Linear Transformation

The goal is to transform a Non-linear relationship into a Linear relationship, which is much easier to calculate and predict.

Refer to Khan academy: Transforming nonlinear data Refer to article: Non-Linear Transformation

There are several non-linear curves that can be transformed into linear curves.

image
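For example, exponential data y = a·bˣ becomes linear after taking logs, since log y = log a + x·log b. The sketch below checks that on exact invented data (y = 3·2ˣ): consecutive differences of log y are all the same constant, log 2.

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [3 * 2 ** x for x in xs]       # exact exponential data
log_ys = [math.log(y) for y in ys]  # transform: log y = log 3 + x * log 2

# Constant first differences in log y mean the transformed data is linear in x
diffs = [round(log_ys[i + 1] - log_ys[i], 6) for i in range(len(log_ys) - 1)]
print(diffs)  # [0.693147, 0.693147, 0.693147, 0.693147] (each = log 2)
```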

Example

image Solve: image

Example

image Solve: image