Refer to wiki: Geometric distribution Refer to Khan academy: Geometric random variables introduction
Binomial setting: has a set number of trials, and the variable in question is the number of successes that occur in those trials.
Geometric setting: DOES NOT have a set number of trials, and the variable in question is the number of trials it takes to get the first success.
In both settings, the trials are independent and the probability of success remains the same on each trial.
The only difference between a Geometric R.V. and a Binomial R.V. is that the Geometric R.V. DOES NOT have a fixed number of trials.
Requirements of a Geometric R.V.:
- Same probability: the same probability of success p on each trial.
- Yes-no question: each trial's outcome is either success or failure.
- Independent: the trials are independent of each other.

The geometric distribution gives the probability that the first occurrence of success requires k independent trials, each with success probability p.
If it's asking for the Number of Trials, then Trials = failures + success = (n-1) + 1.
If it's asking for the Number of Failures, then Trials = failures + 1 = n + 1 (here n counts the failures).
And that's why the two formulas are slightly different.
Assume p is the probability of success on each trial, and n is the number of trials (or failures):

Number of Trials until the first success: P(X=n) = (1-p)ⁿ⁻¹ · p

Number of Failures before the first success: P(Y=n) = (1-p)ⁿ · p
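The two parameterizations above can be sketched in a few lines (a minimal illustration; the values of `p` and `n` are arbitrary examples):

```python
def geom_pmf_trials(n, p):
    """P(first success happens on trial n): fail (n-1) times, then succeed once."""
    return (1 - p) ** (n - 1) * p

def geom_pmf_failures(n, p):
    """P(exactly n failures occur before the first success)."""
    return (1 - p) ** n * p

# With p = 0.1: P(first success on the 3rd trial) = 0.9^2 * 0.1
print(geom_pmf_trials(3, 0.1))    # ≈ 0.081
# "2 failures before the first success" is the same event
print(geom_pmf_failures(2, 0.1))  # same value
```

The only difference really is whether the exponent counts n-1 failures or n failures.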
Solve:
We know how to calculate the Geometric Probability at each value, but the Cumulative G.P. would be a bit tricky.
The formula P(X>a) = (1-p)ᵃ literally means: FAIL a TIMES IN A ROW.
This formula directly gives X>a, and with a bit of a twist you can get most other cases out of it.
e.g.,
- P(X≥4) is the same as P(X>3)
- P(X≤5) is the same as 1 - P(X>5)
- P(X<7) is the same as 1 - P(X>6)
Solve:
P(X<5) = 1 - P(X>4) = 1 - Failure⁴ = 1-0.9^4 = 0.34
P(C≤4) = P(C=1) + P(C=2) + P(C=3) + P(C=4)
P(C≤4) = 0.9⁰*0.1 + 0.9*0.1 + 0.9²*0.1 + 0.9³*0.1
This is a Geometric Series, so we can apply the sum formula:
P(C≤4) = 0.1 * (1-0.9⁴) / (1-0.9) = 0.34
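Both routes to the same cumulative probability can be checked with a short sketch (using the same p = 0.1 as the example above):

```python
def p_greater(a, p):
    """P(X > a) for trials-until-first-success: fail a times in a row."""
    return (1 - p) ** a

def p_at_most(k, p):
    """P(X <= k) by summing the pmf P(X=1) + ... + P(X=k)."""
    return sum((1 - p) ** (n - 1) * p for n in range(1, k + 1))

p = 0.1
# P(X < 5) = 1 - P(X > 4) = 1 - 0.9^4
print(round(1 - p_greater(4, p), 2))  # 0.34
# Same answer by summing the geometric series term by term
print(round(p_at_most(4, p), 2))      # 0.34
```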
Solve:
Solve:
It's just taking a statistic (Mean/SD, etc.) from different samples of the SAME population and turning those statistics into a distribution, e.g., the Distribution of sample means, the Distribution of sample standard deviations...
Refer to Khan academy: Introduction to sampling distributions
Refer to Khan academy: Sample statistic bias worked example
Example: The dotplots below show an approximation to the sampling distribution for three different estimators of the same population parameter, and the actual value of the population parameter is 2.
Usually the Sampling Distribution is approximately Normally distributed, but only under these conditions:
- np ≥ 10, and
- n(1-p) ≥ 10.

But under some extreme conditions it can also be skewed: when the expected number of successes in the sample is less than 10.
Solve:
Also called "Sampling distribution of Sample Mean".
Solve:
Refer to Khan academy: Example: Probability of sample mean exceeding a value
Solve:
The full name is the Sampling Distribution of the Sample Proportion, which is denoted by p-hat.
Refer to youtube: The Sampling Distribution of the Sample Proportion Refer to article: The Sample Proportion Refer to article: Sampling Distribution of the Sample Proportion, p-hat
Sample Proportion
is the proportion of success in a sample.
Sample Proportion(p-hat) is a random variable, specifically a
Binomial Random Variable
.
So let X denote the number of successes in the sample; X is a Binomial Random Variable with parameters n and p.
Recall that the binomial random variable X:
Hence, we derived the Mean & Variance of Sample Proportion p-hat
from X:
Why is that?
That's why we say:
p-hat
is an unbiased estimator for p
of population.
And for Standard Deviation of Sample Proportion:
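A minimal sketch of these results, μ(p̂) = p and σ(p̂) = √(p(1-p)/n) (the values p = 0.6 and n = 100 are arbitrary examples):

```python
import math
import random

def phat_mean_sd(p, n):
    """Theoretical mean and SD of the sample proportion p-hat."""
    return p, math.sqrt(p * (1 - p) / n)

mu, sd = phat_mean_sd(0.6, 100)
print(mu)            # 0.6 -- p-hat is an unbiased estimator of p
print(round(sd, 3))  # 0.049

# Quick seeded simulation check: average of many p-hats lands near p
random.seed(42)
phats = [sum(random.random() < 0.6 for _ in range(100)) / 100
         for _ in range(2000)]
print(sum(phats) / len(phats))  # close to 0.6
```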
Solve:
It's just finding the probability area under the Normal Distribution. All you need are the Mean, the Standard Deviation, and the point you want to measure.
Solve:
0.63
0.019
0.60
0.06
Standard Error is actually a short way of saying the Standard Deviation of the Sampling Distribution.
Assume the population mean is 0; then the Standard Error is the error/distance of the Sample Mean away from the true population mean.
If we're finding the Standard Deviation of the Mean of a Sampling Distribution, we call it the Standard Error of the Mean (SEM).
Refer to Khan academy: Standard error of the Mean
Standard Error tells us how far the Sample Mean will typically deviate from the true mean.
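The SEM formula σ/√n can be sketched directly (the numbers are arbitrary examples):

```python
import math

def standard_error_of_mean(sigma, n):
    """SEM = sigma / sqrt(n): typical deviation of the sample mean
    from the true population mean."""
    return sigma / math.sqrt(n)

# e.g. population SD 15, sample size 36
print(standard_error_of_mean(15, 36))  # 2.5
```

Note how the error shrinks as n grows: quadrupling the sample size halves the SEM.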
Refer to article: Central Limit Theorem and Law of Large Numbers
The Central Limit Theorem is about the SHAPE of the distribution. The Law of Large Numbers tells us where the CENTRE (maximum point) of the bell is located.
One of the most fundamental & profound concepts in Statistics or even Mathematics.
Refer to youtube: Introduction to the Central Limit Theorem Refer to Khan academy: Central limit theorem
The theorem is about the Sample Mean
, saying:
The distribution of Sample Mean tends towards the Normal Distribution as the Sample Size increases, regardless of the shape of Population Distribution.
As a very rough guideline, the Sample Mean is approximately Normally distributed if the Sample Size is at least 30.
That being said:
- Sample Size < 30: the shape of the Sample Mean Distribution will roughly match the shape of the Population Distribution.
- Sample Size ≥ 30: the shape of the Sample Mean Distribution will be approximately Normal, regardless of the shape of the Population Distribution.

Refer to Khan academy: Sampling distribution of the sample mean
The SD of the sampling distribution tends to get smaller and smaller as the Sample Size increases.
Refer to Khan academy: Law of large numbers Refer to wiki: Law of large numbers
It is saying: The average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.
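The law can be illustrated with a seeded die-rolling simulation (a sketch; the roll counts are arbitrary):

```python
import random

random.seed(0)  # seeded so the run is reproducible

def running_average(num_rolls):
    """Average of num_rolls fair-die rolls; the expected value is 3.5."""
    return sum(random.randint(1, 6) for _ in range(num_rolls)) / num_rolls

# The more rolls, the closer the average tends to get to 3.5
for n in (10, 1000, 100000):
    print(n, round(running_average(n), 3))
```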
Since there will always be sampling error when estimating the true population, it's good practice to attach a confidence interval to any estimate made from samples.
Refer to youtube: Understanding Confidence Intervals: Statistics Help Refer to article: Confidence Level & Margin of Error
A Confidence Interval is a kind of "tolerance interval": statistically, it's an estimated RANGE of values that seem reasonable, which expresses how accurate we expect the estimate to be.
We've learnt how to estimate an exact value for population parameters, but a single exact value is almost never exactly right. A confidence interval gives a more reliable way to describe/guess the population.
Inference means the conclusions we draw from the sample to describe the population.
Inferential Statistics (or Statistical Inference) means going from describing data we already have to making inferences about data we don't have.
The Confidence Level is a decision we make about how precise we want the guess to be. 95% is the confidence level people most often use.
Width of Confidence Interval: The width of confidence interval will be affected by two things:
You can literally call it the Best estimator, which is the best estimate of a population parameter.
A point estimate of a population parameter is a single value used to estimate the population parameter. For example, the sample mean x̄ is a point estimate of the population mean μ.
Point Estimate uses sample data to calculate a single value which is to serve as a "best guess" or "best estimate" of an unknown population parameter (for example, the population mean). More formally, it is the application of a
point estimator
to the data to obtain a point estimate.
Point Estimate is often set to be the Sample Mean, as the Centre of Confidence Interval.
Because the Point Estimate (expected value) is our best guess, every value that differs from it can be seen as an error. By stacking up all the errors around the "best guess" within our tolerance (the Confidence Level), we get a Confidence Interval.
It's also called "Variation due to sampling".
Since the sample will NEVER BE PERFECT at representing the true population, there will always be sampling error.
You can literally call it the Limit of Confidence, or the Confidence Limit.
We decide on a confidence level, and we set the Sample Mean as the CENTRE of the range, which splits the interval in half:
The confidence limits (min/max) is given by this formula,
which uses the Margin of Error
:
- x̄ = mean of the sample
- z = z-score for the confidence level you have set, measured in standard deviations from the mean
- s = standard deviation of the sample
- n = number of entries in the sample
| Confidence | Z |
|---|---|
| 80% | 1.282 |
| 85% | 1.440 |
| 90% | 1.645 |
| 95% | 1.960 |
| 99% | 2.576 |
| 99.5% | 2.807 |
| 99.9% | 3.291 |
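The formula and the z-table above can be combined into a small sketch (the mean, SD, and sample size are made up for illustration):

```python
import math

# Common confidence-level -> z* pairs from the table above
Z_STAR = {0.80: 1.282, 0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

def confidence_interval(mean, s, n, level=0.95):
    """mean ± z* · s/sqrt(n), the Normal-based confidence interval."""
    margin = Z_STAR[level] * s / math.sqrt(n)
    return mean - margin, mean + margin

lo, hi = confidence_interval(mean=20.5, s=5.0, n=100, level=0.95)
print(round(lo, 2), round(hi, 2))  # 19.52 21.48
```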
Calculating the Confidence Interval is basically just converting a confidence level, say 95%, into a real value range, e.g., (13kg, 28kg).
Refer to Head First Statistics: Chapter.12
There're a few ways to calculate:
We use the traditional Normal-based method more often.
Assume we've decided which population statistic we'll be estimating, and which level of confidence we need it to be, then there're 2 steps to calculate the Confidence Interval:
We know for constructing a Sampling Distribution we need the Mean & Variance:
Since the population variance 𝜎² is UNKNOWN to us, we estimate it with the sample's "best estimator" (point estimator), which in this case is the Sample Variance s².
s² is the Sample Variance, and there are two types of formula. Here is the common formula:
And here is the formula for a sample proportion:
We assume the Sampling Distribution is normally distributed.
Now we invert the given cumulative normal probability (the Confidence Level) back to a z-score.
We've learnt how to convert a percentile to a z-score, but how about a Cumulative Normal Probability?
It's easy: in the graph we see that the Confidence Level is the middle part. If we cut off the middle, we get two tails, and either one tells us the percentile position.
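That tail-cutting step can be sketched with Python's stdlib `statistics.NormalDist`:

```python
from statistics import NormalDist

def z_star(confidence_level):
    """Cut the middle (the confidence level) out of the standard normal,
    then invert the upper cut point back to a z-score."""
    tail = (1 - confidence_level) / 2      # area left in one tail
    return NormalDist().inv_cdf(1 - tail)  # z at the upper cut point

print(round(z_star(0.95), 3))  # 1.96
print(round(z_star(0.99), 3))  # 2.576
```

These match the z-table values above.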
Z interval is the Confidence Interval constructed using Z-score
.
▶︎ Jump back to previous note on: Z-score
The conditions we need for inference on one proportion are:
Remember that the Standard Error is (X-μ)/Z, in which (X-μ) is the distance from the Sample to the Population, the so-called Margin of Error, which is the thing we're looking for.
So computing Z · (X-μ)/Z = (X-μ) is kind of reversing the normalization of the distance back into the real distance.
Only take the sample once from the population.
▶︎ Practice at Khan academy: Calculating a z interval for a proportion
▶︎ Tool: Omni Online Confidence Interval Calculator
Refer to Khan academy: Critical value (z*) for a given confidence level
Here is the formula for a one-sample z interval for a sample proportion
:
in which the margin of error is:
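A sketch of that interval, p̂ ± z*·√(p̂(1-p̂)/n) (the values p̂ = 0.5 and n = 100 are arbitrary examples):

```python
import math

def z_interval_proportion(p_hat, n, z_star=1.96):
    """One-sample z interval: p-hat ± z* · sqrt(p-hat(1-p-hat)/n)."""
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

lo, hi = z_interval_proportion(0.5, 100)
print(round(lo, 3), round(hi, 3))  # 0.402 0.598
```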
Solve:
Solve:
Solve:
Solve:
Refer to Khan academy: Determining sample size based on confidence and margin of error
Solve:
- For the worst case (widest interval), we want p(1-p) to be largest.
- The margin of error depends on p only through p(1-p).
- Take the derivative of p(1-p) and set it to 0 to get the maximum: p = 0.5.
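Using the worst case p = 0.5, solving the margin-of-error formula z*·√(p(1-p)/n) ≤ ME for n gives the minimum sample size. A sketch (the 3% margin and 95% level are example choices):

```python
import math

def min_sample_size(margin_of_error, z_star=1.96, p=0.5):
    """Smallest n such that z* · sqrt(p(1-p)/n) <= margin_of_error.
    p = 0.5 maximizes p(1-p), so it is the safe worst-case choice."""
    n = (z_star ** 2) * p * (1 - p) / margin_of_error ** 2
    return math.ceil(n)  # round UP so the margin requirement still holds

# e.g. 95% confidence with a margin of error of at most 3 points
print(min_sample_size(0.03))  # 1068
```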
T interval
is good for situations where the sample size is small and population standard deviation is unknown.
When the sample size is very small (n ≤ 30), the Z-interval becomes a less reliable way to calculate the confidence interval. That's where the T-interval comes into play.
Refer to Khan academy: Small sample size confidence intervals
The full name is
Student's t-distribution
, which is a tweaked version of Normal Distribution.
Refer to wiki: Student's t-distribution
When the sample size is small, the Normal distribution is no longer a good fit for estimating the population.
So we introduce a tweaked version of the Normal Distribution for small-sample sampling data, which we call the T-distribution.
They have the same centre (the Sample Mean), but the tails of the t-distribution are "fatter" than the Normal distribution's.
The conditions we need for inference on one proportion are:
Refer to article: What is the T Score Formula?
A t score is one form of a Standardized Test Statistic (the other you'll come across in elementary statistics is the z-score).
The t score formula enables you to take an individual score and transform it into a standardized form, one which helps you to compare scores.
You'll want to use the t score formula when you don't know the population standard deviation and you have a small sample (under 30).
The t score formula is: (x⁻ is the Sample Mean, μ₀ is mean from null hypothesis, sx is the Sample SD, n is Sample size)
The statistic - parameter part gives the DISTANCE from the Sample mean to the Population mean.
The Standard Error measures the typical size of that distance.
Therefore, dividing the distance of the mean by the Standard Error results in a Normalized Distance for the mean.
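A sketch of the t score formula t = (x̄ - μ₀)/(sₓ/√n) (all numbers are made-up examples):

```python
import math

def t_score(x_bar, mu0, s_x, n):
    """t = (x̄ - μ₀) / (s_x / sqrt(n)): the distance of the sample mean
    from the hypothesized mean, in standard-error units."""
    return (x_bar - mu0) / (s_x / math.sqrt(n))

# e.g. sample mean 118, hypothesized mean 120, sample SD 5, n = 25
print(t_score(118, 120, 5, 25))  # -2.0
```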
▶︎ Jump back to previous note on: Standard Error
The difference from the Z-interval's formula is that instead of using the Z* value we use the T* value, and the calculation of the Standard Error is different too.
Solve:
Refer to article on Khan academy: Making a t interval for paired data
Hypothesis Testing means we make an assumption (a hypothesis) about something, then run a test and gather statistics as evidence against the hypothesis.
We can NEVER prove the null hypothesis, because "INNOCENT UNTIL PROVEN GUILTY".
Refer to youtube: What is a Hypothesis Test and a P-Value?
Refer to youtube: Hypothesis Testing 2: null and alternative hypothesis (one sample t test)
Notations:
- H₀ (read "H-naught"): the Null Hypothesis, the assumption we claim as our default position.
- Ha (read "H-alternative"): the Alternative Hypothesis, the opposition to the null hypothesis.

e.g., if the null hypothesis is "Jason's IQ is 130", then the alternative hypothesis is "his IQ is below 130".
The null hypothesis should always contain a statement of equality. Another way of thinking of it is that the null hypothesis is a statement of "no difference." We can write the null hypothesis in the form:
H₀: parameter = value
The alternative hypothesis could take one of three forms, depending on the context of the test:
Ha: parameter > value
Ha: parameter < value
Ha: parameter ≠ value
Solve:
Test statistic is the Normalized value for the evidence in hypothesis, which could be:
Once you get the Test statistic value in a Normal Distribution, you'll easily get the probability area, which you could compare with the threshold.
Solve:
Solve:
40%^3 = 6.4%
When we want to carry out inferences on one proportion (build a confidence interval or do a significance test), the accuracy of our methods depends on a few conditions. Before doing the actual computations of the interval or test, it's important to check whether or not these conditions have been met; otherwise the calculations and conclusions that follow aren't actually valid.
The conditions we need for inference on one proportion are:
p-value stands for "probability value", which is the most confusing concept in hypothesis testing. So it's necessary to pick it apart here before proceeding to Significance Testing.
Refer to youtube: Hypothesis Testing 5: p values (one sample t test)
The p-value tells the MAXIMUM share of "truth" in your story.
The smaller the true part (p-value) of the story, the greater the evidence against the story (the null hypothesis).
For example, you said your IQ is 130, so we build a MODEL based on your claim. Then we ask you to take a real IQ test, which says your IQ is 117. The calculation tells us that the REAL part takes up at most 0.47% of your "story". Therefore, if there is only 0.47% truth in a story, we can claim the story is a LIE. And that 0.47% is the p-value.
- If the alternative hypothesis is two-sided, take the left-tail or right-tail proportion, then add the two tails together.
- The result is the p-value: the utmost probability of the "true story" under the hypothesis.
- In a simulation, just count how many outcomes are "further" from the mean than the sample data, divided by the total number of outcomes.
Solve:
Solve:
Also called the
Null Hypothesis Significance Testing
.
We design a Significance Test
to evaluate the strength of the evidence against some null hypothesis.
The alternative hypothesis is the claim we are trying to find evidence in favor of.
The Significance Test involves these concepts:
To "judge" whether the hypothesis stands or fails, we need a standard or threshold, which we call the Significance Level, or Cutoff, denoted ⍺ (alpha).
There are a few common sets on the significance level:
Refer to youtube: Hypothesis Testing 4: critical values and rejection regions (one sample t test)
p-value tells the MAXIMUM of the "truth" takes part in your story.
▶︎ Jump back to previous note: p-value
Refer to article on Khan academy: Using P-values to make conclusions
Refer to Khan academy: Confidence interval for hypothesis test for difference in proportions
▶︎ Jump back to previous note on: Confidence Interval
▶︎ Jump back to previous note on: Significance Test
In a two-sided test, the null hypothesis says there is no difference between the two proportions. In other words, the null hypothesis says that the difference between the two proportions is 0.
We can use a confidence interval instead of a P-value for two-sided tests as long as the confidence level and significance level add up to 100%.
For example,
That being said, if the Confidence Interval DOES NOT overlap the null-hypothesis difference (0 in this case), then the "true difference" falls into the rejection region, and we should reject the null.
Since Confidence Level + Significance Level = 100%:
Solve:
0.09±0.086 = (0.004, 0.176) does not contain 0, so we reject the null hypothesis.

Type I & Type II Errors are conditional probabilities, given the null hypothesis is true or false.
Refer to Khan academy: Introduction to Type I and Type II errors
What happens when we decide to reject or not reject a hypothesis? The possible outcomes are: Type I Error, Power, and Type II Error.
The good things to do are to Reject the false and to Not reject the true.
The bad things (errors) are to Reject the true and to Not reject the false, which are called the Type I Error and Type II Error respectively.
Although both of them are mistakes, they happen under totally opposite conditions: one exists in the "Truth World", the other in the "Lies World":
- A Type I Error lives in a "world" where the null hypothesis is actually true.
- A Type II Error lives in a "world" where the null hypothesis is actually false.

Since they're "living in different worlds" and they're Conditional Probabilities, the calculations are different too:
Jump to Khan academy for practice: Type I vs Type II error
Solve:
Power, in the context of statistical testing, is the conditional probability of REJECTING a FALSE null hypothesis.
"power -> power of justice -> ability to remove the bad one"
Note that, the Power is a Conditional Probability
, on the condition of False null hypothesis
.
That being said, the distribution is NOT built on "the null hypothesis is true" anymore, but on "the null hypothesis is false".
Power is the likelihood that our sample result leads us to correctly reject a false null hypothesis.
The main purpose of studying the power is to get more chance to do the RIGHT thing.
There are two main settings that affect the power of a significance test:
The logic is:
Larger sample sizes increase power.
Solve:
Solve:
Z Test is a test constructed using the
Z-score
▶︎ Jump back to previous note on: Z-score
▶︎ Jump back to previous note on: Z-interval
▶︎ Jump back to previous note on: Sample Proportion
The test statistic gives us an idea of how far away our sample result is from our null hypothesis. For a one-sample z-test for a proportion, our test statistic is:
(which p^ is the Sample proportion
, p₀ is the proportion from null hypothesis, n is sample size)
Understanding the formula: the statistic - parameter part gives the DISTANCE from the Sample proportion to the Population proportion. Dividing that distance by the Standard Deviation of the statistic results in a Normalized Distance for the proportion.
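A sketch of this test statistic, z = (p̂ - p₀)/√(p₀(1-p₀)/n) (the sample values are invented for illustration):

```python
import math

def z_stat_proportion(p_hat, p0, n):
    """One-sample z test for a proportion:
    z = (p-hat - p0) / sqrt(p0(1-p0)/n)."""
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# e.g. p-hat = 0.55 in a sample of 100 when H0 says p0 = 0.5
print(round(z_stat_proportion(0.55, 0.5, 100), 3))  # 1.0
```

Note that the standard error uses p₀ from the null hypothesis, not p̂, because the test is carried out assuming the null is true.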
Refer to Khan academy: Calculating a z statistic in a test about a proportion
Solve:
Refer to Khan academy: Calculating a P-value given a z statistic
Solve:
Solve:
Use the standard normal distribution, with 0 at the centre and an SD of 1.
Solve:
- The left-tail proportion is 0.0668072.
- Ha is a ≠ hypothesis, so we need BOTH tails' proportions.
- z = -1.5 means the point is in the left tail, so we just multiply the proportion by 2.
- The p-value then is 0.0668072 × 2 ≈ 0.134.
Calculate Z-value -> Convert to P-value -> Compare with ⍺ level -> Make decision.
Solve:
Refer to article: Understanding t-Tests: t-values and t-distributions
T-tests are all based on t-values
.
T-values are an example of what statisticians call Test statistics
. A test statistic is a standardized value that is calculated from sample data during a hypothesis test.
The procedure that calculates the test statistic compares your data to what is expected under the null hypothesis.
▶︎ Jump back to previous note on: T-score
▶︎ t-distribution online calculator
The test statistic gives us an idea of how far away our sample result is from our null hypothesis.
For a one-sample t test for a mean, our test statistics is: (x⁻ is the Sample Mean, μ₀ is mean from null hypothesis, sx is the Sample SD, n is Sample size)
Understanding the formula: the statistic - parameter part gives the DISTANCE from the Sample mean to the Population mean, and the Standard Error measures the typical size of that distance. Therefore, dividing the distance of the mean by the Standard Error results in a Normalized Distance for the mean.
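A sketch of the one-sample t statistic on a small made-up sample (H₀: μ₀ = 10 is an example choice):

```python
import math
from statistics import mean, stdev

def one_sample_t(data, mu0):
    """t = (x̄ - μ₀) / (s_x / sqrt(n)), with df = n - 1."""
    n = len(data)
    t = (mean(data) - mu0) / (stdev(data) / math.sqrt(n))
    return t, n - 1

t, df = one_sample_t([12, 9, 11, 10, 13], 10)
print(df)           # 4
print(round(t, 3))  # 1.414
```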
Solve:
Solve:
df=10 & t=1.368
Solve:
Refer to Khan academy: When to use z or t statistics in significance tests
▶ Jump back to previous note on: Z Test
▶ Jump back to previous note on: T Test
Proportion -> Z-test
Mean -> T-test
Z test for proportion:
T test for mean:
Refer to Khan academy: Z-statistics vs. T-statistics Refer to Khan academy: Small sample hypothesis test Refer to Khan academy: Large sample proportion hypothesis testing
Large sample size -> Z-test
Small sample size -> T-test
Two-sample inference for the difference between groups
"same-o same-o".
The conditions we need for inference on two proportions are:
This interval doesn't require equal sample sizes from each population. The formulas we use allow for different sample sizes.
Refer to Khan academy: Confidence intervals for the difference between two proportions
▶ Jump back to previous note on: Z-intervals
We can make a confidence interval for the difference of two proportions like this:
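A sketch of the two-proportion interval, (p̂₁ - p̂₂) ± z*·√(p̂₁(1-p̂₁)/n₁ + p̂₂(1-p̂₂)/n₂) (sample values invented for illustration):

```python
import math

def two_prop_interval(p1, n1, p2, n2, z_star=1.96):
    """CI for p1 - p2: each sample contributes its own variance term,
    so equal sample sizes are not required."""
    diff = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - z_star * se, diff + z_star * se

lo, hi = two_prop_interval(0.45, 100, 0.40, 100)
print(round(lo, 3), round(hi, 3))  # -0.087 0.187
```

Since this interval contains 0, these example samples would not show a significant difference at the 5% level.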
Solve:
Use the standard normal with μ=0 & 𝜎=1.
Input μ=0, 𝜎=1 and tail area X = (1-0.99)/2 = 0.005 into a Z-score calculator, and get Z* = 2.576.
Solve:
▶ Jump back to previous note on: T-intervals
Refer to Khan academy: Constructing t interval for difference of means
We can make a confidence interval for the difference of two means like this:
Solve:
To get the t* value, input the confidence level and degrees of freedom into a calculator, getting t* = 3.355:
"Hypothesis Test for a difference" also goes by the more intuitive name "significant difference test", because it's testing for a significant difference.
Refer to Khan academy: Hypothesis test for difference in proportions
▶ Jump back to previous note on: One-sample Z Test
Reminder: One-sample Z Test
Combining the proportions of successes: in this type of test, it's useful to first calculate the pooled (or combined) proportion of successes in both samples:
We do significance tests assuming that the null hypothesis is true. In this test, our null hypothesis is that the two population proportions are equal, but we don't have a hypothesized value for their common proportion. Our best estimate for this value is the pooled proportion p̂c. We use this pooled (or combined) value in the standard error formula where we'd ideally use each population proportion.
The hypothesized difference is 0, since H₀: p1 - p2 = 0.
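A sketch of the pooled two-sample z statistic (the success counts are made-up examples):

```python
import math

def pooled_z(x1, n1, x2, n2):
    """Two-sample z test for H0: p1 - p2 = 0 using the pooled proportion."""
    p_c = (x1 + x2) / (n1 + n2)  # pooled proportion of successes
    se = math.sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))
    return ((x1 / n1) - (x2 / n2)) / se

# hypothetical counts: 60/100 successes vs 45/100 successes
print(round(pooled_z(60, 100, 45, 100), 3))  # 2.124
```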
Solve:
Solve:
Convert to a z-score:
Ha: p1 ≠ p2, so we add up both tails' probabilities:
Refer to Khan academy: Confidence interval for hypothesis test for difference in proportions
▶︎ Jump back to previous note on: Confidence Interval
▶︎ Jump back to previous note on: Significance Test
In a two-sided test, the null hypothesis says there is no difference between the two proportions. In other words, the null hypothesis says that the difference between the two proportions is 0.
We can use a confidence interval instead of a P-value for two-sided tests as long as the confidence level and significance level add up to 100%.
For example,
That being said, if the Confidence Interval DOES NOT overlap the null-hypothesis difference (0 in this case), then the "true difference" falls into the rejection region, and we should reject the null.
Solve:
0.09±0.086 = (0.004, 0.176) does not contain 0, so we reject the null hypothesis.
Refer to Khan academy: Example of hypotheses for paired and two-sample t tests
Consider the design of the study:
Solve:
- It's a Two-sample Test in this case.
- For the Alternative hypothesis, it's:
Solve:
Solve:
▶ Jump back to previous note on: One-sample T test
Reminder: One-sample T Test
The difference μ1 - μ2 comes from the null hypothesis. In this type of test, we assume the population means are equal, μ1 = μ2, which results in μ1 - μ2 = 0.
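A sketch of the two-sample t statistic, using the conservative degrees of freedom (smaller n minus 1, as described below; all numbers are made-up examples):

```python
import math

def two_sample_t(x1, s1, n1, x2, s2, n2):
    """t = ((x̄1 - x̄2) - 0) / sqrt(s1²/n1 + s2²/n2),
    with the conservative df = min(n1, n2) - 1."""
    t = (x1 - x2) / math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    df = min(n1, n2) - 1
    return t, df

t, df = two_sample_t(10.0, 2.0, 50, 9.0, 2.0, 50)
print(df)           # 49
print(round(t, 2))  # 2.5
```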
Solve:
For the Degree of freedom in the Two-sample Test, we're gonna use the SMALLER sample size.
Solve:
- t = 2.12621542
- For df, we use the smaller sample size: 46 - 1 = 45.
- Ha: μ1 ≠ μ2, so we calculate both tails:
▶︎ Jump back to previous note: Significance Testing
Normally, we can make a conclusion simply by comparing the P-value with the Significance Level.
But some problems ask us to make the conclusion by comparing the Confidence Level with the Significance Level.
In that case, we can judge it simply by examining whether the Confidence Interval covers 0 or not.
Since Confidence Level + Significance Level = 100%:
Solve:
P-value > ⍺ means there's not sufficient evidence against the null hypothesis.
Solve:
▶ Jump back to previous note on: Z Test vs. T Test (One-sample)
Refer to Khan academy: When to use z or t statistics in significance tests
▶ Jump back to previous note on: Z Test
▶ Jump back to previous note on: T Test
Proportion -> Z-test
Mean -> T-test
One-sample Z test for proportion:
One-sample T test for mean:
Large sample size -> Z-test
Small sample size -> T-test
▶ Jump back to previous note on: Significant Different Test (Proportions)
▶ Jump back to previous note on: Significant Different Test (Means)
Proportion -> Z-test
Mean -> T-test
Two-sample Z test for proportion:
Two-sample T test for mean:
Z-score for CI:
T-score for CI:
Chi is written as the Greek letter 𝐗 (looks like x, reads "kai"). Chi-squared is written as 𝐗².
Like Z-score in a Normal Distribution for a Z-test, T-score in a T-Distribution for a T-test, 𝐗² is the "Test statistic" in Chi-square Test, which converts sample data to a standardized value in a Chi-square Distribution.
Chi-square Test is a hypothesis test for categorical data.
Refer to wiki: Chi-squared test Refer to Khan academy: Chi-square statistic for hypothesis testing Refer to Crash course: Chi-Square Tests: Crash Course Statistics #29
Solve:
How to understand the formula? It's not hard to see it as a way of standardizing the data:
- Observed - Expected gets the distance,
- (…)² eliminates negative results,
- ÷ Expected weights the data, so the value fits a standardized distribution, similar to the concepts of a Unit Circle or Unit Vector.

The three chi-square tests:
- Goodness-of-fit: tests how well certain proportions fit our sample, which has only ONE variable (row).
- Independence/association: looks at whether being a member of ONE category is independent of THE OTHER, with TWO variables (rows).
- Homogeneity: looks at whether it's likely that different samples come from the same population.
Goodness-of-fit Test is good for testing a
One-row Frequency Table
. The test shows how well certain proportions fit our sample, which only has ONE variable(row).
Steps:
1. Count the Expected Counts for each category.
2. Calculate the Chi-square value 𝐗² from the Expected & Observed counts.
3. Find the P-value in the distribution according to 𝐗².

Counting the Expected Frequency for the data is the very first step and a fundamental part of doing a Chi-square Test.
The expected frequencies can be either a PRESET or the PROBABILITY of the data. The expected frequencies are set as Null Hypothesis in the test, and Observed frequencies are the Alternative Hypothesis against the null in the test.
"For a χ² goodness-of-fit test, the null hypothesis is that the population distribution of the categorical variable in question matches some hypothesized distribution. We use that hypothesized distribution to calculate the expected counts for each value of the variable."
Solve:
350/4 = 87.5 is the Expected count for each feeder.
for each feeder.Chi-squared Test statistic Formula:
To calculate 𝐗², we need to COMPLETE the Frequency Table, with both Expected and Observed values:
Or you can see it as:
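A sketch of the 𝐗² computation: the expected count of 87.5 matches the feeder example above, while the observed counts are invented for illustration:

```python
def chi_square_stat(observed, expected):
    """Chi-square statistic: sum of (Observed - Expected)² / Expected."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# hypothetical observed visits at 4 feeders (total 350, expected 87.5 each)
observed = [105, 83, 70, 92]
expected = [87.5, 87.5, 87.5, 87.5]
print(round(chi_square_stat(observed, expected), 2))  # 7.46
```

With df = 4 - 1 = 3, this statistic would then be converted to a P-value with a chi-square calculator.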
To calculate P-value
we need the 𝐗² and DF:
For instance, it observes 3 prices for a fruit: prices of apple, orange, banana. Then there are 3 categories, or 3 variables. Therefore the DF (Degree of freedom) is (3-1)=2
Get an online chi-squared calculator
, input the test statistic 𝐗² and DF, we'll get its P-value, like this:
Solve:
Input 𝐗² = 10.5 and degrees of freedom df = 4-1 = 3:
To overturn the null hypothesis, we just compare the P-value with the Significance Level.
But there's another type of conclusion we can make: which component contributes the most to the test statistic. The way to do it is simply to look at each component's value: the bigger the component, the more it contributes.
Solve: District B has the largest component because its observed count was farthest away from its expected count (relative to the expected count). So we can say that District B contributed the most to the 𝐗² test-statistic.
For the Chi-square homogeneity test we're gonna use this online calculator instead:
▶ Chi-Square Calculator
The Ratio can be either RowTotal / TableTotal or ColumnTotal / TableTotal.
Solve:
- The ratio is 65/262 in this case.
- The group total is 90 in this case, so the expected count is (65/262) × 90.
Degree of Freedom (DF) in the Chi-square Homogeneity Test
would be:
Solve:
Selecting appropriate hypotheses The chi-square statistic is such a versatile tool that we can use the exact same calculations to answer very different questions with it, depending on whether we draw our data from one sample or from independent samples or groups.
Ⓐ Multiple independent Sample groups A chi-square test can help us when we want to know whether different populations or groups are alike with regards to the distribution of a variable. Our hypotheses would look something like this:
We call this the chi-square test for Homogeneity
e.g.,
Ⓑ One Sample group A chi-square test can help us see whether individuals from a sample who belong to a certain category are more likely than others in the sample to also belong to another category. Our hypotheses would look something like this:
We call this the chi-square test of association/independence
.
e.g.,
Solve:
The P-value > ɑ, so we fail to reject H₀.

The conditions can be concluded as L-I-N-E-R:
- L: Linear condition (there is a linear relationship between x & y)
- I: Independent condition (individual observations, with replacement or the 10% rule)
- N: Normal condition (sample size is at least 30)
- E: Equal variance condition
- R: Random condition

Here's the formula for estimating the slope:
Notice:
- It's a T-interval for estimating the slope.
- The degrees of freedom are n-2.
Solve:
Here is the formula for T statistic for slope:
Solve:
▶︎ Jump back to previous note: Significance Testing
Normally, we can make a conclusion simply by comparing the P-value with the Significance Level.
But some problems ask us to make the conclusion by comparing the Confidence Level with the Significance Level.
In that case, we can judge it simply by examining whether the Confidence Interval covers 0 or not.
Since Confidence Level + Significance Level = 100%:
Solve:
Notation: z* or t* (also written z-value / t-value) denotes the critical value used for a confidence interval; z or t denotes the test-statistic value used in a significance test.

There are two ways to get the statistic (z or t):
- From a Confidence Level: z = Ztable( (1-Confidence level)/2 ), or t = Ttable( (1-Confidence level)/2, DF )
- From the data: (Observed - Expected) / SE
The goal is to transform Non-linear relationship to Linear relationship, which is much easier to calculate and predict.
Refer to Khan academy: Transforming nonlinear data Refer to article: Non-Linear Transformation
There are several non-linear curves that can be transformed into linear curves.
Solve:
Solve:
Study Resources & Tools:
- Khan academy AP Statistics
- Course Challenge
- ◆ Machine Learning related topics