Open oldoc63 opened 1 year ago
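Let's return to the quiz question example from the previous exercise: we want to remove our quiz question from our website if the probability of a correct response is different from 70%. Suppose we collected data from 100 learners and ran a binomial hypothesis test with the following null and alternative hypotheses: the null hypothesis is that the probability of a learner answering the question correctly is 70%, and the alternative hypothesis is that this probability is not 70%.
Assuming that we set a significance threshold of 0.05 for this test, a p-value below 0.05 is considered significant and leads us to remove the question, while a p-value above 0.05 is considered not significant and leads us to keep it.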
Whenever we run a hypothesis test using a significance threshold, we expose ourselves to making two different kinds of mistakes: type I errors (false positives) and type II errors (false negatives):
| Null hypothesis | Is true | Is false |
|---|---|---|
| P-value significant | Type I error | Correct! |
| P-value not significant | Correct! | Type II error |
Consider the quiz question hypothesis test described in the previous exercises:
Suppose, for a moment, that the true probability of a learner answering the question correctly is 70% (if we showed the question to all learners, exactly 70% would answer it correctly). This puts us in the first column of the table above (the null hypothesis is true). If we run a test and calculate a significant p-value, we will make a type I error (also called a false positive because the p-value is falsely significant), leading us to remove the question when we don't need to.
On the other hand, if the true probability of getting the question correct is not 70%, the null hypothesis is false (the right-most column of our table). If we run a test and calculate a non-significant p-value, we make a type II error, leading us to leave the question on our site when we should have taken it down.
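The four cells of the table can be written as a small lookup. The sketch below is only illustrative: the function name `classify_outcome` and its parameters are hypothetical, and it assumes we somehow know whether the null hypothesis is true, which in real research we never do. It is useful mainly for thought experiments and simulations.

```python
def classify_outcome(null_is_true, p_value, threshold=0.05):
    """Label a test result using the type I / type II error table.

    null_is_true: whether the null hypothesis actually holds (unknown in practice).
    p_value: the p-value produced by the hypothesis test.
    threshold: the significance threshold (0.05 is assumed as the default here).
    """
    significant = p_value < threshold
    if null_is_true and significant:
        return "type I error"    # false positive
    if not null_is_true and not significant:
        return "type II error"   # false negative
    return "correct"


# Null is true but the p-value is significant: a false positive.
print(classify_outcome(null_is_true=True, p_value=0.03))
# Null is false but the p-value is not significant: a false negative.
print(classify_outcome(null_is_true=False, p_value=0.20))
```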
Suppose that the truth (which the researcher doesn't know) is: if every student took the test in an ergonomic chair, the average score for all test-takers would be 52 points.
Based on their sample of only 100 students, the researcher calculates a p-value of 0.07.
Set the right value for the outcome.
It turns out that, when we run a hypothesis test with a significance threshold, the significance threshold is equal to the type I error rate (the false positive rate) for the test. To see this, we can use a simulation.
Recall our quiz question example: the null hypothesis is that the probability of getting a quiz question correct is equal to 70%. We'll make a type I error if the null hypothesis is correct (the true probability of a correct answer is 70%), but we get a significant p-value anyway.
Now, consider the following simulation code:
This code does the following:
Note that the proportion of false positive tests is very similar to the value of the significance threshold (0.05).
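The lesson's original simulation code is not reproduced here. As a stand-in, the following is a minimal sketch of this kind of simulation using only the Python standard library. The function names, the default of 1000 simulated tests, and the fixed seed are all assumptions made for illustration. The two-sided p-value is computed by summing the probabilities of every outcome that is no more likely than the observed one, which is one common convention for a two-sided binomial test; in practice a library routine such as SciPy's `binomtest` would normally be used instead.

```python
import random
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_values(n, p_null):
    """Return a list whose entry k is the two-sided p-value for observing k successes."""
    pmf = [binom_pmf(k, n, p_null) for k in range(n + 1)]
    # Sum the probabilities of all outcomes no more likely than the observed one.
    # The small relative tolerance guards against floating-point round-off.
    return [
        min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-7)))
        for k in range(n + 1)
    ]


def simulate_false_positive_rate(num_tests=1000, num_learners=100,
                                 p_true=0.7, p_null=0.7,
                                 threshold=0.05, seed=0):
    """Simulate many tests where the null hypothesis is true and count significant results."""
    rng = random.Random(seed)
    p_values = two_sided_p_values(num_learners, p_null)
    false_positives = 0
    for _ in range(num_tests):
        # Each simulated learner answers correctly with probability p_true.
        num_correct = sum(rng.random() < p_true for _ in range(num_learners))
        if p_values[num_correct] < threshold:
            false_positives += 1
    return false_positives / num_tests


rate = simulate_false_positive_rate()
print(rate)
```

Because the number of correct answers is a whole number, the set of possible p-values is discrete, so the simulated rate may sit somewhat below the threshold rather than matching it exactly.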
While significance thresholds allow a data scientist to control the false positive rate for a single hypothesis test, this control starts to break down when performing multiple tests as part of a single study.
For example, suppose that we are writing a quiz at Codecademy that is going to include 10 questions. For each question, we want to know whether the probability of a learner answering the question correctly is different from 70%. We now have to run 10 hypothesis tests, one for each question.
If the null hypothesis is true for every hypothesis test (the probability of a correct answer is 70% for every question) and we use a 0.05 significance level for each test, then each test on its own has a 5% chance of a type I error, but the chance of making at least one type I error across the 10 tests is considerably higher (about 40% if the tests are independent).
To address this problem, it is important to plan research out ahead of time: decide what question you want to address and figure out how many hypothesis tests you need to run. When running multiple tests, use a lower significance threshold (e.g., 0.01) for each test to reduce the probability of making a type I error.
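The effect of running several tests can be worked out directly. The sketch below assumes the 10 tests are independent and that the null hypothesis is true for each one; under those assumptions, the probability of at least one type I error is one minus the probability that no test produces one. The function name is hypothetical.

```python
def prob_at_least_one_type_i_error(threshold, num_tests):
    """Chance of one or more false positives across independent tests of true nulls."""
    return 1 - (1 - threshold) ** num_tests


for threshold in (0.05, 0.01):
    probability = prob_at_least_one_type_i_error(threshold, num_tests=10)
    print(threshold, round(probability, 3))
# prints:
# 0.05 0.401
# 0.01 0.096
```

Lowering the per-test threshold from 0.05 to 0.01 brings the chance of at least one false positive across 10 tests from about 40% down to under 10%.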
Sometimes, when we run a hypothesis test, we simply report a p-value or a confidence interval and give an interpretation (e.g., the p-value was 0.05, which means that there is a 5% chance of observing two or fewer heads in 10 flips of a fair coin).
In other situations, we want to use our p-value to make a decision or answer a yes/no question. For example, suppose that we're developing a new quiz question at Codecademy and want learners to have a 70% chance of getting the question right (higher would mean the question is too easy, lower would mean the question is too hard). We show our quiz question to a sample of 100 learners and 60 of them get it right. Is this significantly different from our target of 70%? If so, we want to remove the question and try to rewrite it.
In order to turn a p-value, which is a probability, into a yes or no answer, data scientists often use a pre-set significance threshold. The significance threshold can be any number between 0 and 1, but a common choice is 0.05. P-values that are less than this threshold are considered "significant", while larger p-values are considered "not significant".
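To make the decision rule concrete, here is a minimal sketch for the example of 60 correct answers out of 100 learners, using only the Python standard library. The function names are hypothetical, and the two-sided p-value is computed by summing the probabilities of all outcomes that are no more likely than the observed one under the null hypothesis, which is one common convention; a library routine such as SciPy's `binomtest` would normally be used in practice.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binom_p_value(successes, n, p_null):
    """Two-sided binomial test p-value for the null hypothesis P(success) = p_null."""
    observed = binom_pmf(successes, n, p_null)
    # Small relative tolerance guards against floating-point round-off in the comparison.
    cutoff = observed * (1 + 1e-7)
    total = sum(
        binom_pmf(k, n, p_null)
        for k in range(n + 1)
        if binom_pmf(k, n, p_null) <= cutoff
    )
    return min(1.0, total)


threshold = 0.05  # pre-set significance threshold
p_value = two_sided_binom_p_value(successes=60, n=100, p_null=0.7)

# Turn the probability into a yes/no decision.
decision = "remove the question" if p_value < threshold else "keep the question"
print(p_value, decision)
```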