One-Sample T-Test in SciPy

oldoc63 commented 1 year ago

Introduction

In this lesson, we'll walk through the implementation of a one-sample t-test in Python. One-sample t-tests are used to compare a sample average to a hypothetical population average. For example, a one-sample t-test might be used to address questions such as:

Is the average amount of time that visitors spend on a website different from 5 minutes?
Is the average amount of money that customers spend on a purchase more than 10 USD?

As an example, let's imagine the fictional business BuyPie, which sends ingredients for pies to your household so you can make them from scratch. Suppose that a product manager wants online BuyPie orders to cost around 1000 Rupees on average. In the past day, 50 people made an online purchase and the average payment per order was less than 1000 Rupees. Are people really spending less than 1000 Rupees on average? Or is this the result of chance and a small sample size?

oldoc63 commented 1 year ago

We have provided a small data set called prices, representing the purchase prices paid by BuyPie.com customers in the past day. First, print out prices to the console and examine the numbers. How much variation is there in the purchase prices? Can you estimate the mean by looking at these numbers?

oldoc63 commented 1 year ago

Calculate the mean of prices using np.mean(). Store it in a variable called prices_mean and print it out.
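
The two steps above (printing the data, then averaging it) might look like the sketch below. The lesson's actual prices data is not reproduced in this thread, so the array here is a synthetic stand-in generated with made-up parameters; only np.mean() and the variable names prices and prices_mean come from the exercise text.

```python
import numpy as np

# Hypothetical stand-in for the lesson's `prices` data: 50 synthetic orders.
# The centre (980) and spread (200) are assumptions, not the real data set.
rng = np.random.default_rng(seed=0)
prices = rng.normal(loc=980, scale=200, size=50).round()

# Inspect the raw values first to get a feel for the variation.
print(prices)

# Sample mean of the purchase prices.
prices_mean = np.mean(prices)
print(prices_mean)
```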

oldoc63 commented 1 year ago

Implementing a One-Sample T-Test

In the last exercise, we inspected a sample of 50 purchase prices at BuyPie and saw that the average was 980 rupees. Suppose that we want to run a one-sample t-test with the following null and alternative hypotheses:

Null: The average cost of a BuyPie order is 1000 rupees.
Alternative: The average cost of a BuyPie order is not 1000 rupees.

SciPy has a function called ttest_1samp(), which performs a one-sample t-test for you. ttest_1samp() requires two inputs: a sample distribution (e.g., the list of the 50 observed purchase prices) and a mean to test against (e.g., 1000):

tstat, pval = ttest_1samp(sample_distribution, expected_mean)

The function uses your sample distribution to determine the sample size and to estimate the amount of variation in the population, both of which are used to estimate the null distribution. It returns two outputs: the t-statistic and the p-value.
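
To make the role of sample size and variation concrete, here is a minimal sketch that calls ttest_1samp() and then recomputes the same two numbers by hand from the sample mean, the sample standard deviation, and n. The data is synthetic (an assumption, since the lesson's sample is not shown here), and names such as t_manual and p_manual are purely illustrative.

```python
import numpy as np
from scipy import stats

# Synthetic sample standing in for the 50 observed purchase prices.
rng = np.random.default_rng(seed=1)
sample_distribution = rng.normal(loc=980, scale=200, size=50)
expected_mean = 1000

# The library call described above.
tstat, pval = stats.ttest_1samp(sample_distribution, expected_mean)

# The same quantities by hand:
n = len(sample_distribution)
# Standard error uses the sample standard deviation (ddof=1) and the sample size.
std_err = np.std(sample_distribution, ddof=1) / np.sqrt(n)
t_manual = (np.mean(sample_distribution) - expected_mean) / std_err
# Two-sided p-value from a t distribution with n - 1 degrees of freedom.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print(tstat, pval)
print(t_manual, p_manual)
```

Both printed pairs should agree, which shows that the function needs nothing beyond the sample itself and the mean to test against.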

oldoc63 commented 1 year ago

Use ttest_1samp() to run the hypothesis test. Null: the average price is 1000 rupees. Alternative: the average price is not 1000 rupees.

oldoc63 commented 1 year ago

Print out pval to the console.

oldoc63 commented 1 year ago

P-values are probabilities, so they should be between 0 and 1. This p-value is the probability of observing an average purchase price of 980 or less, or of 1020 or more, in a sample of 50 purchases if the null hypothesis is true (that is, if the true average is 1000 rupees). If you run the test correctly, you should see a p-value of 0.49, or 49%.
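
One way to see what that probability means is to simulate it. The sketch below assumes a normally distributed population with mean 1000 and a standard deviation of 200; the standard deviation is a made-up value, since the lesson does not state one. It draws many samples of 50 orders and counts how often the sample average lands at least 20 rupees away from 1000. Because it treats the standard deviation as known, it is an approximation of the idea, not a reimplementation of ttest_1samp().

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)

null_mean = 1000   # mean under the null hypothesis
assumed_sd = 200   # hypothetical population SD (not given in the lesson)
n_orders = 50      # sample size from the lesson
n_sims = 20_000    # number of simulated days

# Draw many samples of 50 orders from the null population and average each one.
sample_means = rng.normal(null_mean, assumed_sd, size=(n_sims, n_orders)).mean(axis=1)

# Fraction of simulated averages at least 20 rupees away from 1000,
# i.e. at or below 980, or at or above 1020.
sim_p = np.mean(np.abs(sample_means - null_mean) >= 20)

# Closed-form value for the same setup (normal population, known SD).
std_err = assumed_sd / np.sqrt(n_orders)
analytic_p = 2 * stats.norm.sf(20 / std_err)

print(sim_p, analytic_p)
```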

Given that the mean purchase price in this sample was 980, which is not very far from 1000, we would expect this p-value to be relatively large. The only reason it could be small (e.g., < 0.05) is if purchase prices had very little variation (e.g., they were all within a few rupees of 980). We can see from the printed data that this is not the case. Therefore, a p-value around 0.49 makes sense!
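
The effect of variation on the p-value can be checked with a small experiment. The sketch below builds two synthetic samples of 50 prices that both average exactly 980 but differ in spread; the helper sample_with_mean_980 and the spread values 5 and 200 are illustrative assumptions, not part of the lesson.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(seed=2)

def sample_with_mean_980(spread, size=50):
    """Return a synthetic sample whose mean is exactly 980 (hypothetical helper)."""
    raw = rng.normal(loc=0, scale=spread, size=size)
    # Recentre so that the sample mean is 980 regardless of the random draw.
    return raw - raw.mean() + 980

tight_prices = sample_with_mean_980(spread=5)    # all within a few rupees of 980
wide_prices = sample_with_mean_980(spread=200)   # widely varying prices

_, p_tight = ttest_1samp(tight_prices, 1000)
_, p_wide = ttest_1samp(wide_prices, 1000)

print("low variation p-value: ", p_tight)
print("high variation p-value:", p_wide)
```

Same sample mean, same sample size, same null value: only the spread differs, and that alone decides whether a 20-rupee gap looks surprising.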

oldoc63 commented 1 year ago

Assumptions of a One Sample T-Test

When running any hypothesis test, it is important to know and verify the assumptions of the test. The assumptions of a one-sample t-test are as follows:

The sample was randomly selected from the population.
For example, if you only collect data for site visitors who agree to share their personal information, this subset of visitors was not randomly selected and may differ from the larger population.

The individual observations were independent.
For example, if one visitor to BuyPie loves the apple pie they bought so much that they convince their friend to buy one too, those observations were not independent.

The data is normally distributed without outliers, or the sample size is large enough.
There are no set rules on what a 'large enough' sample size is, but a common threshold is around 40. For sample sizes smaller than 40, and really for all samples, it's a good idea to plot a histogram of your data and check for outliers, multi-modal distributions (with multiple humps), or skewed distributions. If you see any of those things in a small sample, a t-test is probably not appropriate.

In general, if you run an experiment that violates (or possibly violates) one of these assumptions, you can still run the test and report the results, but you should also report which assumptions were not met and acknowledge that the test results could be flawed.

oldoc63 commented 1 year ago

Using plt.hist(), plot a histogram of prices and check whether the values are (approximately) normally distributed.
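
A rough sketch of that check is shown below. The prices array is again a synthetic stand-in for the lesson's data, and the bin count, axis labels, and output file name are arbitrary choices. The "Agg" backend line is only there so the script runs without a display; in a notebook you would likely drop it and call plt.show() instead of saving a file.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; an assumption for running as a script
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for the lesson's `prices` data.
rng = np.random.default_rng(seed=3)
prices = rng.normal(loc=980, scale=200, size=50)

# plt.hist returns the bin counts, the bin edges, and the drawn patches.
counts, bin_edges, _ = plt.hist(prices, bins=10)
plt.xlabel("Purchase price (rupees)")
plt.ylabel("Number of orders")
plt.title("Distribution of purchase prices")
plt.savefig("prices_hist.png")  # hypothetical output file name
```

When reading the plot, look for a single hump, rough symmetry, and no bars far away from the rest.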

oldoc63 commented 1 year ago

Review

You know how to implement a one-sample t-test in Python and verify the assumptions of the test.

One-sample t-tests are used for comparing a sample mean to an expected population mean
A one-sample t-test can be implemented in Python using the SciPy ttest_1samp() function
Assumptions of a one-sample t-test include:
- The sample was randomly drawn from the population of interest
- The observations in the sample are independent
- The sample size is large enough or the sample data is normally distributed

oldoc63 commented 1 year ago

As a final exercise, some data has been loaded for you with purchase prices for consecutive days at BuyPie. You can access the first day using daily_prices[0], the second day using daily_prices[1], and so on. To practice running a one-sample t-test and inspecting the resulting p-value, try the following:

Calculate and print out a p-value for day 1, where the null hypothesis is that the average purchase price was 1000 rupees and the alternative hypothesis is that the average purchase price was not 1000 rupees.
Run the same hypothesis tests for days 1-10 (the fastest way to do this is with a for-loop) and print out the resulting p-values. What's the smallest p-value you observe for those 10 days?
Try changing the null hypothesis so that the expected population mean that you're testing against is different from 1000. Try any numbers that you want. How do your p-values change?
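
One possible shape for these three steps is sketched below. The real daily_prices data is not included in this thread, so the list here is synthetic (10 days of 50 orders each, generated with assumed parameters), and the alternative null values 900, 950, and 1050 are arbitrary picks.

```python
import numpy as np
from scipy.stats import ttest_1samp

# Hypothetical stand-in for `daily_prices`: 10 days x 50 synthetic orders.
rng = np.random.default_rng(seed=4)
daily_prices = [rng.normal(loc=1000, scale=200, size=50) for _ in range(10)]

# Steps 1 and 2: test each day against a null mean of 1000 rupees.
pvals = []
for day, prices in enumerate(daily_prices, start=1):
    _, pval = ttest_1samp(prices, 1000)
    pvals.append(pval)
    print(f"day {day}: p-value = {pval:.3f}")

print("smallest p-value:", min(pvals))

# Step 3: change the null mean and watch how the day-1 p-value responds.
for expected_mean in (900, 950, 1050):
    _, pval = ttest_1samp(daily_prices[0], expected_mean)
    print(f"null mean {expected_mean}: p-value = {pval:.3f}")
```

Keep in mind that running ten tests makes it more likely that at least one p-value looks small purely by chance.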
