oldoc63 / learningDS

Learning DS with Codecademy and Books
Welcome to Inferential Statistics #454

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

We will cover a few hypothesis tests that can be used to make inferences about populations. We'll cover ways of implementing these hypothesis tests in Python and simulate some hypothesis tests without pre-built functions.

oldoc63 commented 1 year ago

Why is this important?

Descriptive statistics and inferential statistics are two subfields of statistics. Descriptive statistics include numerical and visual summaries of data. Hypothesis testing, on the other hand, is a form of inferential statistics, which is used to draw inferences about a population using a smaller sample of data.

This is important because descriptive statistics can tell us about the data that we have, but sometimes we can't collect all of the data that we need to answer our questions. For example, maybe we want to know whether people who get a vaccine are less likely to get a disease. We can't vaccinate every single person in the world to test this, so we'll have to vaccinate a smaller sample of people instead. Then, if the vaccine seems to work in our sample, we need to know whether that could have been a random fluke, or if it's likely to be true for the rest of the population. This is where hypothesis testing can help.

At the end, we will be able to:

oldoc63 commented 1 year ago

Descriptive vs Inferential

Descriptive and inferential statistics are two subfields of the larger field of statistics. Each is used for distinct purposes. In this article, we will introduce and explore some of the methods associated with each one.

Descriptive Statistics

Descriptive statistics is all about summarizing data. It is useful for condensing large amounts of information into an interpretable set of numbers and/or visualizations. Imagine a long and complex spreadsheet; we would not be able to easily understand the data (any trends, patterns, or meaningful summaries) just by looking over the rows and columns. However, with descriptive statistics, we are able to distill that information into numbers and visualizations that we can make sense of.

Commonly used descriptive statistics include the average, median, frequency, standard deviation, and range of a set of values. These numeric descriptive statistics can also be displayed as visual representations, such as tables and graphs.
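As a minimal sketch of the statistics named above, the snippet below computes each one with Python's standard library. The daily_sales values are made up for illustration and do not come from any real dataset.

```python
import statistics
from collections import Counter

# Hypothetical daily sales counts for one week (invented for illustration).
daily_sales = [12, 15, 11, 15, 20, 18, 15]

mean_sales = statistics.mean(daily_sales)      # average
median_sales = statistics.median(daily_sales)  # middle value once sorted
frequency = Counter(daily_sales)               # how often each value occurs
std_dev = statistics.stdev(daily_sales)        # sample standard deviation
value_range = max(daily_sales) - min(daily_sales)

print(f"mean: {mean_sales:.2f}")
print(f"median: {median_sales}")
print(f"most common value and its count: {frequency.most_common(1)[0]}")
print(f"standard deviation: {std_dev:.2f}")
print(f"range: {value_range}")
```

Seven raw numbers become five summaries here; the same calls work unchanged on a list with thousands of entries, which is where the summaries start to matter.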

Example: Sales Company

Suppose you work at a large company and you have been given a dataset of sales information from the past month. You could use descriptive statistics to turn the dataset into a one-page report or table that will be more readable and provide more information than the raw data. Some potential descriptive statistics in the report could include:

Average number of sales per day / Total sales of each product this month

Image

The data is easily interpretable in this form. Instead of trying to make sense of a spreadsheet of the raw data, we can learn specific pieces of information very easily. In this example, we can see that Product 3 is sold the least while Product 7 is sold the most.

oldoc63 commented 1 year ago

Inferential Statistics

Inferential statistics is all about using a sample (a subset of a population) to make inferences about a larger population of interest. This is useful when we want to know something about a population but cannot observe every member, often due to time, feasibility, or monetary constraints. Some methods that are used in inferential statistics include hypothesis testing and regression.

The key to inferential statistics is understanding that samples do not always accurately reflect the population they came from. A large part of inferential statistics is quantifying our uncertainty about a population by looking at a smaller sample.

For example, the population shown below is made up of 10 blue dots and 5 red dots, which means that two thirds of the population is blue and one third is red. Suppose we take a random sample of 3 dots and want to use that sample to estimate the proportion of the population that is blue. If we're lucky, we'll sample two blue dots and one red dot, as shown on the left, which would accurately represent the proportions of blue to red dots in the population. However, we could also randomly sample 3 blue dots, or 2 red dots and 1 blue dot, neither of which matches the population. Inferential statistics allows us to look at a sample and then quantify our uncertainty about how similar (or different) the entire population might be.

Image
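To make the dot example concrete, here is a small sketch that works out how often a random sample of 3 dots contains each possible number of blue dots. It computes the exact probabilities by counting combinations and then checks them with a simulation. The seed and the number of trials are arbitrary choices made for this illustration.

```python
import random
from math import comb

# The population described above: 10 blue dots and 5 red dots.
population = ["blue"] * 10 + ["red"] * 5
sample_size = 3

def prob_blue_count(k):
    """Exact probability that a random sample of 3 dots has exactly k blue dots."""
    return comb(10, k) * comb(5, sample_size - k) / comb(15, sample_size)

exact = {k: prob_blue_count(k) for k in range(sample_size + 1)}

# Simulate many samples (drawn without replacement) to check the exact values.
rng = random.Random(0)   # fixed seed so the run is repeatable
n_trials = 10_000
counts = {k: 0 for k in range(sample_size + 1)}
for _ in range(n_trials):
    sample = rng.sample(population, sample_size)
    counts[sample.count("blue")] += 1
simulated = {k: counts[k] / n_trials for k in counts}

for k in range(sample_size + 1):
    print(f"{k} blue: exact {exact[k]:.3f}, simulated {simulated[k]:.3f}")
```

The exact probability of drawing two blue dots and one red dot is 225/455, or about 0.49, so a sample of 3 mirrors the population only about half the time. That gap between sample and population is the uncertainty inferential statistics tries to quantify.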

oldoc63 commented 1 year ago

Example: Customer Contacting

Suppose you work at a sales company that is interested in testing two different customer contacting methods to see if one leads to a higher response rate than the other. It is impossible to test both methods with the entire population of every single past, present, and future customer. Instead, you could take a sample of 1000 customers and randomly assign them to either a text contacting system or a phone calling system. After one month, you could then calculate the difference in response rate (a descriptive statistic) for the two sampled groups.

Suppose you find that the customers who received a text were 12% more likely to respond than the customers who received a phone call. This is a descriptive statistic about the sample, but what you really want to know is: if you had been able to test the whole population of customers, would you still have found at least a 12% difference in response rate?

This is where inferential statistics methods come in handy. For example, you could use a hypothesis test to estimate how likely it would be to observe a difference of at least 12% in your sample if texts and calls actually had the same response rate in the full population. If that probability is very small, the difference you observed is unlikely to be a random fluke.
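One way to simulate such a test without pre-built functions is a permutation test, sketched below. The response counts are invented for illustration: the sketch assumes 500 customers per group and reads the 12% figure as a 12 percentage-point gap (30% versus 18%), which is only one possible interpretation of the example.

```python
import random

# Hypothetical outcomes (1 = responded, 0 = did not respond); not real data.
text_responses = [1] * 150 + [0] * 350   # 30% response rate
call_responses = [1] * 90 + [0] * 410    # 18% response rate

def response_rate(group):
    """Fraction of customers in the group who responded."""
    return sum(group) / len(group)

observed_diff = response_rate(text_responses) - response_rate(call_responses)

# If the contact method made no difference, the group labels would be
# interchangeable. Shuffle the pooled outcomes, split them into two groups
# of the original sizes, and record how often the shuffled difference is
# at least as large as the observed one.
rng = random.Random(42)   # fixed seed so the run is repeatable
pooled = text_responses + call_responses
n_text = len(text_responses)
n_permutations = 2000
at_least_as_large = 0
for _ in range(n_permutations):
    rng.shuffle(pooled)
    diff = response_rate(pooled[:n_text]) - response_rate(pooled[n_text:])
    if diff >= observed_diff:
        at_least_as_large += 1

p_value = at_least_as_large / n_permutations
print(f"observed difference: {observed_diff:.2f}")
print(f"one-sided p-value: {p_value:.4f}")
```

A small p-value means that shuffled labels rarely reproduce a gap as large as the observed one, which is evidence against the idea that the difference is a random fluke. The test here is one-sided because the question is specifically whether texts outperform calls.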

oldoc63 commented 1 year ago

Example: Test Scores

Suppose you are a researcher studying the relationship between high school students' homework grades and standardized test scores. It would be very difficult and expensive to collect information on homework grades and standardized test scores for every single high school student in the world. Instead, you could find a random sample of students and inspect the relationship between homework grades and standardized test scores among that sample. Finally, you could use a regression analysis (another inferential statistical method) to understand whether a similar relationship is likely to exist in the larger population of all students.
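As a rough sketch of the first step of such an analysis, the snippet below fits a least-squares line to a small sample. The grades and scores are hypothetical values made up for this illustration. Note that fitting the line only describes the sample; the inferential step (for example, a standard error or p-value for the slope) is not shown here.

```python
# Hypothetical sample of 8 students (invented values, not real data).
homework = [65, 70, 72, 80, 85, 88, 90, 95]
test_scores = [60, 68, 70, 75, 82, 85, 91, 94]

n = len(homework)
mean_x = sum(homework) / n
mean_y = sum(test_scores) / n

# Ordinary least squares for one predictor:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(homework, test_scores))
sxx = sum((x - mean_x) ** 2 for x in homework)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"test_score = {intercept:.2f} + {slope:.2f} * homework_grade")
```

A positive slope in the sample suggests that higher homework grades go with higher test scores among these students. Whether that holds for all students depends on how much the slope would vary from sample to sample, which is what the regression's inferential output is meant to quantify.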