Sampling from a Population

oldoc63 commented 1 year ago

In statistics, we often want to learn about a large population. Since collecting data for an entire population is often impossible, researchers may use a smaller sample of data to try to answer their questions.

To do this, a researcher might calculate a statistic such as the mean or median for a sample of data. Then they can use that statistic as an estimate for the population value they really care about.

For example, suppose that a researcher wants to know the average weight of all Atlantic Salmon fish. It would be impossible to catch every single fish. Instead, the researchers might collect a sample of 50 fish off the coast of Nova Scotia and determine that the average weight of those fish is x. If the same researchers collected 50 new fish and took the new average weight, that average would likely be slightly different from the first sample average.
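
To make the salmon example concrete, here is a minimal sketch that draws two samples of 50 fish and compares their means. The population is synthetic: the normal distribution, its parameters, and the names sample_a and sample_b are assumptions made for illustration, not real salmon data.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population of salmon weights in kg (made-up parameters)
population = np.random.normal(loc=4.5, scale=1.0, size=10_000)

# Two independent samples of 50 fish each, drawn without replacement
sample_a = np.random.choice(population, size=50, replace=False)
sample_b = np.random.choice(population, size=50, replace=False)

mean_a = sample_a.mean()
mean_b = sample_b.mean()

print(f"population mean: {population.mean():.3f}")
print(f"first sample mean:  {mean_a:.3f}")
print(f"second sample mean: {mean_b:.3f}")
```

Both sample means should land near the population mean, but they will generally not match each other exactly.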

We will go over how we can extrapolate from sample data in order to describe our uncertainty about the statistics of the full population.

oldoc63 commented 1 year ago

Random Sampling in Python

Now that we've generated some random samples from a population using an applet, let's code this ourselves in Python. The numpy.random package has several functions that we could use to simulate random sampling. In this exercise, we'll use the function np.random.choice(), which generates a sample of some size from a given array.

We'll pretend that we actually have a list of all the weights of Atlantic Salmon that currently exist.

In the example code we have done the following:

Loaded the weights of all salmon into a dataframe called population.
Plotted the distribution of population and calculated the mean.
Used the np.random.choice() function to generate a sample, called sample, of size 30 (the samp_size variable is equal to 30).
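
The example code itself is not reproduced in this thread, so the following is only a rough sketch of the sampling step. It assumes a synthetic NumPy array in place of the real salmon dataframe, and it assumes sampling without replacement; neither detail is confirmed by the exercise.

```python
import numpy as np

np.random.seed(42)  # fixed seed so the run is repeatable

# Stand-in for the real data: synthetic salmon weights (made-up parameters).
# The exercise loads these from a dataframe instead.
population = np.random.normal(loc=60, scale=10, size=5000)

# Mean of the full population, rounded to 3 decimal places
population_mean = round(population.mean(), 3)

# Draw a random sample of 30 weights from the population.
# replace=False is an assumption: each fish can be picked at most once.
samp_size = 30
sample = np.random.choice(population, samp_size, replace=False)

print("population mean:", population_mean)
print("sample size:", len(sample))
```

np.random.choice() expects a one-dimensional array, so a dataframe column would need to be passed rather than the whole dataframe.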

oldoc63 commented 1 year ago

Find the mean of the sample, round it to 3 decimal places, and assign it to a variable called sample_mean.
Plot the histogram of the sample data.
Change the sample size (samp_size) to smaller values.
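
One possible way to work through these steps is sketched below. The population is again synthetic (an assumption), and np.histogram is used to compute the bin counts that a plotting call such as matplotlib's plt.hist would draw, so the sketch runs without a display.

```python
import numpy as np

np.random.seed(1)  # fixed seed so the run is repeatable

# Synthetic stand-in for the salmon weights (made-up parameters)
population = np.random.normal(loc=60, scale=10, size=5000)

samp_size = 30  # try smaller values such as 10 or 5 and rerun
sample = np.random.choice(population, samp_size, replace=False)

# Mean of the sample, rounded to 3 decimal places
sample_mean = round(np.mean(sample), 3)

# Histogram of the sample: counts per bin and the bin edges
counts, bin_edges = np.histogram(sample, bins=10)

print("sample mean:", sample_mean)
print("counts per bin:", counts)
```

Rerunning with a different seed and a smaller samp_size gives a feel for how much sample_mean moves between draws.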

oldoc63 commented 1 year ago

As we saw in the last exercise, the smaller the sample size, the more the sample mean varies from one random sample to the next. With a small sample, a few extreme values can significantly shift the sample mean.
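
A small simulation can illustrate this. The sketch below draws 1,000 samples at each of three sample sizes and measures how spread out the resulting sample means are. The synthetic population, the chosen sizes, and the spread dictionary are all assumptions made for illustration.

```python
import numpy as np

np.random.seed(7)  # fixed seed so the run is repeatable

# Synthetic population (made-up parameters)
population = np.random.normal(loc=4.5, scale=1.0, size=10_000)

n_repeats = 1000
spread = {}  # sample size -> standard deviation of the sample means

for samp_size in [5, 30, 500]:
    # Draw many samples of this size and record each sample's mean
    means = [
        np.random.choice(population, samp_size, replace=False).mean()
        for _ in range(n_repeats)
    ]
    spread[samp_size] = np.std(means)

for samp_size, sd in spread.items():
    print(f"samp_size={samp_size:>3}: std of sample means = {sd:.3f}")
```

The standard deviation of the sample means should shrink as samp_size grows, which is the pattern described above.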

oldoc63 / learningDS

Sampling from a Population #433
