Intro to Sampling Distributions Dance Party

oldoc63 commented 1 year ago

You are a DJ trying to make sure you are ready for a big party. You don't have time to go through all the songs you can work with. Instead, you want to make sure that any sample of 30 songs from your playlist will get the party started.

The dataset we are using for this project can be found here.

A helper_functions.py file is loaded along with the script file. This file contains functions that you will use throughout this project.

oldoc63 commented 1 year ago

Make helper functions

oldoc63 commented 1 year ago

Loading in the data

You will be working with a dataset called spotify_data.csv. In script.py, use the pandas read_csv() function to load spotify_data.csv into a variable called spotify_data.

oldoc63 commented 1 year ago

Use the pandas .head() function to preview the spotify_data.
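
A minimal sketch of loading and previewing the data. Because spotify_data.csv is not included here, the example reads a few made-up rows from an in-memory string. The column values are assumptions for illustration only.

```python
import io
import pandas as pd

# Made-up rows standing in for spotify_data.csv (assumption: the real file
# has a header row with columns such as tempo and danceability).
csv_text = """tempo,danceability,energy
120.0,0.70,0.80
150.5,0.55,0.90
98.2,0.62,0.45
"""

# In script.py this line would presumably be:
#     spotify_data = pd.read_csv("spotify_data.csv")
spotify_data = pd.read_csv(io.StringIO(csv_text))

# .head() returns the first five rows by default (here, all three).
print(spotify_data.head())
```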

oldoc63 commented 1 year ago

For this project, we are going to focus on the tempo variable. This column gives the beats per minute (bpm) of each song in spotify_data.csv. The other columns in our dataset are:
- danceability
- energy
- instrumentalness
- liveness
- valence

For now, we are going to ignore these other columns. Create a variable called song_tempos that contains the tempo column data.
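
Selecting a single column can be sketched as follows. The small DataFrame is a made-up stand-in for the real spotify_data.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
spotify_data = pd.DataFrame({
    "tempo": [120.0, 150.5, 98.2],
    "danceability": [0.70, 0.55, 0.62],
})

# Indexing a DataFrame with one column name returns a pandas Series.
song_tempos = spotify_data["tempo"]
print(song_tempos)
```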

oldoc63 commented 1 year ago

Helper Functions

Let's investigate the helper functions we will use in the following sections. A file called helper_functions.py should be opened in the workspace for you. It contains three functions: choose_statistic(), population_distribution(), and sampling_distribution().

choose_statistic() allows us to choose a statistic we want to calculate for our sampling and population distributions. It contains two parameters:
- x: An array of numbers
- sample_stat_text: A string that tells the function which statistic to calculate on x. It takes on three values: "Mean", "Minimum" or "Variance".

population_distribution() allows us to plot the population distribution of a dataframe with one function call. It takes the following parameter:

- population_data: the dataframe being passed into the function

sampling_distribution() allows us to plot a simulated sampling distribution of a statistic. The simulated sampling distribution is created by taking random samples of some size, calculating a particular statistic, and plotting a histogram of those sample statistics. It contains three parameters:

- population_data: the dataframe being sampled from
- samp_size: the size of each sample
- stat: the specific statistic being measured for each sample, either "Mean", "Minimum" or "Variance"
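
The contents of helper_functions.py are not shown here, so the following is only a rough sketch of how such helpers might work, not the actual file. choose_statistic() follows the description above. simulate_sample_stats() is a hypothetical name for the non-plotting part of sampling_distribution(), and its n_samples and seed parameters are assumptions.

```python
import numpy as np

def choose_statistic(x, sample_stat_text):
    """Return the statistic named by sample_stat_text, computed on x."""
    if sample_stat_text == "Mean":
        return np.mean(x)
    elif sample_stat_text == "Minimum":
        return np.min(x)
    elif sample_stat_text == "Variance":
        return np.var(x)  # divides by n (population formula)
    raise ValueError(f"Unknown statistic: {sample_stat_text}")

def simulate_sample_stats(population_data, samp_size, stat, n_samples=500, seed=0):
    """Hypothetical helper: draw n_samples random samples of samp_size
    and return the chosen statistic of each one as an array."""
    rng = np.random.default_rng(seed)
    values = np.asarray(population_data)
    return np.array([
        choose_statistic(rng.choice(values, size=samp_size, replace=False), stat)
        for _ in range(n_samples)
    ])

# Usage with made-up data: 500 sample minimums from samples of size 30.
stats = simulate_sample_stats(np.arange(100.0), 30, "Minimum")
print(stats[:5])
```

A histogram of the returned array would be the simulated sampling distribution.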

oldoc63 commented 1 year ago

Sampling Distribution Exploration

Now that our data is loaded into script.py and we have gone over the functions in helper_functions.py, let's start exploring sampling distributions.

To start off, let's use the population_distribution() function to graph the distribution of song_tempos.

oldoc63 commented 1 year ago

The population distribution is approximately normal with a little bit of right-skewness.

oldoc63 commented 1 year ago

Now let’s plot the sampling distribution of the sample mean with sample sizes of 30 songs. To do this, use the sampling_distribution() helper function.

oldoc63 commented 1 year ago

Compare your sampling distribution of the sample means to the population mean. The sample mean is an unbiased estimator of the population mean.
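
To see what unbiased means numerically, here is a small simulation. It uses a synthetic, normally distributed population as a stand-in for song_tempos. The location, spread, seed, and number of samples are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for song_tempos (not the real data).
population = rng.normal(loc=120, scale=30, size=10_000)

# Draw 2,000 samples of 30 "songs" and record each sample mean.
sample_means = np.array([
    rng.choice(population, size=30, replace=False).mean()
    for _ in range(2_000)
])

population_mean = population.mean()
mean_of_sample_means = sample_means.mean()

# For an unbiased estimator these two numbers should be close.
print(population_mean, mean_of_sample_means)
```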

oldoc63 commented 1 year ago

Now let's plot the sampling distribution of the sample minimum with sample sizes of 30 songs.

oldoc63 commented 1 year ago

Compare your sampling distribution of the sample minimums to the population minimum to see that the sample minimum is a biased estimator.
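
The bias of the sample minimum can be checked with a similar simulation. Again the population is synthetic and only stands in for song_tempos. A sample can never contain a value below the population minimum, so the sample minimums sit at or above it.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic stand-in for song_tempos (not the real data).
population = rng.normal(loc=120, scale=30, size=10_000)
population_min = population.min()

# Minimum of each of 2,000 random samples of size 30.
sample_mins = np.array([
    rng.choice(population, size=30, replace=False).min()
    for _ in range(2_000)
])

# The average sample minimum overestimates the population minimum.
print(population_min, sample_mins.mean())
```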

oldoc63 commented 1 year ago

Now let's plot the sampling distribution of the sample variance with sample sizes of 30 songs.

oldoc63 commented 1 year ago

Go to helper_functions.py and change np.var(x) to np.var(x, ddof=1). With ddof=1, the sum of squared deviations is divided by n-1 instead of n, which is the sample variance formula.
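
A quick check of what ddof changes, using a small made-up array whose squared deviations from the mean sum to 32:

```python
import numpy as np

# Eight made-up values with mean 5; squared deviations sum to 32.
x = np.array([2, 4, 4, 4, 5, 5, 7, 9])

var_population = np.var(x)       # divides by n = 8
var_sample = np.var(x, ddof=1)   # divides by n - 1 = 7

print(var_population)  # 4.0
print(var_sample)      # 32 / 7, about 4.57
```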

oldoc63 commented 1 year ago

Calculating Probabilities

First, calculate the population mean and population standard deviation of song_tempos. Save these values into two separate variables called population_mean and population_std.

oldoc63 commented 1 year ago

Use population_mean and population_std to calculate the standard error of the sampling distribution of the sample mean with a sample size of 30. Save this value in a variable called standard_error.
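
The standard error of the sample mean is the population standard deviation divided by the square root of the sample size. A minimal sketch, using synthetic values in place of the real song_tempos:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for song_tempos (not the real data).
song_tempos = rng.normal(loc=120, scale=30, size=10_000)

population_mean = np.mean(song_tempos)
# np.std uses ddof=0 by default, i.e. the population formula.
# Note that pandas' Series.std() defaults to ddof=1 instead.
population_std = np.std(song_tempos)

# Standard error for samples of 30 songs.
standard_error = population_std / np.sqrt(30)
print(standard_error)
```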

oldoc63 commented 1 year ago

You are afraid that your party will not be enjoyable if the average tempo of the songs you randomly select is less than 140 bpm. Using population_mean and standard_error in a CDF, calculate the probability that the sample mean of 30 selected songs is less than 140 bpm. Print your result to the output terminal.
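
One way to do this is with scipy.stats.norm.cdf, treating the sampling distribution of the mean as normal. The two numbers below are placeholders, not values computed from the dataset.

```python
from scipy.stats import norm

# Placeholder values (assumptions, not results from spotify_data.csv).
population_mean = 145.0
standard_error = 5.0

# P(sample mean < 140) under a normal sampling distribution.
prob_below_140 = norm.cdf(140, loc=population_mean, scale=standard_error)
print(prob_below_140)
```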

oldoc63 commented 1 year ago

You know the party will be truly epic if the randomly sampled songs have an average tempo greater than 150 bpm. Using population_mean and standard_error in a CDF, calculate the probability that the sample mean of 30 selected songs is greater than 150 bpm. Because a CDF gives the probability of falling below a value, subtract it from 1. Print your result to the output terminal.
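
A sketch of the upper-tail calculation with scipy.stats.norm. As before, the mean and standard error are placeholder values rather than results from the dataset.

```python
from scipy.stats import norm

# Placeholder values (assumptions, not results from spotify_data.csv).
population_mean = 145.0
standard_error = 5.0

# P(sample mean > 150) = 1 - P(sample mean <= 150).
prob_above_150 = 1 - norm.cdf(150, loc=population_mean, scale=standard_error)
print(prob_above_150)
```

norm.sf(150, loc=population_mean, scale=standard_error) computes the same upper-tail probability directly.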

oldoc63 / learningDS