oldoc63 / learningDS

Learning DS with Codecademy and Books

Intro to Sampling Distributions Dance Party #453

Open oldoc63 opened 1 year ago

oldoc63 commented 1 year ago

You are a DJ trying to make sure you are ready for a big party. You don't have time to go through all the songs you can work with. Instead, you want to make sure that any sample of 30 songs from your playlist will get the party started.

The dataset we are using for this project can be found here.

A helper_functions.py file is loaded along with the script file. This file contains functions that you will use throughout this project.

oldoc63 commented 1 year ago

Make helper functions

oldoc63 commented 1 year ago

Loading in the data

  1. You will be working with a dataset called spotify_data.csv. In script.py, use the pandas read_csv() function to load spotify_data.csv into a variable called spotify_data.
oldoc63 commented 1 year ago
  1. Use the pandas .head() method to preview spotify_data.
oldoc63 commented 1 year ago
  1. For this project, we are going to focus on the tempo variable. This column gives the beats per minute (bpm) of each song in spotify_data.csv. The other columns in our dataset are:
    • danceability
    • energy
    • instrumentalness
    • liveness
    • valence

For now, we are going to ignore these other columns. Create a variable called song_tempos that contains the tempo column data.
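The three loading steps above can be sketched as follows. This is only an illustration: the real spotify_data.csv is not available here, so the example reads a small in-memory stand-in with made-up values and only some of the columns. In the project you would pass the file name to read_csv() instead.

```python
import io

import pandas as pd

# Made-up stand-in for spotify_data.csv so the example runs on its own.
# In script.py you would call pd.read_csv("spotify_data.csv") instead.
csv_text = io.StringIO(
    "tempo,danceability,energy\n"
    "150.0,0.83,0.81\n"
    "140.0,0.74,0.49\n"
    "128.5,0.65,0.92\n"
)
spotify_data = pd.read_csv(csv_text)

# Preview the first rows (head() returns up to five rows by default).
print(spotify_data.head())

# Select the tempo column; the result is a pandas Series.
song_tempos = spotify_data["tempo"]
print(song_tempos)
```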

oldoc63 commented 1 year ago

Helper Functions

  1. Let's investigate the helper functions we will use in the following sections. A file called helper_functions.py should be open in the workspace. It contains three functions: choose_statistic(), population_distribution(), and sampling_distribution().

    choose_statistic() allows us to choose the statistic we want to calculate for our sampling and population distributions. It takes two parameters:

    • x: An array of numbers
    • sample_stat_text: A string that tells the function which statistic to calculate on x. It accepts one of three values: "Mean", "Minimum" or "Variance".
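A function with this interface might look roughly like the sketch below. The body is an assumption based only on the description above, not the actual contents of helper_functions.py; the ValueError branch in particular is added here for illustration.

```python
import numpy as np


def choose_statistic(x, sample_stat_text):
    """Return the statistic named by sample_stat_text, computed on x.

    Hypothetical sketch; the real helper may differ in details.
    """
    if sample_stat_text == "Mean":
        return np.mean(x)
    elif sample_stat_text == "Minimum":
        return np.min(x)
    elif sample_stat_text == "Variance":
        # np.var divides by n by default (the population formula).
        return np.var(x)
    else:
        # Assumed behaviour for unknown names, not taken from the source.
        raise ValueError(f"Unknown statistic: {sample_stat_text}")


values = [1, 2, 3]
print(choose_statistic(values, "Mean"))
print(choose_statistic(values, "Minimum"))
print(choose_statistic(values, "Variance"))
```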

population_distribution() allows us to plot the population distribution of a dataframe with one function call. It takes the following parameter:

sampling_distribution() allows us to plot a simulated sampling distribution of a statistic. The simulated sampling distribution is created by repeatedly taking random samples of a given size, calculating a particular statistic on each sample, and plotting a histogram of those sample statistics. It takes three parameters:

oldoc63 commented 1 year ago

Sampling Distribution Exploration

  1. Now that our data is loaded into script.py and we have gone over the functions in helper_functions.py, let's start our sampling distribution exploration.

To start off, let's use the population_distribution() function to graph the distribution of song_tempos.

oldoc63 commented 1 year ago

The population distribution is approximately normal with a slight right skew.

oldoc63 commented 1 year ago
  1. Now let's plot the sampling distribution of the sample mean with a sample size of 30 songs. To do this, use the sampling_distribution() helper function.
oldoc63 commented 1 year ago
  1. Compare your sampling distribution of the sample means to the population mean. The sample mean is an unbiased estimator of the population mean.
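The comparison can also be checked numerically without a plot. The sketch below does not use the sampling_distribution() helper; it simulates the same idea directly with NumPy on a made-up population (normal, with an arbitrary mean of 147 and standard deviation of 24) that stands in for song_tempos.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Made-up stand-in for song_tempos; the real column comes from the CSV.
song_tempos = rng.normal(loc=147.0, scale=24.0, size=5000)
population_mean = song_tempos.mean()

# Draw 1000 random samples of 30 songs and record each sample's mean.
sample_means = np.array([
    rng.choice(song_tempos, size=30, replace=False).mean()
    for _ in range(1000)
])
mean_of_sample_means = sample_means.mean()

# For an unbiased estimator, these two numbers should be close.
print("Population mean:     ", population_mean)
print("Mean of sample means:", mean_of_sample_means)
```

Individual sample means still vary from sample to sample; "unbiased" only says that they are centered on the population mean.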
oldoc63 commented 1 year ago
  1. Now let's plot the sampling distribution of the sample minimum with a sample size of 30 songs.
oldoc63 commented 1 year ago
  1. Compare your sampling distribution of the sample minimums to the population minimum to see that the sample minimum is a biased estimator.
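The bias of the sample minimum can be seen in a small simulation. As before, this is a sketch on a made-up population standing in for song_tempos, not the helper function itself. A sample can never contain a value below the population minimum, so the sample minimums can only miss on the high side.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Made-up stand-in for song_tempos.
song_tempos = rng.normal(loc=147.0, scale=24.0, size=5000)
population_min = song_tempos.min()

# Record the minimum tempo in each of 1000 random samples of 30 songs.
sample_mins = np.array([
    rng.choice(song_tempos, size=30, replace=False).min()
    for _ in range(1000)
])
mean_of_sample_mins = sample_mins.mean()

# Every sample minimum is >= the population minimum, so their average
# sits above it: the estimator systematically overestimates.
print("Population minimum:     ", population_min)
print("Mean of sample minimums:", mean_of_sample_mins)
```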
oldoc63 commented 1 year ago
  1. Now let's plot the sampling distribution of the sample variance with a sample size of 30 songs.
oldoc63 commented 1 year ago

Image

oldoc63 commented 1 year ago
  1. Go to helper_functions.py. Change np.var(x) to np.var(x, ddof=1). With ddof=1, np.var divides the sum of squared deviations by n-1 instead of n, which is the sample variance formula.
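To see what ddof=1 changes, here is a minimal example on a small made-up array. It computes both versions with np.var and checks them against the formulas written out by hand.

```python
import numpy as np

# Small made-up array; the mean is 5 and the squared deviations sum to 32.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
n = len(x)
sum_sq_dev = np.sum((x - x.mean()) ** 2)

pop_var = np.var(x)           # divides by n
samp_var = np.var(x, ddof=1)  # divides by n - 1

print("ddof=0:", pop_var, "by hand:", sum_sq_dev / n)
print("ddof=1:", samp_var, "by hand:", sum_sq_dev / (n - 1))
```

Because n - 1 is smaller than n, the ddof=1 result is always the larger of the two (for non-constant data), which corrects the tendency of the divide-by-n formula to underestimate the population variance when applied to a sample.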
oldoc63 commented 1 year ago

Calculating Probabilities

  1. First, calculate the population mean and population standard deviation of song_tempos. Save these values in two separate variables called population_mean and population_std.
oldoc63 commented 1 year ago
  1. Use population_mean and population_std to calculate the standard error of the sampling distribution of the sample mean with a sample size of 30. Save this value in a variable called standard_error.
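These two steps can be sketched as follows, using a tiny made-up array in place of song_tempos. The standard error of the sample mean is the population standard deviation divided by the square root of the sample size. Note that np.std divides by n by default, which matches a population standard deviation, while the pandas Series.std() method defaults to ddof=1.

```python
import numpy as np

# Made-up stand-in for song_tempos.
song_tempos = np.array([120.0, 130.0, 140.0, 150.0, 160.0])

population_mean = np.mean(song_tempos)
population_std = np.std(song_tempos)  # ddof=0: population formula

# Standard error of the sample mean for samples of 30 songs.
sample_size = 30
standard_error = population_std / np.sqrt(sample_size)

print("Population mean:", population_mean)
print("Population std: ", population_std)
print("Standard error: ", standard_error)
```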
oldoc63 commented 1 year ago
  1. You are afraid that your party will not be enjoyable if the average tempo of the songs you randomly select is less than 140 bpm. Using population_mean and standard_error in a CDF, calculate the probability that the sample mean of 30 selected songs is less than 140 bpm. Print your result to the output terminal.
oldoc63 commented 1 year ago
  1. You know the party will be truly epic if the randomly sampled songs have an average tempo greater than 150 bpm. Using population_mean and standard_error in a CDF, calculate the probability that the sample mean of 30 selected songs is greater than 150 bpm. Print your result to the output terminal.
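Both probabilities follow from treating the sampling distribution of the sample mean as approximately normal, centered at population_mean with a spread of standard_error. The sketch below uses the standard library's statistics.NormalDist; with SciPy, scipy.stats.norm.cdf(140, loc=population_mean, scale=standard_error) should give the same number. The values assigned to population_mean and standard_error here are placeholders, not results from the real dataset.

```python
from statistics import NormalDist

# Placeholder values; in the project these come from song_tempos.
population_mean = 147.0
standard_error = 4.4

# Normal approximation to the sampling distribution of the sample mean.
sampling_dist = NormalDist(mu=population_mean, sigma=standard_error)

# P(sample mean < 140): the CDF evaluated at 140.
prob_below_140 = sampling_dist.cdf(140)

# P(sample mean > 150): everything the CDF at 150 does not cover.
prob_above_150 = 1 - sampling_dist.cdf(150)

print("P(mean < 140 bpm):", prob_below_140)
print("P(mean > 150 bpm):", prob_above_150)
```

The CDF gives the probability of falling below a value, so the "greater than" case is one minus the CDF.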