oldoc63 / learningDS

Learning DS with Codecademy and Books

Netflix #247

Open oldoc63 opened 2 years ago

oldoc63 commented 2 years ago

Overview

Explore Netflix data with your new understanding of summary statistics!

In this project, you'll practice using summary statistics from real data. You will:

oldoc63 commented 2 years ago

Motivation

You just got a very cool job in the film industry. Your first assignment is to do some research on the content produced by streaming services in the last few years. You're putting together information for a report on the kinds of films being produced as well as any patterns that might be worth exploring further.

oldoc63 commented 2 years ago

Dataset

You decide to start your research by exploring some data about films and documentaries produced by Netflix. The dataset you'll be using is a modified version of one found on the website Kaggle. Your dataset includes 503 films with the following variables:

oldoc63 commented 2 years ago

Individual Variables

You decide to start with the language variable. The table that follows gives the count of films in each language.

https://static-assets.codecademy.com/Courses/data-literacy/stats/project/language.svg

Using the table, try answering the following individual questions. Then use your answers to write up a brief summary about the primary languages of the films.

There are clearly a lot of films that have English as their primary language. Of the 503 films, what proportion have English as their primary language?

oldoc63 commented 2 years ago

There are 360 films with English as the primary language. To get the proportion, we divide 360 by the total of 503 films: 360/503 ≈ 0.72. About 72% of the films have English as their primary language.
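The same arithmetic can be checked in a few lines of Python. This is only a minimal sketch: the helper name `proportion` is made up for illustration, and the counts are the ones quoted in the answer above.

```python
def proportion(count: int, total: int) -> float:
    """Return the share of `total` represented by `count`."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total

# Counts quoted in the answer above.
english_films = 360
total_films = 503

english_share = proportion(english_films, total_films)
print(f"{english_share:.2f}")  # prints 0.72
```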

oldoc63 commented 2 years ago

What is the ratio of English-language films to films in a single language that is NOT English?

oldoc63 commented 2 years ago

We know there are 360 English-language films, but we have to do a little work to find the number of other single-language films. We can add the four categories for Spanish, Hindi, French, and "other single" (29 + 27 + 15 + 51 = 122). Or we can subtract the English and "multiple" categories from the total (503 - 360 - 21 = 122). This means the ratio of English films to non-English single-language films is 360 to 122. Since 360 ÷ 122 ≈ 2.95, there are almost 3 English films for every non-English single-language film.
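Both routes to 122 can be cross-checked in code before computing the ratio. The sketch below uses only the counts quoted above; the variable names are illustrative.

```python
# Language counts quoted in the answer above.
total_films = 503
english = 360
multiple = 21
spanish, hindi, french, other_single = 29, 27, 15, 51

# Route 1: add up the non-English single-language categories.
non_english_by_sum = spanish + hindi + french + other_single

# Route 2: subtract English and "multiple" from the total.
non_english_by_difference = total_films - english - multiple

# The two routes should agree before we trust the ratio.
assert non_english_by_sum == non_english_by_difference

ratio = english / non_english_by_sum
print(f"{english} to {non_english_by_sum}, about {ratio:.2f} to 1")
# prints: 360 to 122, about 2.95 to 1
```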

oldoc63 commented 2 years ago

What proportion of the films has multiple primary languages?

oldoc63 commented 2 years ago

There are 21 films that have multiple primary languages. Dividing 21 by 503 gives a proportion of about 0.04, or roughly 4% of the films.

oldoc63 commented 2 years ago

Using the plot and summary statistics, describe the distribution of IMDb scores.

oldoc63 commented 2 years ago

https://static-assets.codecademy.com/Courses/data-literacy/stats/project/imdb-distribution.svg

oldoc63 commented 2 years ago

The distribution of IMDb scores is mostly symmetrical and bell-shaped, which suggests it is approximately normal. There are a couple of very low scores, but they are not far from the rest of the distribution, so they may not be extreme enough to be considered outliers. Since the distribution is fairly symmetrical, we can rely on the mean of 6.3 to give us a good idea of what a typical IMDb rating is. With a standard deviation of 1, we know there is some variation in scores, but most scores fall within about two standard deviations of the mean, between roughly 4 and 8 on the 1-10 scale.
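The "between 4 and 8" range lines up with the common rule of thumb that, for a roughly normal distribution, about 95% of values lie within two standard deviations of the mean. A minimal sketch of that calculation follows, assuming the scores really are close to normal; `sd_interval` is a hypothetical helper name.

```python
mean_score = 6.3  # mean reported above
std_dev = 1.0     # standard deviation reported above

def sd_interval(mean: float, sd: float, k: float) -> tuple[float, float]:
    """Return the interval from k standard deviations below the mean to k above."""
    return (mean - k * sd, mean + k * sd)

low, high = sd_interval(mean_score, std_dev, k=2)
print(f"{low:.1f} to {high:.1f}")  # prints 4.3 to 8.3
```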

oldoc63 commented 2 years ago

Which summary statistics might your colleague include in their summary of the film genres?

oldoc63 commented 2 years ago

Genre is a categorical variable: it gives information about a non-numeric quality of the films. We can describe categorical variables using frequencies, proportions, and ratios.

Our colleague might create a table showing the different genre categories, the count of films in each category (frequency), and the percentage of the total this count represents (proportion). Our colleague might also compare counts of genres to one another using ratios.
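A table like this is simple to build once the counts are known. Genre counts are not listed in this thread, so the sketch below reuses the language counts from the earlier table as a stand-in; the function name `frequency_table` is made up for illustration.

```python
# Language counts from earlier in this thread, standing in for genre counts,
# which are not listed here.
counts = {
    "English": 360,
    "Other single": 51,
    "Spanish": 29,
    "Hindi": 27,
    "Multiple": 21,
    "French": 15,
}

def frequency_table(counts: dict[str, int]) -> list[tuple[str, int, float]]:
    """Return (category, frequency, proportion) rows, largest category first."""
    total = sum(counts.values())
    rows = [(name, n, n / total) for name, n in counts.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)

table = frequency_table(counts)
for name, n, share in table:
    # Left-align the category, right-align the count and the percentage.
    print(f"{name:<14}{n:>5}{share:>8.1%}")
```

The same function would work unchanged on a dictionary of genre counts.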

oldoc63 commented 2 years ago

Runtime

Your colleague used analytics software to create a summary of the runtime variable. The software program has a default setting for numeric variables that outputs a distribution plot, the mean, and the standard deviation. The analytics software produced the following plot and statistics for the runtimes:

https://static-assets.codecademy.com/Courses/data-literacy/stats/project/runtime-distribution.svg

oldoc63 commented 2 years ago

There are two aspects of this distribution plot that might lead to concern about using the mean and standard deviation:

The distribution is left-skewed: it has a long tail of low values on the left side. These values might pull the mean lower.
There is a single high value of just above 200 minutes. This value might be an outlier that pulls the mean higher.

oldoc63 commented 2 years ago

Which alternative statistics could your colleague use in this case?

We could use statistics that are more robust to outliers and skewness, such as the median and interquartile range (IQR). The median is the middle value and the IQR is the range of the middle 50% of the data (Q3 - Q1).
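To see how these statistics behave on skewed data, here is a small sketch using Python's standard `statistics` module. The runtimes are hypothetical values chosen to mimic the shape described above (a cluster near 100, a tail of short films, one very long film); they are NOT the Netflix data.

```python
import statistics

# Hypothetical runtimes in minutes, NOT the Netflix data.
runtimes = [12, 25, 41, 86, 90, 92, 94, 96, 97, 98, 100, 102, 104, 107, 112, 209]

mean = statistics.mean(runtimes)
std_dev = statistics.stdev(runtimes)   # sample standard deviation
median = statistics.median(runtimes)

# quantiles(..., n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, _, q3 = statistics.quantiles(runtimes, n=4)
iqr = q3 - q1

print(f"mean={mean:.1f} sd={std_dev:.1f}")
print(f"median={median:.1f} IQR={iqr:.1f}")
```

On this made-up sample the short films drag the mean below the median, and the IQR comes out much smaller than the standard deviation. Note that software packages use slightly different quartile conventions, so IQR values can differ a little between tools.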

oldoc63 commented 2 years ago

Mean: 92.5

Standard Deviation: 28.4

The mean describes a typical runtime as being in the low 90s. The standard deviation describes the distribution as having wide variability, with runtimes differing from the mean by almost 30 minutes on average. These measurements are not wrong, but they don’t do a good job of summarizing what we’re seeing in the distribution.

Median: 97.0

IQR: 21.8

In contrast, the median describes a higher runtime as most typical. The low IQR indicates that half the values aren’t very far from the center value. These descriptions better match the large number of values near 100 that we see in the distribution plot.

Since the mean is less than the median, it seems like the left-skew is more influential on the mean than the high potential outlier is.

oldoc63 commented 2 years ago

Describe what you learn about IMDb scores across genres from the means and standard deviations in the table.

https://static-assets.codecademy.com/Courses/data-literacy/stats/project/imdbscore-genre.svg

Most of the mean and standard deviation pairs are not far from the overall mean and standard deviation of all IMDb scores (6.3 and 1.0). However, there are a couple of patterns that stand out.

The “Romance/Romantic Comedy” genre has the lowest standard deviation at 0.6. This may indicate this genre was pretty consistent in getting scores close to the mean of 5.9.
The “Action/Sci-Fi” and “Comedy” genres had similar mean scores to “Romance/Romantic Comedy” but a wider spread of scores.
The “Documentary” genre had the highest mean IMDb score. Since its standard deviation isn’t particularly large, this may indicate Netflix documentaries tended to rate well fairly consistently.
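A per-genre table of means and standard deviations can be produced by grouping scores by genre first. The sketch below uses hypothetical (genre, score) pairs and shortened genre labels purely for illustration; they are NOT values from the Netflix dataset.

```python
import statistics
from collections import defaultdict

# Hypothetical (genre, IMDb score) pairs, NOT the Netflix data.
films = [
    ("Documentary", 7.1), ("Documentary", 6.8), ("Documentary", 7.4),
    ("Comedy", 5.2), ("Comedy", 6.9), ("Comedy", 5.6),
    ("Romance", 5.8), ("Romance", 6.0), ("Romance", 5.9),
]

# Collect the scores for each genre.
scores_by_genre = defaultdict(list)
for genre, score in films:
    scores_by_genre[genre].append(score)

# Mean and sample standard deviation per genre.
summary = {
    genre: (statistics.mean(scores), statistics.stdev(scores))
    for genre, scores in scores_by_genre.items()
}

for genre, (mean, sd) in sorted(summary.items()):
    print(f"{genre:<12} mean={mean:.1f} sd={sd:.1f}")
```

In this made-up sample, "Comedy" and "Romance" have nearly the same mean but very different spreads, which is the kind of contrast the table above makes visible.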
oldoc63 commented 2 years ago

You are wondering if there are any differences in the length of films across languages.

https://static-assets.codecademy.com/Courses/data-literacy/stats/project/runtime-language.svg

There are some interesting differences among the means and standard deviations in the table.

oldoc63 commented 2 years ago

Runtime and IMDb Score

https://static-assets.codecademy.com/Courses/data-literacy/stats/project/scatter-scrn.svg

The plot does not show any linear relationship between the two variables. The plot mainly shows a cloud of points that aren’t close to the shape of a line. Neither lower nor higher runtimes are associated with particularly low or high IMDb scores.

Most of the films have runtimes between 50 and 150 minutes with varying IMDb scores between about 6.0 and 8.0 across those runtimes in no particular pattern.

oldoc63 commented 2 years ago

Taken at face value, a correlation coefficient of 0.92 indicates a very strong, positive linear relationship between runtime and IMDb score: shorter films have lower IMDb scores and longer films have higher IMDb scores. Since 0.92 is so close to 1, we would conclude this pattern holds very strongly with little variation. But this is incorrect: the scatter plot shows a cloud of points with no linear relationship, so a coefficient this close to 1 does not match the data.
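One way to sanity-check a reported coefficient against a scatter plot is to compute Pearson's r directly. The following is a minimal sketch of the standard formula; the function name `pearson_r` and both sets of scores are hypothetical and are NOT taken from the Netflix data.

```python
import math

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, NOT the Netflix data.
runtimes = [60, 80, 100, 120, 140]
rising_scores = [5.0, 5.5, 6.0, 6.5, 7.0]  # scores climb steadily with runtime
mixed_scores = [7.0, 6.0, 5.0, 6.0, 7.0]   # no overall upward or downward trend

r_rising = pearson_r(runtimes, rising_scores)
r_mixed = pearson_r(runtimes, mixed_scores)
print(f"rising: {r_rising:.2f}")
print(f"mixed:  {r_mixed:.2f}")
```

A value near 1 should only appear when the points hug an upward-sloping line, as in the first made-up series. The second series gives a value near 0. Keep in mind that r measures only linear association, so a value near 0 does not rule out other kinds of patterns.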