oldoc63 opened 2 years ago
You just got a very cool job in the film industry. Your first assignment is to do some research on the content produced by streaming services in the last few years. You're putting together information for a report on the kinds of films being produced as well as any patterns that might be worth exploring further.
You decide to start your research by exploring some data about films and documentaries produced by Netflix. The dataset you'll be using is a modified version of one found on the website Kaggle. Your dataset includes 503 films with the following variables:
You decide to start with the language variable. The table that follows gives the count of films in each language.
https://static-assets.codecademy.com/Courses/data-literacy/stats/project/language.svg
Using the table, try answering the following individual questions. Then use your answers to write up a brief summary about the primary languages of the films.
There are 360 films with English as the primary language. To get the proportion, we divide 360 by the total of 503 films: 360/503 = 0.72. About 0.72 of the films have English as their primary language.
We know there are 360 English-language films, but we have to do a little work to find the number of other single-language films. We can add the four categories for Spanish, Hindi, French, and "other single" (29 + 27 + 15 + 51 = 122). Or we can subtract the English and "multiple" categories from the total (503 - 360 - 21 = 122). This means the ratio of English films to non-English single-language films is 360 to 122. Since 360 ÷ 122 is 2.95, there are almost 3 English films for every non-English single-language film.
There are 21 films that have multiple primary languages. Dividing 21 by 503 gives a proportion of 0.04.
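The arithmetic in the three answers above can be checked with a short sketch. The counts come from the text; assigning 29, 27, and 15 to Spanish, Hindi, and French is an assumption based on the order in which the text lists them, and the variable names are only illustrative.

```python
# Counts from the language table. The Spanish/Hindi/French assignment is
# assumed from the order in which the text lists them.
counts = {
    "English": 360,
    "Spanish": 29,
    "Hindi": 27,
    "French": 15,
    "Other single": 51,
    "Multiple": 21,
}

total = sum(counts.values())

# Proportion = category count divided by the total count.
english_prop = counts["English"] / total
multiple_prop = counts["Multiple"] / total

# Non-English single-language films: everything except English and "multiple".
non_english_single = total - counts["English"] - counts["Multiple"]
ratio = counts["English"] / non_english_single

print(f"Total films: {total}")
print(f"English proportion: {english_prop:.2f}")
print(f"Multiple-language proportion: {multiple_prop:.2f}")
print(f"English to non-English single-language: "
      f"{counts['English']} to {non_english_single} ({ratio:.2f} to 1)")
```

Summing the six categories is also a quick sanity check that the table accounts for all 503 films.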
The distribution of IMDb scores is mostly symmetrical with a bell shape, suggesting an approximately normal distribution. There are a couple of very low scores, but they are not far from the rest of the distribution, so they may not be extreme enough to be considered outliers. Since the distribution is fairly symmetrical, we can rely on the mean of 6.3 to give us a good idea of what a typical IMDb rating is. With a standard deviation of 1, we know there is some variation in scores, but most scores fall between 4 and 8 on the 1-10 scale.
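One way to see where the "between 4 and 8" range comes from is the common rule of thumb that, for a roughly bell-shaped distribution, about 95% of values lie within two standard deviations of the mean. This is a sketch under that assumption; the source does not say which rule it used.

```python
mean_score = 6.3  # mean IMDb score from the text
sd_score = 1.0    # standard deviation from the text

# Rule of thumb for roughly bell-shaped data: about 95% of values fall
# within two standard deviations of the mean.
low = mean_score - 2 * sd_score
high = mean_score + 2 * sd_score

print(f"Roughly 95% of scores expected between {low:.1f} and {high:.1f}")
```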
Genre is a categorical variable — this variable gives information about a quality of the films that is non-numeric. We can describe categorical variables using frequencies, proportions, and ratios.
Our colleague might create a table showing the different genre categories, the count of films in each category (frequency), and the percentage of the total this count represents (proportion). Our colleague might also compare counts of genres to one another using ratios.
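A frequency table like the one described could be built as follows. The genre labels here are made up for illustration and are not the project's data; `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical genre labels, made up for illustration only.
genres = ["Documentary", "Comedy", "Drama", "Documentary", "Comedy",
          "Documentary", "Thriller", "Drama", "Documentary", "Comedy"]

def frequency_table(values):
    """Return (category, count, proportion) rows, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(cat, n, n / total) for cat, n in counts.most_common()]

table = frequency_table(genres)
for category, count, proportion in table:
    print(f"{category:<12} {count:>3} {proportion:>6.0%}")
```

Ratios between genres follow directly from the counts, for example the Documentary count divided by the Comedy count.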
Your colleague used analytics software to create a summary of the runtime variable. The software program has a default setting for numeric variables that outputs a distribution plot, the mean, and the standard deviation. The analytics software produced the following plot and statistics for the runtimes:
https://static-assets.codecademy.com/Courses/data-literacy/stats/project/runtime-distribution.svg
There are two aspects of this distribution plot that might lead to concern about using the mean and standard deviation:
The distribution is left-skewed — it has a long tail of low values on the left side. These values might influence the mean to be lower.
There is a single high value of just above 200 minutes. This value might be an outlier that influences the mean to be higher.
We could use statistics that are more robust to outliers and skewness, such as the median and interquartile range (IQR). The median is the middle value and the IQR is the range of the middle 50% of the data (Q3 - Q1).
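As a minimal sketch of these two statistics, the example below uses Python's standard `statistics` module on made-up runtimes (not the project's data). Quartile conventions differ between tools, so the IQR from other software may not match exactly; the `"inclusive"` method is just one choice.

```python
import statistics

# Hypothetical runtimes in minutes, made up for illustration only.
runtimes = [12, 30, 85, 90, 94, 97, 99, 101, 104, 108, 112, 209]

# Median: the middle value of the sorted data.
median = statistics.median(runtimes)

# quantiles(n=4) returns the three quartile cut points: Q1, Q2, Q3.
q1, _, q3 = statistics.quantiles(runtimes, n=4, method="inclusive")
iqr = q3 - q1  # range of the middle 50% of the data

print(f"Median: {median}")
print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")
```

Neither the two short films nor the single very long one moves the median or IQR much, which is what makes these statistics robust.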
Mean: 92.5
Standard Deviation: 28.4
The mean describes a typical runtime as in the low 90s. The standard deviation describes the distribution as having wide variability, with runtimes differing from the mean by almost 30 minutes on average. These measurements are not wrong, but they don’t help us do a good job of summarizing what we’re seeing in the distribution.
Median: 97.0
IQR: 21.8
In contrast, the median describes a higher runtime as most typical. The low IQR indicates that half the values aren’t very far from the center value. These descriptions better match the large number of values near 100 that we see in the distribution plot.
Since the mean is less than the median, it seems like the left-skew is more influential on the mean than the high potential outlier is.
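The pull of a low tail versus a single high value can be illustrated with made-up numbers. This sketch is not the project's data; it only shows that several low values can drag the mean below the median even when one high outlier pulls the other way.

```python
import statistics

# Hypothetical runtimes clustered near 100 minutes (made-up values).
core = [90, 94, 96, 98, 100, 102, 104, 106, 110]
low_tail = [10, 15, 25, 40]  # short films forming a left tail
high_outlier = [209]         # one unusually long film

def summarize(label, values):
    """Print and return the (mean, median) of the values."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    print(f"{label:<22} mean={mean:6.1f}  median={median:6.1f}")
    return mean, median

core_stats = summarize("core only", core)
tail_stats = summarize("core + low tail", core + low_tail)
both_stats = summarize("core + tail + outlier", core + low_tail + high_outlier)
```

In this toy data the outlier raises the mean somewhat, but the mean still ends up below the median, and the median barely moves throughout.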
https://static-assets.codecademy.com/Courses/data-literacy/stats/project/imdbscore-genre.svg
Most of the mean and standard deviation pairs are not far from the overall mean and standard deviation of all IMDb scores (6.3 and 1.0). However, there are a couple of patterns that stand out.
The “Romance/Romantic Comedy” genre has the lowest standard deviation at 0.6. This may indicate this genre was pretty consistent in getting scores close to the mean of 5.9.
The “Action/Sci-Fi” and “Comedy” genres had similar mean scores to “Romance/Romantic Comedy” but a wider spread of scores.
The “Documentary” genre had the highest mean IMDb score. Since its standard deviation isn’t particularly large, this may indicate Netflix documentaries tended to rate well fairly consistently.
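A per-genre table of means and standard deviations could be produced along these lines. The scores and the shortened genre labels are invented for illustration, and `summarize_by_group` is a hypothetical helper, not the method the analytics software uses.

```python
import statistics
from collections import defaultdict

# Hypothetical (genre, IMDb score) pairs, made up for illustration only.
films = [
    ("Documentary", 7.1), ("Documentary", 6.8), ("Documentary", 7.4),
    ("Comedy", 5.2), ("Comedy", 6.6), ("Comedy", 4.9),
    ("Romance", 5.8), ("Romance", 6.0), ("Romance", 6.2),
]

def summarize_by_group(rows):
    """Map each group to (mean, sample standard deviation)."""
    groups = defaultdict(list)
    for group, score in rows:
        groups[group].append(score)
    return {
        group: (statistics.mean(scores), statistics.stdev(scores))
        for group, scores in groups.items()
    }

summary = summarize_by_group(films)
for genre, (mean, sd) in summary.items():
    print(f"{genre:<12} mean={mean:.1f}  sd={sd:.1f}")
```

Reading the mean and standard deviation together is what supports statements like "rated well fairly consistently": a high mean with a small spread.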
You are wondering if there are any differences in the length of films across languages.
https://static-assets.codecademy.com/Courses/data-literacy/stats/project/runtime-language.svg
There are some interesting differences among the means and standard deviations in the table.
https://static-assets.codecademy.com/Courses/data-literacy/stats/project/scatter-scrn.svg
The plot does not show any linear relationship between the two variables. The plot mainly shows a cloud of points that aren’t close to the shape of a line. Lower runtimes aren’t associated with particularly low or high IMDb scores. Higher runtimes aren’t associated with particularly low or high IMDb scores.
Most of the films have runtimes between 50 and 150 minutes with varying IMDb scores between about 6.0 and 8.0 across those runtimes in no particular pattern.
A correlation coefficient of 0.92 would indicate a very strong, positive linear relationship between runtime and IMDb score: shorter films would have lower IMDb scores and longer films would have higher ones. Since 0.92 is so close to 1, we would conclude that this pattern holds very strongly with little variation. But this is incorrect: it contradicts the scatter plot, which shows no linear relationship between the two variables.
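To make the mismatch concrete, here is a small sketch that computes the Pearson correlation coefficient by hand for two made-up datasets: one scattered like the plot described above and one that really does follow a line. The data and the `pearson_r` helper are illustrative assumptions, not the project's values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data, made up for illustration only.
runtimes = [60, 75, 90, 95, 100, 105, 120, 140]
scattered_scores = [6.5, 7.2, 5.8, 6.9, 6.1, 7.0, 6.2, 6.5]  # no clear trend
rising_scores = [5.0, 5.4, 6.0, 6.1, 6.3, 6.6, 7.1, 7.9]      # close to a line

r_scattered = pearson_r(runtimes, scattered_scores)
r_rising = pearson_r(runtimes, rising_scores)

print(f"Scattered cloud: r = {r_scattered:.2f}")
print(f"Near-linear:     r = {r_rising:.2f}")
```

Only the near-linear data gives a coefficient above 0.9, so a value like 0.92 should come with points that hug a line, not with a shapeless cloud.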
Overview
Explore Netflix data with your new understanding of summary statistics!
In this project, you'll practice using summary statistics from real data. You will: