ixxmu / mp_duty

Scrapes web articles and saves them to GitHub issues
https://archives.duty-machine.now.sh/

Will Netflix Renew the Show? | R-bloggers #243

Closed: ixxmu closed this issue 4 years ago

ixxmu commented 4 years ago

https://www.r-bloggers.com/will-netflix-renew-the-show/

ixxmu commented 4 years ago

Netflix data prediction

github-actions[bot] commented 4 years ago

Will Netflix Renew the Show? | R-bloggers by Nagdev

Will Netflix Renew the Show?

[This article was first published on R – Hi! I am Nagdev, and kindly contributed to R-bloggers.]

In the last couple of years, Netflix has become a part of my lifestyle. At the end of my day, when I turn on my TV, I check out Netflix by default. I always look forward to Fridays, when they release their original content, and I make sure I binge it by the end of my weekend. My wife and I recently binged their reality TV show “Indian Matchmaking“. Honestly, it was binge-worthy. A couple of friends and I have been talking quite a lot about this show and a possible season 2, and we have been following it on social media. During our conversations, I got curious whether Netflix would renew the show for season 2. Since the show was released so recently, Netflix would not comment on a renewal for at least a couple of weeks or months; this has been their pattern for quite some time.

The next thing a data scientist in their right mind would do is to “data science it”. That’s exactly what I did: I built a model to predict whether Netflix would renew the show. Let’s get to the basics.

Step 1: Data identification

Firstly, I needed to compile a list of all the shows Netflix has renewed or cancelled in the past. A quick Google search led me to Business Insider, which had compiled a list of shows that were cancelled. I eliminated all Marvel/Disney shows from my data collection because, well, “Disney+”, please!

Step 2: Data collection

Trends

One of the main factors that contribute to Netflix renewing a show is the view count. Netflix does not publish these numbers publicly, with a few exceptions. I identified all the shows, their seasons, and their release dates. With this compiled list, I was able to grab Google Trends data for each of these shows. I used the Google Trends scores for the 30 days following each show’s release as one of my primary variables. Below is an example of how you could get the data from Google Trends.
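
The original snippet is not preserved in this archive, so here is a minimal sketch using the gtrendsR package; the keyword and date window are illustrative, not from the original post.

library(gtrendsR)

# pull Google Trends interest for the 30 days after a show's release
# (keyword and dates are illustrative placeholders)
trend = gtrends(keyword = "Indian Matchmaking",
                time = "2020-07-16 2020-08-14")

# daily interest scores (0-100) over the window
head(trend$interest_over_time)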

Reviews

Netflix is one of those companies that makes shows or movies irrespective of what critics think. I am a little biased here: I usually trust user reviews over critics’. Just for this study, though, I included both user and critic scores from Rotten Tomatoes.
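
The post doesn’t show how these scores were collected; as a rough sketch, they could be scraped with the rvest package. The URL and CSS selectors below are hypothetical placeholders, since Rotten Tomatoes’ actual markup differs and changes over time.

library(rvest)

# hypothetical page and selectors, for illustration only
page = read_html("https://www.rottentomatoes.com/tv/indian_matchmaking")
critic_score = page %>% html_element(".critic-score") %>% html_text2()
user_score = page %>% html_element(".audience-score") %>% html_text2()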

Release

There are different aspects to when a show is released: for example, whether it comes out on a Friday or a Tuesday, and in which month or season of the year. That’s exactly what I captured: the day of the week the show was released, the month, and the season of the year, as my third set of variables.
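
As an illustration (this is not the post’s original code, and the exact season rule is my assumption), these release-timing features can be derived from a release date in base R:

# illustrative release date
release = as.Date("2020-07-16")

Release_day = weekdays(release)   # day of the week, e.g. "Thursday"
Month = months(release)           # month name, e.g. "July"

# map the month number to a meteorological season (assumed mapping)
season_of_year = c("Winter", "Winter", "Spring", "Spring", "Spring",
                   "Summer", "Summer", "Summer", "Fall", "Fall", "Fall",
                   "Winter")[as.integer(format(release, "%m"))]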

Time between releases

The final variable I considered was the time between subsequent season releases. This variable was mostly a test hypothesis, to see whether it affected renewal in any way.
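
Computing it is simple date arithmetic; a sketch with illustrative premiere dates:

# days between two consecutive season premieres (dates are illustrative)
season1_release = as.Date("2019-07-01")
season2_release = as.Date("2020-07-16")
days_between_seasons = as.numeric(season2_release - season1_release)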

Here is the link to the file if you want to download the data.

Step 3: Data Analysis

Missing Values

While collecting data, I made the mistake of not collecting the full 21 days of trend data for one particular title; I collected only 19 days. Rather than going back and re-collecting it, I imputed the missing values with the mean. I also removed the show title, as it adds no value to the analysis.

I also split the data into a modeling set and a final row: the modeling set is used to train the model, and the final row is the data we will use at the end to predict whether the show “Indian Matchmaking” will be renewed or not.

library(caret)
library(dplyr)
library(ModelMetrics)

# load data
data = read.csv("C:\\Users\\aanamruthn.NADENSO\\Downloads\\nf.csv")

# impute missing values with the column mean
data = data %>% mutate_all(~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))

# drop columns that add no value to the analysis
data = data %>% select(-c("Show", "release_date"))

# convert the response variable to a factor
data$renewed = as.factor(data$renewed)
# 2 is "yes", 1 is "no"

# split into model-building data and the final row to predict
final = data[42, ]
data = data[1:41, ]

Step 4: Modeling Simulation

Model simulation is a process I have traditionally followed for a while now. In this process, we build a series of 30 models with random seed values and random train-test splits, tuning each model through a grid search. The main reason for doing this is to make sure we get near-consistent results rather than one lucky run. As for the model, we will use a linear SVM.

# function to train models
build_models = function(i){
  # set random seed values
  set.seed(sample(1:1000, 1))

  # create a data partition with a random train split between 60% and 90%
  split = runif(1, 0.6, 0.9)
  samp = createDataPartition(y = data$renewed, p = split)
  x = data[samp$Resample1, ]
  y = data[-samp$Resample1, ]

  # 10-fold cross-validation, repeated 5 times
  control = trainControl(method = 'repeatedcv', number = 10, repeats = 5)

  # metric used to compare models is Accuracy
  metric = "Accuracy"
  set.seed(sample(1:1000, 1))

  # SVM cost grid
  svmgrid = expand.grid(cost = seq(0.05, 1, by = 0.05))

  # train the model
  model = train(renewed ~ ., data = x, method = 'svmLinear2',
                metric = 'Accuracy', tuneGrid = svmgrid, trControl = control,
                importance = TRUE, preProcess = c("center", "scale"))

  # print the results of the model
  print(model)

  # get the metrics for the test set
  test = caret::confusionMatrix(factor(y$renewed), predict(model, y))

  # return the simulation results
  return(data.frame(train_accuracy = max(model$results$Accuracy),
                    test_accuracy = as.numeric(test$overall[1]),
                    AUC = auc(factor(y$renewed), predict(model, y)),
                    train_split = split))
}

# build the model 30 times
sim_Results = do.call(rbind, lapply(1:30, build_models))

The simulation results are as shown below.

> sim_Results
   train_accuracy test_accuracy       AUC train_split
1       0.7966667     0.6923077 0.6375000   0.6661155
2       0.7900000     0.7333333 0.7500000   0.6029180
3       0.8366667     0.7000000 0.7083333   0.7334306
4       0.7716667     0.6000000 0.5416667   0.7295723
5       0.8683333     0.6428571 0.6777778   0.6295557
6       0.7850000     0.8571429 0.8888889   0.6321844
7       0.7366667     0.8000000 0.7916667   0.7398796
8       0.8633333     0.6363636 0.6071429   0.6932221
9       0.8416667     0.7500000 0.8000000   0.7601317
10      0.8633333     0.6666667 0.6944444   0.6184694
11      0.8500000     0.5384615 0.5125000   0.6796205
12      0.7950000     0.8000000 0.7500000   0.8747213
13      0.7433333     0.7500000 0.8000000   0.7768006
14      0.8093333     0.6000000 0.5833333   0.8490845
15      0.7506667     0.8000000 0.8333333   0.8529004
16      0.7000000     0.6000000 0.5555556   0.6168192
17      0.7936667     0.8333333 0.8750000   0.8221747
18      0.7333333     0.7692308 0.7375000   0.6597461
19      0.8150000     0.5833333 0.5571429   0.6855324
20      0.7966667     0.8000000 0.8333333   0.7477105
21      0.7473333     1.0000000 1.0000000   0.8951201
22      0.6933333     0.8000000 0.7500000   0.6090903
23      0.7683333     0.7500000 0.7285714   0.6824318
24      0.7333333     0.9000000 0.9166667   0.7218617
25      0.7066667     0.5384615 0.5125000   0.6619311
26      0.7380000     0.8000000 0.8333333   0.8455452
27      0.8180000     0.6000000 0.6666667   0.8500800
28      0.7783333     0.8000000 0.7916667   0.7385316
29      0.7863333     0.8333333 0.8750000   0.8320224
30      0.7733333     0.7777778 0.8333333   0.7509293

From the simulation results above, we can see that performance is fairly consistent across the train and test sets. It might be a lot to view all the model-building results at once, so let’s summarize them for easier understanding, as shown below. The average train accuracy is around 78% and the average test accuracy is 73%, which is not too bad for a model with a small sample size. The average AUC is around 73% as well.

sim_Results %>% summarize(avg_train_accuracy = mean(train_accuracy),
                          avg_test_accuracy = mean(test_accuracy),
                          avg_AUC = mean(AUC))

  avg_train_accuracy avg_test_accuracy   avg_AUC
1          0.7827778         0.7317534 0.7347619

Step 5: Variable Importance

We now have an acceptable model for predicting whether a show will be renewed. Out of curiosity, we also want to know the most significant factors in this prediction, i.e., which variables contribute most. We will use caret’s varImp() function to do this. The function automatically scales the importance scores to be between 0 and 100. Since we used an SVM classification model, the default behavior is to compute the area under the ROC curve to get variable importance.

The results below show the top 20 most important features. Season was the most important factor, followed by the days between seasons, the Day 6 Google Trends score, and critic_score.

# variable importance for the model
varImp(model)

ROC curve variable importance

  only 20 most important variables shown (out of 38)

                     Importance
Season                   100.00
days_between_seasons      75.61
Day.6                     72.76
critic_score              53.66
Day.7                     50.00
Day.5                     46.34
Day.24                    45.12
Month                     40.65
Season.1                  30.89
Day.8                     28.46
Day.13                    27.24
Day.12                    25.61
Day.2                     24.80
Day.18                    23.58
Release_day               22.36
Day.23                    17.89
Day.21                    17.48
Day.15                    16.26
Day.20                    14.63
Day.0                     14.63

Step 6: What we have all been waiting for!

Finally, we will use our show’s data to predict whether “Indian Matchmaking” will be renewed or not. From the prediction results, we can see that the show is most likely to be renewed for season 2.

# predict the results
predict(model, final)
# [1] 2
# 2 is "yes", 1 is "no"

Reflection Points

If I were to go back and work on this model, there are three things I would definitely do differently.

  1. Data collection: I would add more variables, like genre, budget, country, and Twitter trends, and probably increase my sample size from a few shows to at least 30.
  2. Feature extraction: performing a polynomial expansion on the data would be a great way to expand the feature set (see the sketch after this list).
  3. Modeling: I would train different types of models, such as trees, boosting, and neural networks, and compare the results.
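
For item 2, here is a quick sketch of a degree-2 polynomial expansion using base R’s poly(); the two numeric columns are just examples (user_score is an assumed column name).

# expand two numeric predictors into a degree-2 polynomial basis
# (user_score is a hypothetical column here)
expanded = model.matrix(~ poly(critic_score, user_score, degree = 2),
                        data = data)
head(expanded)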

Conclusion

I honestly don’t know what procedure Netflix uses to renew its shows, but from the above analysis we can see that we are not that far off. If we could tie into Netflix’s viewership and monthly trend data, we could definitely build a more accurate model to predict whether Netflix should renew a show.

Let me know what you think is a good predictor for predicting Netflix renewals.

If you like this post, do check out my other posts.

The post Will Netflix Renew the Show? appeared first on Hi! I am Nagdev.
