nishantsingh93 / Movie-Revenue-Prediction-


Revenue Prediction of TMDB Movie dataset



I. Introduction:

Dataset: This project aims to predict the revenue of movies using a supervised learning approach. The TMDB dataset contains around 5000 movies and TV series. TMDB is one of the largest movie databases available on the web. The dataset is split into CSV files, and many of the columns are in JSON format.
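Because many columns hold JSON text, they need to be unpacked before they can be used as features. The following is a minimal sketch of one way to do that with the jsonlite package. The toy data frame, the `genres` column layout, and the helper `extract_names` are illustrative assumptions, not code taken from this project.

```r
library(jsonlite)

# Toy stand-in for a JSON-encoded column; the values are made up for illustration.
movies <- data.frame(
  id = c(1, 2),
  genres = c(
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[]'
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pull the "name" field out of one JSON-encoded cell.
extract_names <- function(json_text) {
  parsed <- fromJSON(json_text)
  if (length(parsed) == 0) return(character(0))  # empty JSON array
  parsed$name
}

# Store the parsed names as a list column and derive a simple numeric feature.
movies$genre_names <- lapply(movies$genres, extract_names)
movies$n_genres <- lengths(movies$genre_names)
```

A count such as `n_genres` is only one possible feature; the names could also be turned into indicator columns.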

Objective: Given information about a movie such as release month, cast, budget, film reviews, director, production house, and language, can we predict its total gross revenue? By analyzing the revenues generated by previous movies, one can build a model that predicts the expected revenue for a new movie. Such a prediction could be very useful for the studio producing the movie, which can then decide on expenses such as artist compensation, advertising, and promotions accordingly. Investors can also estimate the expected return on investment, and movie theaters can estimate the revenue they would generate from screening a particular movie.



II. Data Preparation:

About the Dataset:

The dataset is divided into two CSV files, tmdb_5000_credits.csv and tmdb_5000_movies.csv. The major columns are:

Dataset Cleaning:



IV. Training Machine Learning Algorithms:

Figure: scatterplot matrix

Figure: correlation plot


Linear Regression:

Result:

Diagnostic plots:

Residuals vs Fitted: This plot shows that the residuals have a non-linear pattern. When the relationship between the predictor variables and the outcome variable is non-linear and the model does not capture it, the pattern shows up in this plot.

Normal Q-Q: A Q-Q plot compares the quantiles of a dataset with the theoretical quantiles of a probability distribution. Here it compares each standardized residual against the corresponding quantile of a standard normal distribution with the same number of points. The plot is "skewed right," meaning that most of the data is distributed on the left side with a long "tail" extending out to the right.

Scale-Location: This plot is similar to the Residuals vs Fitted plot, but it uses the square root of the absolute standardized residuals. As in the first plot, there should be no discernible pattern.

Residuals vs Leverage: The influence of an observation can be thought of as how much the predicted scores would change if the observation were excluded. Cook's distance is a good measure of an observation's influence. The leverage of an observation is based on how much its value on the predictor variable differs from the mean of the predictor variable. The more leverage an observation has, the greater its potential to influence the model. In this plot the dotted red lines mark Cook's distance, and the areas of interest are those outside the dotted lines in the top right or bottom right corner. A point that falls in one of those regions has high leverage and is influential: the fitted model would change noticeably if that point were excluded. Not all outliers have high leverage, and not all high-leverage points are outliers. In this case observations #1 and #96 have high leverage, and our choices are: justify the inclusion of #1 and #96 and keep the model as is; include a quadratic term, as indicated by the Residuals vs Fitted plot, and remodel; or exclude observations #1 and #96 and remodel.
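The diagnostics described above can be reproduced with base R. The sketch below uses simulated data, so the variable names (`budget`, `revenue`), the planted high-leverage point, and the 4/n cutoff for Cook's distance are all illustrative assumptions rather than this project's actual data or thresholds.

```r
set.seed(1)

# Simulated stand-in data with a curved relationship (names are illustrative).
n <- 100
budget <- runif(n, 1, 10)
revenue <- 2 + 0.5 * budget^2 + rnorm(n)
budget[n] <- 30    # one observation far from the mean of the predictor
revenue[n] <- 20   # ...and far from the curve the other points follow
toy <- data.frame(budget, revenue)

fit <- lm(revenue ~ budget, data = toy)

# The four default diagnostic plots: Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage.
par(mfrow = c(2, 2))
plot(fit)

# Leverage and Cook's distance for every observation.
lev <- hatvalues(fit)
cook <- cooks.distance(fit)
flagged <- which(cook > 4 / n)   # 4/n is a common rule of thumb, assumed here

# Two of the remedies: add a quadratic term, or refit without flagged points.
fit_quad <- lm(revenue ~ budget + I(budget^2), data = toy)
fit_drop <- lm(revenue ~ budget, data = toy[setdiff(seq_len(n), flagged), ])
```

Comparing `summary(fit)`, `summary(fit_quad)`, and `summary(fit_drop)` shows how much each remedy changes the coefficients and the fit.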


Ridge Regression:

Figure: cross-validation plot

opt_lambda <- cv.ridge.out$lambda.min
opt_lambda
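The snippet above reads `lambda.min` from an object named `cv.ridge.out`, which looks like the result of `cv.glmnet` from the glmnet package. The source does not show how that object was built, so the following is a sketch under that assumption, using simulated predictors in place of the cleaned movie features.

```r
library(glmnet)

set.seed(42)

# Simulated predictors and response (stand-ins for the real feature matrix).
n <- 200
p <- 10
x <- matrix(rnorm(n * p), nrow = n, ncol = p)
y <- as.numeric(x %*% c(3, -2, rep(0, p - 2)) + rnorm(n))

# alpha = 0 selects the ridge penalty; cv.glmnet uses 10-fold CV by default.
cv.ridge.out <- cv.glmnet(x, y, alpha = 0)

# Penalty value with the lowest mean cross-validated error.
opt_lambda <- cv.ridge.out$lambda.min

# Cross-validation curve: CV error against log(lambda).
plot(cv.ridge.out)

# Coefficients and predictions at the selected penalty.
ridge_coef <- coef(cv.ridge.out, s = "lambda.min")
ridge_pred <- predict(cv.ridge.out, newx = x, s = "lambda.min")
```

Setting `alpha = 1` in the same call switches the penalty from ridge to lasso.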

Lasso Regression:

Result:

Figure: lasso regression

Decision Trees:

Result:


Random Forest:

Result:



V. Conclusion:


VI. Future Enhancements:



VII. References: