Dataset: This project aims to predict movie revenue using a supervised learning approach. The TMDB dataset contains around 5,000 movies and TV series and is one of the largest movie databases available on the web. The dataset is split across CSV files, and many of its columns are stored in JSON format.
Objective: Given information about a movie such as its release month, cast, budget, film reviews, director, production house and language, can we predict its total gross revenue? By analyzing the revenues generated by previous movies, one can build a model that predicts the expected revenue of a new movie. Such a prediction could be very useful for the studio producing the movie, which could then plan expenses such as artist compensation, advertising and promotions accordingly. Investors could also estimate an expected return on investment, and movie theaters could estimate the revenue they would generate from screening a particular movie.
About the Dataset:
The dataset is divided into 2 CSV files, tmdb_5000_credits.csv & tmdb_5000_movies.csv. The major columns are:
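As a sketch of the loading step (assuming the jsonlite package for the JSON-formatted columns; not necessarily the exact code used in the project), the two files can be read and a JSON column unpacked like this:

library(jsonlite)

movies  <- read.csv("tmdb_5000_movies.csv",  stringsAsFactors = FALSE)
credits <- read.csv("tmdb_5000_credits.csv", stringsAsFactors = FALSE)

# Columns such as genres hold JSON arrays; parse one entry into a data frame.
fromJSON(movies$genres[1])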
Dataset Cleaning:
Result:
Diagnostic plots:
Residuals vs Fitted: This plot checks whether the residuals show non-linear patterns. If there is a non-linear relationship between the predictor variables and the outcome variable that the model does not capture, the pattern shows up in this plot; here the residuals do show such a non-linear pattern.
Normal Q-Q: A Q-Q plot compares the quantiles of a dataset against a set of theoretical quantiles from a probability distribution; in effect, it compares every observed value against a standard normal distribution with the same number of points. Here the graph is skewed right, meaning that most of the data is concentrated on the left with a long tail extending out to the right.
Scale-Location: This plot is similar to the residuals vs fitted plot, but it uses the square root of the standardized residuals. As with the first plot, there should be no discernible pattern.
Residuals vs Leverage: The influence of an observation can be thought of as how much the predicted scores would change if that observation were excluded; Cook's distance is a good measure of this influence. The leverage of an observation is based on how much its value on the predictor variables differs from the mean of the predictor variables: the greater the leverage, the greater the potential that point has to influence the model. In this plot the dotted red lines mark Cook's distance, and the areas of interest are those outside the dotted lines in the top-right and bottom-right corners. If a point falls in one of those regions, the observation has high leverage and a high potential to influence the model. It is not always the case that outliers have high leverage, or vice versa. Here observations #1 and #96 have high leverage, and our choices are: justify the inclusion of #1 and #96 and keep the model as is; include a quadratic term, as indicated by the Residuals vs Fitted plot, and remodel; or exclude observations #1 and #96 and remodel.
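The four plots discussed above can be produced directly from a fitted linear model in R; a minimal sketch (the formula and the lm_fit name are illustrative, not the project's exact model):

# Fit a baseline linear model and draw the standard diagnostic plots:
# Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage.
lm_fit <- lm(revenue ~ budget + popularity + vote_count, data = movies)

par(mfrow = c(2, 2))
plot(lm_fit)
par(mfrow = c(1, 1))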
Ridge regression uses L2 regularisation to penalise the size of the coefficients while the parameters of a regression model are being learned.
Like ordinary least squares, ridge attempts to minimize the residual sum of squares of the model. However, ridge regression adds a 'shrinkage' term, the sum of the squared coefficient estimates, which shrinks the coefficient estimates towards zero. The impact of this term is controlled by a tuning parameter, lambda (determined separately).
Ridge Regression is a commonly used technique to address the problem of multi-collinearity.
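In other words, ridge regression chooses the coefficients beta that minimise

    sum_i (y_i - x_i' beta)^2 + lambda * sum_j (beta_j)^2

where the first term is the usual residual sum of squares and the second is the shrinkage penalty; lambda = 0 recovers ordinary least squares, while larger values of lambda pull the coefficients more strongly towards zero.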
The glmnet package provides the functionality for ridge regression via glmnet(); it requires a response vector and a matrix of predictors rather than a formula and data frame.
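A sketch of that setup (the predictor columns and the train/test split below are illustrative assumptions, chosen to match the x_train/y_train objects used later):

library(glmnet)

# glmnet() does not take a formula/data frame, so build a numeric predictor
# matrix and a response vector first. Keep complete rows for the chosen
# columns to avoid misalignment caused by missing values.
cols <- c("revenue", "budget", "popularity", "runtime", "vote_count")
dat  <- na.omit(movies[, cols])

x <- model.matrix(revenue ~ ., data = dat)[, -1]   # drop the intercept column
y <- dat$revenue

# A simple train/test split (assumed; the report's x_train/y_train come from
# some such step).
set.seed(1)
train   <- sample(seq_len(nrow(x)), size = 0.8 * nrow(x))
x_train <- x[train, ];  y_train <- y[train]
x_test  <- x[-train, ]; y_test  <- y[-train]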
Ridge regression involves tuning a hyperparameter, lambda. glmnet() will generate a default sequence of lambda values and run the model once for each of them, and an optimal value of lambda can then be found automatically by cross-validation using cv.glmnet(). The ridge model itself is fitted as follows:
ridge_mod = glmnet(x_train, y_train, alpha=0, lambda = lambda)
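The cross-validation step itself is not shown above; a minimal sketch of it (producing the cv.ridge.out object used below) might look like this:

# 10-fold cross-validation over glmnet's default lambda grid;
# alpha = 0 again selects the ridge (L2) penalty.
cv.ridge.out <- cv.glmnet(x_train, y_train, alpha = 0)

# Cross-validated mean squared error as a function of log(lambda).
plot(cv.ridge.out)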
Result:
The plot shows the effect of collinearity on the coefficients of an estimator.
Ridge regression is the estimator used in this example: each color represents a different feature of the coefficient vector, displayed as a function of the regularization parameter.
The above graph also shows the usefulness of applying Ridge regression to highly ill-conditioned matrices. For such matrices, a slight change in the target variable can cause huge variances in the calculated coefficients. Therefore, it is useful to set a certain regularization (lambda) to reduce this variation (noise).
When lambda is very large, the regularization effect dominates the squared-loss function and the coefficients tend to zero.
At the other end of the path, as lambda tends towards zero, the solution tends towards ordinary least squares and the coefficients exhibit big oscillations. In practice it is necessary to tune lambda so that a balance is maintained between the two.
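A coefficient-path plot of this kind can be reproduced from the cross-validated fit sketched earlier (cv.ridge.out stores the full ridge path as glmnet.fit):

# Coefficient paths: one curve per feature, traced as lambda varies.
plot(cv.ridge.out$glmnet.fit, xvar = "lambda", label = TRUE)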
opt_lambda <- cv.ridge.out$lambda.min
opt_lambda
Result:
Result:
Result:
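Finally, a sketch of how the selected lambda might be used for coefficients and held-out predictions (x_test and y_test refer to the illustrative split assumed earlier):

# Coefficients of the ridge model at the cross-validated optimum.
coef(cv.ridge.out, s = opt_lambda)

# Predictions for the held-out movies at the same lambda, plus a simple
# accuracy summary: root-mean-squared error on the test set.
y_pred <- predict(cv.ridge.out, s = opt_lambda, newx = x_test)
sqrt(mean((y_test - y_pred)^2))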