srinivascreddy / MLND-Capstone

Machine Learning Engineering Nanodegree Capstone project

Methods Review #1

Open wmjones opened 5 years ago

wmjones commented 5 years ago

This is where I will write down my methods review for the Lunch and Learn with Srinivas.

wmjones commented 5 years ago

Note: I wrote this while reading, so for some comments I suggest something and then later see that you already did it.

  1. Data Cleaning/Data Discovery
    • For initial discovery it is fine to read the data the way you did, but it is much more efficient to pass a dtype dictionary to pd.read_csv so that no temporary DataFrame is stored and the data is always stored efficiently (see the EDA sketch after these bullets).
    • Also, the category dtype is superior to object for categorical variables.
    • I would break train.describe() into categorical vs. numerical (include='object' (or 'category'), then exclude='object').
    • Also check the number of levels for each feature and what the few most and least frequent levels are. This often reveals problems in the data or levels that should be merged (think of two levels 'Comedy' and 'Comdy' for genre).
    • A count of missing values is good, but a percentage is better. With more than about 40% missing you should consider not using the feature at all.
    • Really good plots for the descriptive work. I'm going to steal these.
    • Maybe log-scale the revenue distplot? I see that you have a plot with logRevenue, but it is much more visually appealing and interpretable to a large audience to have a plot where just the x-axis is log scaled (see the plotting sketch after these bullets).
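A minimal sketch of the read_csv/describe/levels/missing points above. The column names and dtypes (budget, genre, original_language, release_date) are assumptions for illustration, not the project's actual schema:

```python
import numpy as np
import pandas as pd

# Assumed column names/dtypes for illustration only.
dtypes = {"budget": "float64", "genre": "category", "original_language": "category"}
train = pd.read_csv("train.csv", dtype=dtypes, parse_dates=["release_date"])

# Describe categorical and numeric features separately.
print(train.describe(include="category"))
print(train.describe(include=[np.number]))

# Number of levels plus the most/least frequent levels per categorical feature.
for col in train.select_dtypes("category"):
    counts = train[col].value_counts()
    print(col, counts.size, "levels")
    print(counts.head(3))  # most frequent
    print(counts.tail(3))  # least frequent -- watch for typo levels like 'Comdy'

# Percentage of missing values, not just the count; >40% is a drop candidate.
missing_pct = train.isna().mean().sort_values(ascending=False) * 100
print(missing_pct[missing_pct > 0])
```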
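And a sketch of the log-scaled x-axis idea, keeping revenue in raw dollars so the tick labels stay interpretable (again assuming a revenue column):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rev = pd.read_csv("train.csv")["revenue"].dropna()
rev = rev[rev > 0]  # log bins need strictly positive values

# Logarithmically spaced bins plus a log x-axis: the data stays in dollars,
# only the axis is transformed.
bins = np.logspace(np.log10(rev.min()), np.log10(rev.max()), 50)
plt.hist(rev, bins=bins)
plt.xscale("log")
plt.xlabel("Revenue ($, log-scaled axis)")
plt.ylabel("Count")
plt.show()
```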
  2. Data Processing/Feature Engineering
    • Pandas has a built-in datetime type, so you don't need str.split to get the year: use pd.to_datetime, and once the column has dtype datetime you can extract the year (you should also do this at the very beginning, in the read_csv call; see the datetime sketch after these bullets).
    • I see you do end up using pd.to_datetime, so maybe you just didn't want to use the format='%d%m%Y' (or something like that) argument in to_datetime? Usually inferring the date format works pretty well.
    • I see you have day of week and quarter; it would probably be nice to have month too, since quarter may not be fine-grained enough. Depending on the method you use and the size of the data available, you should try all three.
    • It would be very important to take inflation into account, since the average good increases in price by about 2% per year anyway. You could do this by downloading the CPI and dividing revenue by the CPI deflator to get a normalized revenue amount. An alternative version I would check is a CPI specific to the movie industry (see the deflation sketch after these bullets).
    • Same thing for population growth (it would be better to look at changes in demographics overall, i.e. how many people are in each age range in each year; this data is also on FRED).
    • You need to think about what will happen if your test data has a level for a feature that your train dataset doesn't have. The way to handle this is to make your one-hot encoding part of an sklearn Pipeline, so that when you do cross-validation the encoding is fit on each fold individually instead of on the whole training dataset. This also applies to how you handle missing values (see the pipeline sketch after these bullets).
    • I wouldn't create features using the whole dataset at once; you need to write it so that it happens on a per-fold basis (this is harder, but it avoids generalization problems and makes sure that your code will work on new data the way you intended).
    • To impute missing values, don't choose the default yourself. Use mean/median for numeric features (you can also add a new feature indicating that a value was imputed) and create a new level called 'missing' for categorical variables. For more advanced methods, use MICE or predict the feature you want to impute from your other features (not including y), then replace the missing values with the predictions.
    • I don't like the df.fillna(value=0.0, inplace=True) at the end of data_prep; it could be very problematic, and it would be better to raise an error or let missing values cause a hard break.
    • I don't think that your data pre-processing actually does "remove categories with bias and low frequency". It looks like it makes sure that train and test have the same categories and levels, and if that is what it is doing it will cause generalization problems, since the test set won't be a holdout dataset anymore (you used info from the train set to change it in some way). You should also print out what changes this part made, to make sure it didn't do too much or too little; it is hard to see what happened there.
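A sketch of the datetime handling above; the release_date column name is an assumption:

```python
import pandas as pd

# Parse up front; pd.read_csv(..., parse_dates=["release_date"]) also works.
train = pd.read_csv("train.csv")
train["release_date"] = pd.to_datetime(train["release_date"], errors="coerce")

# Calendar features at several granularities -- worth trying all of them.
train["release_year"] = train["release_date"].dt.year
train["release_quarter"] = train["release_date"].dt.quarter
train["release_month"] = train["release_date"].dt.month
train["release_dayofweek"] = train["release_date"].dt.dayofweek
```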
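A hedged sketch of the CPI deflation idea; cpi.csv (columns: year, cpi) is a hypothetical file built from a FRED series such as CPIAUCSL, not something in this repo:

```python
import pandas as pd

# Hypothetical annual CPI series downloaded from FRED (columns: year, cpi).
cpi = pd.read_csv("cpi.csv")
base = cpi.loc[cpi["year"] == 2018, "cpi"].iloc[0]  # express everything in 2018 dollars

train = pd.read_csv("train.csv", parse_dates=["release_date"])
train["release_year"] = train["release_date"].dt.year

# Deflate nominal revenue into constant base-year dollars before modeling.
train = train.merge(cpi, left_on="release_year", right_on="year", how="left")
train["real_revenue"] = train["revenue"] * base / train["cpi"]
```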
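And a sketch of the per-fold preprocessing point: imputation and one-hot encoding live inside an sklearn Pipeline, so each CV fold fits them on its own training split, and handle_unknown='ignore' keeps unseen test levels from crashing the encoder. Column names are assumptions:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.read_csv("train.csv")
numeric = ["budget", "runtime", "popularity"]  # assumed columns
categorical = ["genre", "original_language"]   # assumed columns

preprocess = ColumnTransformer([
    # add_indicator appends a flag column marking which values were imputed
    ("num", SimpleImputer(strategy="median", add_indicator=True), numeric),
    ("cat", Pipeline([
        # explicit 'missing' level for categoricals
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        # unseen test levels become all-zero rows instead of raising an error
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])

model = Pipeline([("prep", preprocess), ("gbm", GradientBoostingRegressor())])

# Preprocessing is re-fit inside every fold, so nothing leaks across folds.
scores = cross_val_score(model, train[numeric + categorical], train["revenue"],
                         cv=5, scoring="neg_root_mean_squared_error")
print(scores)
```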
  3. Model Building/Experiments
    • All Models
    • I would recommend, rather than the 3-way split (train/val/test) that you did, a train/test split followed by 5-fold cross-validation on the train set (so it would really be 5 train sets, 5 val sets, and 1 test set). There are a lot of ways to do the cross-validation (grouped, stratified, etc.), and which one is right depends on the data and on whether there is an autoregressive structure (predicting y_t using y_{t-1} or x_{t-1}); see the split/tuning sketch after these bullets.
    • You need to return the cross-validation score on the train dataset, not the score (RMSE) on the whole dataset.
    • You need to hyperparameter-tune using either the score on the val set (if you do this there could be problems with optimization bias and potential overfitting) or the CV score on the train set. GridSearchCV in sklearn is a really easy first step into hyperparameter tuning (this is also the most computationally costly part of an ML project, so it is important to know more about it).
    • You would optimize n_estimators, max_features, max_depth, learning_rate, how missing values are imputed, subsample, num_iterations, etc. (Each algorithm names the arguments slightly differently, but I just look up which ones affect model complexity and optimization speed the most (e.g. learning rate and momentum) and tune those.)
    • DecisionTreeRegressor: fair as a benchmark, but I usually use OLS (linear regression) as mine. That would also be helpful here: since you are using so many tree methods, you may want to compare to a non-tree-based method.
    • sklearn estimators have a score method that returns the score (which you can choose) on X_test, y_test.
    • You should try predicting y_new = log(y_old) directly; many methods perform better if y is more normally distributed (see the log-target sketch after these bullets).
    • LightGBM, XGBoost, and CatBoost: these aren't super necessary since you have so little data; they are made for when you need a faster GBM than what sklearn offers because you have so much data. Since you aren't hyperparameter tuning, they will function as just slightly different hyperparameter settings for a GBM. Here is a link I read when I looked into it: LINK. Though at the end the author doesn't compare the models well (he doesn't tune correctly, so the train/test scores are very different due to overfitting; the author isn't a very experienced data scientist but knows the packages better than I do).
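A sketch of the split-then-tune workflow: one held-out test set, 5-fold CV on the train set driving GridSearchCV, and the test set touched exactly once at the end. The features and grid values are assumptions:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

train = pd.read_csv("train.csv")
X = train[["budget", "runtime"]]  # assumed numeric features
X = X.fillna(X.median())          # crude fill, for brevity only
y = train["revenue"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

# 5-fold CV on the train set only; the test set stays untouched until the end.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [2, 3, 5],
    "learning_rate": [0.03, 0.1],
    "subsample": [0.8, 1.0],
}
search = GridSearchCV(GradientBoostingRegressor(), param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_train, y_train)
print("best CV score:", search.best_score_, search.best_params_)

# Touch the held-out test set once, with the chosen configuration.
print("test score (neg RMSE):", search.score(X_test, y_test))
```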
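And a sketch of fitting on log(y) while still scoring on the original scale; TransformedTargetRegressor handles the inverse transform, and OLS doubles as the non-tree benchmark mentioned above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")
X = train[["budget", "runtime"]]  # assumed features
X = X.fillna(X.median())
y = train["revenue"]

# Fit on log1p(y); predictions are mapped back to dollars automatically.
ols_log = TransformedTargetRegressor(regressor=LinearRegression(),
                                     func=np.log1p, inverse_func=np.expm1)
print(cross_val_score(ols_log, X, y, cv=5,
                      scoring="neg_root_mean_squared_error"))
```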
  4. Model Evaluation/Performance
    • In the function train_model I am worried that the variable prediction isn't created correctly. The index for the validation set (valid_index) is different in each fold, so you may be taking the average of predictions for different observations. We should talk about what you are trying to do here; the out-of-fold pattern I would expect is sketched below.
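For reference, a generic sketch (not your actual code) of out-of-fold predictions: each row is predicted exactly once, by the fold in which it sits in the validation split, and written back at its own valid_index:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

def oof_predictions(X, y, n_splits=5):
    """Return one out-of-fold prediction per training row (X, y are arrays)."""
    oof = np.zeros(len(X))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_index, valid_index in kf.split(X):
        model = GradientBoostingRegressor().fit(X[train_index], y[train_index])
        # Write predictions back at their own positions; never average
        # predictions made at different valid_index sets across folds.
        oof[valid_index] = model.predict(X[valid_index])
    return oof
```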
  5. Model Reporting/Plots
    • Looks pretty good.
    • I would make more prominent a table showing train, eval, and test scores (and the final model configuration's score vs. other configurations that didn't perform as well).
  6. Other notes
    • I would also try methods that aren't tree-based. I would do ridge regression, and maybe some other things like an SVM or an ANN.
    • I would also try out a dimensionality-reduction method in the pipeline, like TruncatedSVD or PCA. You could also try rescaling the input variables (i.e. normalization, a Box-Cox transformation) and using different categorical encoding methods from the category-encoders package (see the pipeline sketch after these bullets).
    • I would also create a table showing training times and predictions on specific parts of the data (e.g. which day of the week the model performs best/worst on); a sketch of that table is below as well.
    • Have a clear conclusion for which model is the final model that you choose to use and why.
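A sketch of a non-tree pipeline with rescaling and dimensionality reduction, per the bullets above; the features and hyperparameter values are assumptions:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("train.csv")
X = train[["budget", "runtime", "popularity"]]  # assumed numeric features
X = X.fillna(X.median())
y = train["revenue"]

ridge = Pipeline([
    ("scale", StandardScaler()),   # rescale inputs before PCA/Ridge
    ("pca", PCA(n_components=2)),  # TruncatedSVD works better on sparse one-hot data
    ("ridge", Ridge(alpha=1.0)),
])
print(cross_val_score(ridge, X, y, cv=5,
                      scoring="neg_root_mean_squared_error"))
```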
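And a sketch of the per-slice error table; the numbers here are made-up placeholders standing in for real validation actuals, predictions, and a day-of-week feature:

```python
import pandas as pd

# Made-up validation results: actuals, predictions, day of week (0=Mon .. 6=Sun).
results = pd.DataFrame({
    "revenue":    [1_000_000, 5_000_000, 250_000, 4_000_000],
    "prediction": [1_200_000, 4_500_000, 900_000, 3_800_000],
    "dayofweek":  [4, 4, 1, 5],
})
results["abs_error"] = (results["revenue"] - results["prediction"]).abs()
# Which day of the week does the model perform best/worst on?
print(results.groupby("dayofweek")["abs_error"].agg(["mean", "median", "count"]))
```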