Improve RandomForestClassifier model parameters

ikmckenz commented 5 years ago

n_estimators: The Predict NPS paper says they used "500 trees". The model uses a lot of RAM as the number of estimators increases. With n=200 trees, training uses almost 64GB of RAM and the model is 18.6 GB compressed on disk, yet this results in only 59% accuracy, while the default of 10 trees results in 58% accuracy.

criterion: The default is "gini", which should be fine as is. "entropy" is supposed to be more computationally intensive without being much better.

max_depth: Maybe experiment with decreasing this while increasing n_estimators. Shallower trees are smaller, which could offset some of the memory cost of adding more trees.

n_jobs: Should probably be -1 to use all cores while training and predicting.
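As a rough illustration of the parameters above, here is a minimal sketch assuming scikit-learn and a synthetic data set from `make_classification` standing in for the project's real training data. The helper name `build_model` and the specific values are hypothetical, not taken from the repository. It compares the pickled size of a forest with unlimited depth against one capped at `max_depth=4`.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def build_model(n_estimators=100, max_depth=None):
    """Hypothetical helper: a forest configured along the lines discussed above."""
    return RandomForestClassifier(
        n_estimators=n_estimators,  # more trees: more RAM and a larger model on disk
        criterion="gini",           # the scikit-learn default
        max_depth=max_depth,        # None lets each tree grow until its leaves are pure
        n_jobs=-1,                  # use all cores for fit and predict
        random_state=0,
    )


# Synthetic stand-in data (an assumption; the real features are not shown in this thread).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=8, n_classes=3, random_state=0
)

deep = build_model(n_estimators=20, max_depth=None).fit(X, y)
shallow = build_model(n_estimators=20, max_depth=4).fit(X, y)

# Pickled size is only a rough proxy for the on-disk footprint.
deep_size = len(pickle.dumps(deep))
shallow_size = len(pickle.dumps(shallow))
print(f"unlimited depth: {deep_size} bytes, max_depth=4: {shallow_size} bytes")
```

On real data the trade-off between depth, tree count, and accuracy would still have to be measured; this only shows where the knobs are.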

ikmckenz commented 5 years ago

Comparison of different n_estimators values, with all other parameters left at their defaults (except n_jobs, which is set to -1).

With 10 trees:

              precision    recall  f1-score   support 
    accuracy                           0.57     36526
   macro avg       0.60      0.59      0.59     36526
weighted avg       0.57      0.57      0.57     36526

With 100 trees:

              precision    recall  f1-score   support 
    accuracy                           0.59     36526
   macro avg       0.60      0.60      0.60     36526
weighted avg       0.58      0.59      0.58     36526

With 250 trees (uses ~55GB of RAM on training):

              precision    recall  f1-score   support 
    accuracy                           0.58     36526
   macro avg       0.60      0.60      0.60     36526
weighted avg       0.58      0.58      0.58     36526

With 500 trees (uses ~96GB of RAM on training):

              precision    recall  f1-score   support 
    accuracy                           0.59     36526
   macro avg       0.60      0.61      0.60     36526
weighted avg       0.58      0.59      0.59     36526
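A comparison like the one above could be scripted roughly as follows. This is a sketch under stated assumptions: it uses synthetic data from `make_classification` and small tree counts so it runs quickly, not the project's data set or the 10/100/250/500 settings reported here, and it does not measure RAM usage.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the data behind the reports above).
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

results = {}
for n_trees in (10, 50, 100):
    # Everything except n_estimators and n_jobs stays at the library defaults.
    model = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1, random_state=0)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[n_trees] = accuracy_score(y_test, y_pred)
    print(f"With {n_trees} trees:")
    print(classification_report(y_test, y_pred))
```

The `results` dictionary keeps one accuracy value per tree count, which makes it easy to see whether extra trees are buying anything.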

ikmckenz commented 5 years ago

There's no big improvement in accuracy with our current data set: it stays between 0.57 and 0.59 from 10 to 500 trees. Implemented a way to control the number of trees in the model and changed the default number of jobs to -1 in 8848a868. Will reconsider the tree count when we can get the accuracy up with better data cleaning.
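One way such a control could look is a command-line option passed through to the classifier. The sketch below is purely illustrative and is not the code from that commit: the flag names `--n-estimators` and `--n-jobs` and the helpers `parse_args` and `make_classifier` are assumptions. The default of 10 trees mirrors the default mentioned in this thread (scikit-learn changed its own default to 100 in version 0.22).

```python
import argparse

from sklearn.ensemble import RandomForestClassifier


def parse_args(argv=None):
    """Hypothetical CLI: flag names are illustrative, not the project's."""
    parser = argparse.ArgumentParser(description="Train the target prediction model")
    parser.add_argument("--n-estimators", type=int, default=10,
                        help="number of trees in the forest")
    parser.add_argument("--n-jobs", type=int, default=-1,
                        help="number of parallel jobs (-1 uses all cores)")
    return parser.parse_args(argv)


def make_classifier(args):
    """Build an unfitted forest from the parsed options."""
    return RandomForestClassifier(n_estimators=args.n_estimators, n_jobs=args.n_jobs)


# Example: request 100 trees and keep the default of all cores.
args = parse_args(["--n-estimators", "100"])
clf = make_classifier(args)
print(clf.n_estimators, clf.n_jobs)
```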

ikmckenz / target-pred-py

Improve RandomForestClassifier model parameters #4