julianconstantine / Sberbank

Code for Sberbank Kaggle competition

Fix overfitting(?) with baseline GBM model #6

Closed: julianconstantine closed this issue 7 years ago

julianconstantine commented 7 years ago

I need to figure out why the distribution of predicted prices on the housing data is so heavily peaked at the center. There is no such bias on the training predictions, but it is very noticeable on the validation data.

This is probably some form of overfitting, but I also think it has something to do with my median-replacement scheme for all the null values (since the predictions all cluster toward the center of the actual distribution and have much lower variance).
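For reference, the replacement scheme is essentially the following (a minimal sketch; the file name and column handling are placeholders, not the exact pipeline):

```python
import pandas as pd

# Hypothetical load; the real pipeline builds more features first.
train = pd.read_csv("train.csv")

# Replace every null with the column's training-set median.
# This shrinks the variance of any feature with many NaNs, which
# could plausibly pull the predictions toward the center.
numeric_cols = train.select_dtypes(include="number").columns
train[numeric_cols] = train[numeric_cols].fillna(train[numeric_cols].median())
```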

julianconstantine commented 7 years ago

Even when I reduce the learning_rate from 0.1 to 0.01, the same overfitting phenomenon persists, except that now the clustering towards the center ALSO appears in the training predictions.

The feature "avg_price_per_building_per_month" routinely shows up as the most important feature by a mile, so perhaps it is median-replacement on this feature that is drastically overpowering everything else.
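A quick way to check this, assuming a fitted scikit-learn GradientBoostingRegressor (`gbm` and `X` below are placeholder names for the fitted model and the training feature frame):

```python
import pandas as pd

# gbm: fitted sklearn.ensemble.GradientBoostingRegressor (placeholder name)
# X: training feature DataFrame (placeholder name)
importances = pd.Series(gbm.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```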

New idea: Create more building-level features and check how many NaNs I get when creating "avg_price_per_building_per_month." If there are a lot, that could be the source of this central clustering.
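A sketch of that check; the column names (`timestamp`, `price_doc`, `building_id`) are guesses at the schema, with `building_id` standing in for whatever key identifies a building:

```python
import pandas as pd

# Month of each transaction (timestamp/price_doc names are assumptions).
train["month"] = pd.to_datetime(train["timestamp"]).dt.to_period("M")
test["month"] = pd.to_datetime(test["timestamp"]).dt.to_period("M")

avg_price = (
    train.groupby(["building_id", "month"])["price_doc"]
         .mean()
         .rename("avg_price_per_building_per_month")
         .reset_index()
)
test = test.merge(avg_price, on=["building_id", "month"], how="left")

# Rows whose building/month combo never appears in train come out NaN;
# if this count is large, median-filling them would explain the clustering.
print(test["avg_price_per_building_per_month"].isna().sum())
```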

julianconstantine commented 7 years ago

I'm starting to think that this has more to do with the extremely high importance of the average price per building features. Those features are so important that the GBM "overfits" to them at the expense of other variables.

This made me think that if I used deeper trees (I started out with depth-5 trees), the model might be able to capture variation that was being obscured by the building-price features. However, fitting trees with max_depth=8 did not improve anything.
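A rough sketch of that comparison, tracking train vs. validation error across depths (X_train/X_val/y_train/y_val are placeholder names for the existing split):

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# X_train, X_val, y_train, y_val: placeholder names for the existing split.
for depth in (2, 5, 8):
    gbm = GradientBoostingRegressor(max_depth=depth, learning_rate=0.1,
                                    n_estimators=100, random_state=0)
    gbm.fit(X_train, y_train)
    print(depth,
          mean_squared_error(y_train, gbm.predict(X_train)),
          mean_squared_error(y_val, gbm.predict(X_val)))
```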

The next step then is to figure out a better way to replace the missing values, and/or to create more features.

julianconstantine commented 7 years ago

I tried running a (very thorough) GridSearchCV for the scikit-learn GradientBoostingRegressor model (using negative mean-squared error as the evaluation metric) and it pushed us down to n_estimators=60, learning_rate=0.01, and max_depth=2, i.e. to a very simple GBM model, which is consistent with overfitting.
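A sketch of that kind of search; the grid below is an assumption, and only the selected values (n_estimators=60, learning_rate=0.01, max_depth=2) come from the actual run:

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Grid values are assumptions, except that the ones ultimately selected
# (n_estimators=60, learning_rate=0.01, max_depth=2) appear in it.
param_grid = {
    "n_estimators": [60, 100, 300],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 5, 8],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_train, y_train)  # X_train/y_train: placeholder names
print(search.best_params_)
```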

However, this makes little sense, since other people are running GBMs whose tuned parameters imply a much higher degree of (optimal) complexity in this dataset. My current approach has not yet scored below 0.50 on the leaderboard, while the LightGBM starter script I ran on just the base data plus a few extra features achieves 0.32.
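For comparison, a minimal LightGBM baseline looks something like this; this is not the actual starter script, whose exact parameters aren't reproduced here:

```python
import lightgbm as lgb

# Not the actual starter script; a generic regression baseline sketch.
# X_train/y_train/X_val: placeholder names.
dtrain = lgb.Dataset(X_train, label=y_train)
params = {"objective": "regression", "metric": "rmse",
          "learning_rate": 0.05, "num_leaves": 31}
model = lgb.train(params, dtrain, num_boost_round=200)
preds = model.predict(X_val)
```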

I am starting to think that one of the features I created by hand is actually hurting the performance of the algorithm. Therefore, I am going to close this issue for now and create another one to investigate.