Effect of attributes on the feature level classifier

bkowshik commented 7 years ago

Similar to the work on training size, we have questions about the effect of the number of attributes on the model:

Does the model have enough attributes?
Which attributes contribute how much to the model metrics?
Can fewer attributes be better in the long term?

Workflow

Get a list of all attributes available for training
Grow the set of training attributes by appending one attribute at a time from the attributes list
Train a model with these attributes from the training dataset
Get predictions from the model on this subset of attributes from the validation dataset
Store model metrics on the validation dataset and plot
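
A minimal sketch of the workflow above, assuming scikit-learn and NumPy arrays whose columns follow the order of the attribute list. The function name `metrics_by_attribute_count`, the synthetic data, the `attr_*` names, and the choice of precision, recall, and F1 as the stored metrics are all assumptions for illustration; the thread does not say which metrics were plotted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, precision_score, recall_score


def metrics_by_attribute_count(X_train, y_train, X_val, y_val, attributes):
    """Train on the first k attributes for k = 1..len(attributes) and score each model.

    X_train and X_val are assumed to be 2-D NumPy arrays whose columns
    follow the order of `attributes`.
    """
    results = []
    for k in range(1, len(attributes) + 1):
        # Fresh model for every attribute subset so runs do not share state.
        model = GradientBoostingClassifier(random_state=0)
        model.fit(X_train[:, :k], y_train)
        predicted = model.predict(X_val[:, :k])
        results.append({
            "attribute_added": attributes[k - 1],
            "attribute_count": k,
            "precision": precision_score(y_val, predicted, zero_division=0),
            "recall": recall_score(y_val, predicted, zero_division=0),
            "f1": f1_score(y_val, predicted, zero_division=0),
        })
    return results


# Synthetic stand-in data (hypothetical); the real run would load the training
# and validation datasets instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)
attributes = ["attr_a", "attr_b", "attr_c", "attr_d"]

results = metrics_by_attribute_count(X[:200], y[:200], X[200:], y[200:], attributes)
for row in results:
    print(row)
```

Each row of `results` can then be plotted against `attribute_count` to get one line per metric.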

Notes

[Graph: model metrics on the validation dataset against the number of attributes, added in CSV order]

There are interesting dips in metrics when the following attributes are added to the list of attributes:
- user_changesets_with_discussions_count
- old_user_name_special_characters_count
- feature_version
- feature_has_website_old
- iD
- Vespucci
The metrics roughly reach their maximum around the 20-attribute mark, apart from the occasional dips.
I am not sure what else to read from this graph.
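
One way to list the dips programmatically rather than reading them off the graph is to compare each metric value with the one before it. This is only a sketch: the helper `find_dips`, the `min_drop` threshold, and the sample scores are hypothetical and are not taken from the actual run.

```python
def find_dips(attributes, scores, min_drop=0.01):
    """Return (attribute, drop) pairs where adding the attribute lowered the score.

    scores[i] is assumed to be the metric after adding attributes[i].
    """
    dips = []
    for i in range(1, len(scores)):
        drop = scores[i - 1] - scores[i]
        if drop >= min_drop:
            dips.append((attributes[i], round(drop, 4)))
    return dips


# Hypothetical values for illustration only.
attributes = ["attr_a", "attr_b", "attr_c", "attr_d"]
f1_scores = [0.60, 0.72, 0.65, 0.74]

print(find_dips(attributes, f1_scores))  # [('attr_c', 0.07)]
```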

cc: @anandthakker @batpad @geohacker

bkowshik commented 7 years ago

What would it look like when attributes are added in order of importance for prediction instead of in the order they appear in the csv dataset?

A fitted GradientBoostingClassifier exposes an attribute, model.feature_importances_, that gives a score for each feature's importance. The higher the score, the more important the feature is for predictions.

Table with 10 attributes that have the highest importance scores
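
A ranking like this can be produced roughly as follows. This is a sketch under assumptions: the data is synthetic and the `attr_*` names are placeholders, so the output does not reproduce the scores from the actual model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data (hypothetical); attr_a and attr_c drive the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)
attributes = ["attr_a", "attr_b", "attr_c", "attr_d"]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# feature_importances_ is an attribute of the fitted model, one score per column.
order = np.argsort(model.feature_importances_)[::-1]  # most important first
ranked = [(attributes[i], float(model.feature_importances_[i])) for i in order]

# Show up to the 10 highest-scoring attributes.
for name, score in ranked[:10]:
    print(f"{name}: {score:.3f}")
```

Indexing the columns with `X[:, order]` puts the attributes in importance order, so they can be added one at a time starting with the most important.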

Now, using the same workflow as ^, we add one attribute at a time, starting with the most important attributes, to get the graph below.

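[Graph: model metrics on the validation dataset against the number of attributes, added in order of importance]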

Because we have the best attributes first, the metrics very quickly reach their maximum value. This is what we expect to happen.
Unusually, we still get large dips even when we are well past 50 attributes.
The dips are now for the following attributes:
- feature_name_translations_count_old
- place
- MAPS.ME
- feature_area
- sport_old
- office
- power
- railway_old
- barrier_old
- railway
- historic
- changeset_comment_naughty_words_count
- public_transport_old
- route
There is no attribute in common between the list ^ and the list ^^

bkowshik commented 7 years ago

After increasing the dataset size, we still see the unusual dips. 🤔

[Graph: model metrics against the number of attributes, after increasing the dataset size]

mapbox / gabbar

Effect of attributes on the feature level classifier #59