Tweaking model parameters to improve predictions

bkowshik commented 7 years ago

NOTE: Have updated this post to reflect new performance numbers and graph.

Tweaking model parameters seems to have quite a lot of impact on the results. The results are looking so much better when compared to the previous run.

Cross validation score: 0.8382
The model now flags more changesets as potentially harmful: 1,775 changesets
Training time of the model went up ⬆️, from 17 seconds in the last run to 5 minutes.
Learning curve is looking good too.

[Image: learning curve]

Current best model parameters for the SVC model:

{
    "probability": true,
    "C": 10000,
    "gamma": "auto",
    "cache_size": 800,
    "class_weight": "balanced",
    "kernel": "rbf"
}
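
For anyone who wants to try these parameters, here is a minimal sketch of how they could be passed to scikit-learn's `SVC` and cross-validated. The synthetic dataset from `make_classification` is only a stand-in for the real changeset features (an assumption, not the project's data). The post does not say which scoring metric produced 0.8382, so the sketch uses the default.

```python
import json

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Parameters copied verbatim from the post above.
params = json.loads("""
{
    "probability": true,
    "C": 10000,
    "gamma": "auto",
    "cache_size": 800,
    "class_weight": "balanced",
    "kernel": "rbf"
}
""")

# Assumption: synthetic, imbalanced stand-in data, NOT the real changeset features.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

model = SVC(**params)

# 5-fold cross-validation with the estimator's default scorer (accuracy).
# The metric behind the reported score is not stated in the post.
scores = cross_val_score(model, X, y, cv=5)
print("Cross validation score: %.4f" % scores.mean())
```

With `C` this large the fit is close to hard-margin, which is probably a big part of why training time went up so much compared to the previous run.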

Next actions

[x] Iterate more on value combinations for model parameters
[x] Post model parameters of the model with best performance

cc: @anandthakker

bkowshik commented 7 years ago

In the scenario where all changesets predicted to be potentially harmful on osmcha are 👀 by users on osmcha, I was wondering, 💭 if we could calculate the hit rate as follows.

For labelled changeset data:

Changesets predicted harmful and labelled harmful on osmcha: 1,201
Changesets predicted harmful but labelled not harmful on osmcha: 1,964
Total changesets predicted harmful: 1201 + 1964 = 3,165
Hit rate: 1201 * 100 / 3165 = 37.95% 😬

i.e., if 100 changesets labelled by the current model as potentially problematic are manually 👀, we should expect to find about 38 of them to be actually problematic.
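
The arithmetic above can be sketched in a few lines. The helper name `hit_rate` is made up for illustration. Note that this ratio (predicted harmful and labelled harmful, divided by all predicted harmful) is what is usually called precision for the harmful class.

```python
def hit_rate(labelled_harmful, labelled_not_harmful):
    """Percentage of changesets predicted harmful that were also labelled harmful.

    Both arguments count changesets the model predicted as harmful,
    split by how they were labelled on osmcha.
    """
    predicted_harmful = labelled_harmful + labelled_not_harmful
    if predicted_harmful == 0:
        raise ValueError("no changesets were predicted harmful")
    return 100.0 * labelled_harmful / predicted_harmful


# Counts taken from the labelled changeset data in this comment.
rate = hit_rate(1201, 1964)
print(f"Hit rate: {rate:.2f}%")  # Hit rate: 37.95%
```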

NOTE: Posting here for feedback on whether this is the right way to measure hit rate.

bkowshik commented 7 years ago

The model parameters that yielded the best model performance are:

{
    "probability": true,
    "C": 10000,
    "gamma": "auto",
    "cache_size": 800,
    "class_weight": "balanced",
    "kernel": "rbf"
}

mapbox / gabbar

Tweaking model parameters to improve predictions #25
