idellang opened this issue 2 years ago
Hey, thanks for reporting. I happen to know the maintainer of this project is on paternity leave :) @sbjelogr Perhaps you can have a look?
Sure! I think you just need to add a negative sign somewhere in the equation. I followed the code, added a negative sign when multiplying the WoE by the coefficient, and was able to get the same results as in the example. I'm not really familiar with OOP yet, so I just created a custom function.
Thanks for this issue. I think the ScoreCardPoints is actually quite broken and I propose to remove it.
Looking at a minimal example, we see that the woe_dict
cannot deal with the "Other" and "Missing" categories, so using the encoder to calculate the WoE for these cases doesn't work. This produces lots of missing values later on and breaks the point mapping.
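A hypothetical minimal sketch of that failure mode (the mapping values and `Series.map` lookup are illustrative stand-ins, not skorecard's actual internals): buckets absent from the WoE lookup come back as NaN, which then poisons every downstream point calculation.

```python
import pandas as pd

# Illustrative WoE lookup that has no entries for the special buckets
# -1 ("Missing") and -2 ("Other"); the numbers are made up.
woe_map = {0: -0.149, 1: 0.259, 2: -0.175}

buckets = pd.Series([0, 1, 2, -1, -2])
woe = buckets.map(woe_map)  # unmapped buckets become NaN
print(woe.isna().sum())  # -> 2: both special buckets are lost
```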
@sbjelogr @timvink It seems the ScoreCardPoints
method can just be replaced with the calibrate_to_master_scale
function, or am I missing another use for it? I propose to remove it.
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from skorecard import datasets
from skorecard import Skorecard
from skorecard.bucketers import OrdinalCategoricalBucketer
from skorecard.rescale import calibrate_to_master_scale, ScoreCardPoints

X, y = datasets.load_uci_credit_card(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X[["EDUCATION", "MARRIAGE"]], y)

# Bucket the categorical feature and fit the scorecard on it
o = OrdinalCategoricalBucketer(variables=["EDUCATION"])
sc = Skorecard(
    bucketing=o,
    variables=["EDUCATION"],
    calculate_stats=True,
)
sc.fit(X_train, y_train)

scp = ScoreCardPoints(skorecard_model=sc, pdo=25, ref_score=400, ref_odds=20)
sc.bucket_table("EDUCATION")
```
| bucket | label | Count | Count (%) | Non-event | Event | Event Rate | WoE | IV |
|---|---|---|---|---|---|---|---|---|
| -2 | Other | 57.0 | 1.27 | 53.0 | 4.0 | 0.070175 | 1.373 | 0.016 |
| -1 | Missing | 0.0 | 0.00 | 0.0 | 0.0 | NaN | 0.000 | 0.000 |
| 0 | 2.0 | 2026.0 | 45.02 | 1505.0 | 521.0 | 0.257157 | -0.149 | 0.010 |
| 1 | 1.0 | 1662.0 | 36.93 | 1351.0 | 311.0 | 0.187124 | 0.259 | 0.023 |
| 2 | 3.0 | 755.0 | 16.78 | 557.0 | 198.0 | 0.262252 | -0.175 | 0.005 |
```python
# Inspect the fitted WoE encoder's mapping
woe_enc = scp.skorecard_model.pipeline_.named_steps["encoder"]
woe_dict = woe_enc.mapping
woe_dict['EDUCATION']
```

```
EDUCATION
 1   -0.258126
 2    0.148666
 3    0.177157
 4   -1.171335
-1    0.000000
-2    0.000000
dtype: float64
```
See that the WoE for -1 and -2 is wrong: both default to 0.0, even though the bucket table above gives 1.373 for the "Other" bucket (-2).
@orchardbirds, they are not exactly the same.
calibrate_to_master_scale
just takes the predicted probabilities and rescales them.
ScoreCardPoints
does the same via the transformer.
However, ScoreCardPoints
takes a Skorecard
model as input and applies only the features selected within the model (otherwise the calculation of the coefficients is wrong, as the points are distributed among more features).
In addition, it provides an extra tabular representation of the points per feature per bucket.
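For reference, the rescaling both functions are meant to perform is the standard PDO (points to double the odds) scaling. A minimal sketch, assuming skorecard follows the usual formulation (the function and parameter names here just mirror the ScoreCardPoints call above, not the library's internals):

```python
import numpy as np

def to_master_scale(proba, pdo=25, ref_score=400, ref_odds=20):
    """Standard PDO scaling sketch: map a predicted probability of the
    event to a score, anchored at ref_score for ref_odds good/bad odds."""
    odds = (1 - proba) / proba           # good/bad odds
    factor = pdo / np.log(2)             # points added when odds double
    offset = ref_score - factor * np.log(ref_odds)
    return offset + factor * np.log(odds)

# A probability corresponding to odds of 20 maps to the reference score,
# and doubling the odds adds exactly pdo points:
print(round(to_master_scale(1 / 21), 2))  # -> 400.0 (odds = 20)
print(round(to_master_scale(1 / 41), 2))  # -> 425.0 (odds = 40)
```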
@idellang, I will be investigating this issue in the coming days. I'll keep you posted.
Please excuse the way I reported this issue; this is my first time reporting a GitHub issue. I get different results from the ScoreCardPoints object: the scores from calibrate_to_master_scale applied to proba_train differ from the scores from scp.transform(X_train). I believe the calibrate_to_master_scale scores are the right ones.
EDIT: I tried following the last tutorial example, 'Scorecard Model', and I encounter the same problem. Going through the example, I noticed that the coefficients from scorecard.get_stats() are negative and the values from scorecard.woe_transform(X_test) are positive, but I get positive coefficients and negative scorecard.woe_transform(X_test) values.
Check the following images. In this example, I used a single categorical variable educational attainment versus default rate. Thank you!
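On the sign flip reported above: WoE is sometimes defined as ln(% non-events / % events) and sometimes with the ratio inverted, and mixing the two conventions flips every sign. A quick check against bucket 0 of the bucket table earlier in the thread (the totals are summed from its Non-event and Event columns):

```python
import numpy as np

# Bucket 0 (label 2.0) from the bucket table above, plus column totals
# summed over all five buckets in that table.
non_event, event = 1505.0, 521.0
tot_non_event, tot_event = 3466.0, 1034.0

woe_a = np.log((non_event / tot_non_event) / (event / tot_event))
woe_b = -woe_a  # the inverted convention: ln(% events / % non-events)
print(round(woe_a, 3), round(woe_b, 3))  # -> -0.149 0.149
```

The first convention reproduces the -0.149 shown in the bucket table, while the encoder's mapping printed +0.148666 for that category, consistent with one component using the inverted convention.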