Closed maksymiuks closed 4 years ago
Hey Szymon
Thanks for the feedback. It would indeed be weird if the response needed a swap here!
Ranger has an encoding issue with character (non-factor) columns, see https://github.com/imbs-hl/ranger/issues/502
Since your model only uses factors, this cannot be the reason.
Is predict working properly?
library(flashlight)
library(MetricsWeighted)
library(ranger)
set.seed(1)
data(titanic_imputed, package = "DALEX")
ranger_model <- ranger(survived ~ .,
                       data = titanic_imputed,
                       classification = TRUE,
                       probability = TRUE)
custom_predict <- function(X.model, new_data) {
  predict(X.model, new_data)$predictions[, 2]
}
fl <- flashlight(model = ranger_model,
                 data = titanic_imputed, y = "survived", label = "Titanic Ranger",
                 metrics = list(auc = AUC),
                 predict_function = custom_predict)
# Use predict method of flashlight
predict(fl, data=head(titanic_imputed))
0.09135881 0.27651991 0.11710044 0.59006178 0.72432168 0.23170322
# Use predict method of ranger
predict(ranger_model, head(titanic_imputed))$predictions[, 2]
0.09135881 0.27651991 0.11710044 0.59006178 0.72432168 0.23170322
Looks good to me.
What does the distribution of the predictor look like?
hist(titanic_imputed$fare)
Now it looks as if we have identified the problem: the distribution is very skewed, so most of the ALE evaluation points are based on only a few observations!
Select evaluation points in the dense part of the covariate.
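To make the "very few observations" point concrete, here is a minimal sketch in base R. It uses a simulated right-skewed variable as a stand-in for fare (an assumption, not the Titanic data) and compares how many rows fall into equally spaced intervals versus quantile-based intervals. It is meant only as an illustration and does not claim to reproduce how flashlight picks its evaluation points.

```r
set.seed(1)

# Simulated right-skewed stand-in for a fare-like variable
# (assumption: log-normal shape, not the actual Titanic data)
fare <- rlnorm(2000, meanlog = 3, sdlog = 1)

# Equally spaced breaks across the full range of the variable
equal_breaks <- seq(min(fare), max(fare), length.out = 11)
equal_counts <- table(cut(fare, equal_breaks, include.lowest = TRUE))

# Quantile-based breaks: each interval holds about the same number of rows
quantile_breaks <- quantile(fare, probs = seq(0, 1, by = 0.1))
quantile_counts <- table(cut(fare, quantile_breaks, include.lowest = TRUE))

equal_counts     # most rows pile up in the first interval
quantile_counts  # rows are spread evenly
```

With equally spaced breaks, the upper intervals contain only a handful of rows, so any local effect estimated there is very noisy.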
evaluate_at <- 0:100
pdp <- light_profile(fl, v = "fare", pd_evaluate_at = evaluate_at)
plot(pdp)
ale <- light_profile(fl, v = "fare", type = "ale", pd_evaluate_at = evaluate_at)
plot(ale)
Now there is some similarity across the methods. The remaining differences probably come from the correlation of fare with parch and class:
boxplot(fare~class, data = titanic_imputed)
boxplot(fare~parch, data = titanic_imputed)
So I'd actually expect differences between PDP and ALE, as the "everything else being fixed" logic behind PDP is not realistic when the predictors are correlated.
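As a rough illustration of why the two methods can disagree under correlation, here is a simplified base R sketch. Everything in it is hypothetical: the data are simulated, `pred_fun` stands in for a fitted model's prediction function, and the ALE is an uncentered version on a fixed grid, not flashlight's implementation.

```r
set.seed(1)
n  <- 2000
x1 <- runif(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # x2 is strongly correlated with x1
d  <- data.frame(x1 = x1, x2 = x2)

# Hypothetical stand-in for a model's prediction function (with interaction)
pred_fun <- function(data) data$x1 * data$x2

grid <- seq(0, 1, by = 0.1)

# Partial dependence: set x1 to each grid value for ALL rows, then average.
# This pairs x1 values with x2 values that never occur together in the data.
pd <- sapply(grid, function(v) {
  tmp <- d
  tmp$x1 <- v
  mean(pred_fun(tmp))
})

# Simplified (uncentered) ALE: within each interval, move only the rows that
# lie inside it from the lower to the upper break, average the change in
# prediction, then accumulate over intervals.
bin <- cut(d$x1, breaks = grid, include.lowest = TRUE, labels = FALSE)
local_effect <- sapply(seq_len(length(grid) - 1), function(k) {
  rows <- d[bin == k, , drop = FALSE]
  lo <- rows
  lo$x1 <- grid[k]
  hi <- rows
  hi$x1 <- grid[k + 1]
  mean(pred_fun(hi) - pred_fun(lo))
})
ale <- c(0, cumsum(local_effect))

# Compare both curves, shifted to start at 0
round(cbind(grid, pd = pd - pd[1], ale = ale), 3)
```

In this toy setup the PDP is close to a straight line, while the accumulated local effects curve bends, because ALE only uses the x2 values that actually occur near each x1 value.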
Thank you for the extensive response, I really appreciate it. Indeed, it looks like the problem lies in the skewed distribution and in my use of the default parameters.
Once again, thanks :)
Hi,
First of all, I'd like to say how impressed I am with this package!
However, during my research work, I've encountered weird behavior in the function that generates variable profiles. Let me show it.
Here we see a fairly reasonable ALE plot for the provided data. The general trend in the data should be that the more a passenger paid, the more likely they were to survive. However, the plot was created using the probability of class 0, which will be important. Now let's create a PDP.
Using the same column, which contains the probability of belonging to class 0, we get an inverted PDP: it shows that the probability decreases as fare increases. To get a plot that looks right, I had to swap the columns in custom_predict so that it returns the probability of belonging to class 1. Overall, it looks like one of the functions inverts the probabilities. Is this intended?
Best regards Szymon Maksymiuk