marjoleinF / pre

an R package for deriving Prediction Rule Ensembles

Issue with generating rules with original features #26

Closed jjchry closed 3 years ago

jjchry commented 3 years ago

Hi,

I tried the pre model on my dataset with 240 variables. There seems to be some issue with how single-feature rules are generated.

1.) I have categorical columns in the dataset. After dummy variable creation, my hypothetical variable 'X_Flag_lev_A' has two values, 0 and 1. When I run importance(fit, standardize = TRUE, round = 4L) and access the base importances, the feature is reported as

rule | description
X_Flag_lev_A | 0 <= X_Flag_lev_A <= 1

This in turn is wrong, as both boundary values are satisfied. (Another difference I noticed is that the rule has the same name as the original feature (X_Flag_lev_A) rather than a rule name such as rule1 or rule2.)

2.) The issue is also around single-feature rules. Consider a hypothetical numerical feature with the following name and description:

rule | description
XX_amount | 5 <= XX_amount <= 140

Based on this description, I evaluated the condition on all records in my training data and created a flag variable with values Y and N for whether 5 <= XX_amount <= 140 is satisfied or not, then calculated the frequency table between the new flag variable and my binomial target. table(Flagvariable, Target) gives:

      Lose  Win
  Y     58   42
  N     51   49

This in turn means that, irrespective of whether the condition 5 <= XX_amount <= 140 is satisfied or not, there is a high chance of losing, yet the coefficient of the rule XX_amount is positive.

I am not sure whether these are actual issues or a lack of understanding on my part. It would be really helpful if you could have a look at these points.

marjoleinF commented 3 years ago

This is a feature, not a bug ;)

Factors should not be converted to dummy-coded variables, because that may negatively affect the sparsity of the final ensemble and rule quality. Make sure factors are coded as such in the data frame supplied to function pre().
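A minimal sketch of this recommendation (the data and variable names here are hypothetical, not from the original report); the factor column is passed to pre() as-is:

```r
library(pre)

## Hypothetical data: 'X_Flag' is categorical. Keep it as a factor;
## do NOT expand it into 0/1 dummy columns before calling pre().
set.seed(42)
dat <- data.frame(
  y      = rnorm(200),
  X_Flag = factor(sample(c("A", "B", "C"), 200, replace = TRUE)),
  amount = runif(200, 0, 300)
)

## pre() handles the factor coding internally when deriving rules
fit <- pre(y ~ ., data = dat)
```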

The returned ensemble consists of rules and/or linear terms. Rules are referred to by 'rule[number]', linear terms by the original variable name. Thus, a term that does not have 'rule[number]' as a label is a linear term (predictor). The description column then provides the winsorizing points used for winsorizing this variable; it is not a description of a prediction rule. See sections 2.2 and 4.1 of Fokkema (2020).
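A sketch of inspecting this in practice, using the airquality example from the package documentation (output layout approximate):

```r
library(pre)

## Standard example data from the pre documentation
airq <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens <- pre(Ozone ~ ., data = airq)

## Base learner importances: terms labelled rule1, rule2, ... are rules;
## terms labelled with a variable name (e.g. Wind) are linear terms, and
## their 'description' column shows the winsorizing points, not a rule.
imps <- importance(airq.ens, round = 4L, plot = FALSE)
head(imps$baseimps)
```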

For your dummy-coded variable, it appears that no winsorizing was performed, which will almost always be the case for dummy-coded variables (if winsorizing were performed on a variable with only two values, the variable would become a constant and could not be selected for the final ensemble).

The winsorizing points reflect the .05 and .95 quantiles of the univariate distribution, so I'm not sure how you got the frequencies in the confusion matrix for XX_amount. Should be relative proportions of about .10 vs. .90, not around .50-.50.
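The reported winsorizing points can be checked against the univariate quantiles directly; a sketch with a hypothetical XX_amount variable:

```r
## Hypothetical numeric predictor
set.seed(1)
XX_amount <- rexp(1000, rate = 1 / 50)

## Winsorizing points correspond to the .05 and .95 quantiles
wins <- quantile(XX_amount, probs = c(0.05, 0.95))

## By construction, roughly 10% of observations fall outside these points,
## not the roughly 50/50 split implied by the frequencies above
mean(XX_amount < wins[1] | XX_amount > wins[2])
```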

Hope this helps! If you read anywhere that factors need to be coded as dummy indicators, I would like to know. I've heard several users assume this, but I don't know what the source of it is; perhaps I can correct or clarify it.

Fokkema, M. (2020). Fitting Prediction Rule Ensembles with R Package pre. Journal of Statistical Software, 92(1), 1-30. https://www.jstatsoft.org/article/view/v092i12

jjchry commented 3 years ago

Thank you, Marjolein Fokkema.

I have understood that the issue is caused by keeping the factors dummy coded. Also, I confirm that the documentation does not say factors need to be dummy coded; this was a general intuition, as most packages require factors to be dummy coded.

The 2nd issue I mentioned was not with respect to winsorizing; this happens for some of the rules as well. What I meant to do was to explain the coefficient based on what happened in the training data.

Another hypothetical example:

rule | description
RuleXX | XX_amount <= 140

table(RuleXX, Target) gives:

      Lose  Win
  Y     58   42
  N     51   49

What I am trying to see are the conditional probabilities of Win | Condition Satisfied, Win | Condition Not Satisfied, Lose | Condition Satisfied, and Lose | Condition Not Satisfied based on the training data.

So in this case:

p(Win | ConditionSatisfied) = 42/(58+42) = .42
p(Lose | ConditionSatisfied) = 58/(58+42) = .58
p(Win | ConditionNotSatisfied) = 49/(51+49) = .49
p(Lose | ConditionNotSatisfied) = 51/(51+49) = .51
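For reference, these conditional probabilities can be reproduced from the 2x2 counts with prop.table():

```r
## Counts from the hypothetical table(RuleXX, Target) above
tab <- matrix(c(58, 42,
                51, 49),
              nrow = 2, byrow = TRUE,
              dimnames = list(RuleXX = c("Y", "N"),
                              Target = c("Lose", "Win")))

## Row-wise conditional probabilities P(Target | RuleXX):
## Y: 0.58 / 0.42, N: 0.51 / 0.49
prop.table(tab, margin = 1)
```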

In this case, irrespective of whether the condition is satisfied or not, the chance of Lose is high, but RuleXX is given a positive coefficient. It could be a case of a confounding variable, but I just wanted to double check whether I am doing anything wrong.

Another thing I noticed is that if the dependent variable is an ordered factor, an error is raised saying that the response variable is not a factor. This was a bit confusing for me; I debugged and found that the issue is the ordering of the factor. The model works fine when the ordering is removed. To reproduce:

```r
library("DALEX")
library(pre)
titanic_train <- titanic[, c("survived", "class", "gender", "age",
                             "sibsp", "parch", "fare", "embarked")]
titanic_train$survived <- factor(titanic_train$survived, ordered = TRUE)
titanic_train$gender <- factor(titanic_train$gender)
titanic_train$embarked <- factor(titanic_train$embarked)
titanic_train <- na.omit(titanic_train)

fit <- pre(survived ~ ., data = titanic_train, family = "binomial")
```
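A self-contained sketch of the underlying cause and of the workaround of dropping the ordering (variable names hypothetical):

```r
## An ordered factor response triggers the "not a factor" error;
## dropping the ordering (the levels are kept) avoids it
y  <- factor(c("no", "yes", "yes", "no"), ordered = TRUE)
y2 <- factor(y, ordered = FALSE)

is.ordered(y)                     # TRUE
is.ordered(y2)                    # FALSE
identical(levels(y), levels(y2))  # TRUE: same levels, only ordering dropped
```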

boennecd commented 3 years ago

> I have understood that the issue is caused by keeping the factors dummy coded. Also, I confirm that the documentation does not say factors need to be dummy coded; this was a general intuition, as most packages require factors to be dummy coded.

Almost all base R methods want factors rather than dummy codes, as model.frame and model.matrix take care of the coding and avoid linear dependence. We do the same, as do many other R packages.

> The 2nd issue I mentioned was not with respect to winsorizing; this happens for some of the rules as well. What I meant to do was to explain the coefficient based on what happened in the training data.

> Another hypothetical example:
>
> rule | description
> RuleXX | XX_amount <= 140
>
> table(RuleXX, Target) gives:
>
>       Lose  Win
>   Y     58   42
>   N     51   49
>
> What I am trying to see are the conditional probabilities of Win | Condition Satisfied, Win | Condition Not Satisfied, Lose | Condition Satisfied, and Lose | Condition Not Satisfied based on the training data.
>
> So in this case:
>
> p(Win | ConditionSatisfied) = 42/(58+42) = .42
> p(Lose | ConditionSatisfied) = 58/(58+42) = .58
> p(Win | ConditionNotSatisfied) = 49/(51+49) = .49
> p(Lose | ConditionNotSatisfied) = 51/(51+49) = .51

The marginal intuition may be misleading if there is more than one rule; it can be an example of Simpson's paradox. If there is only one rule, then this result would indeed seem odd/wrong, but only then.
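A toy illustration of this kind of reversal (Simpson's paradox), with a hypothetical second rule acting as a confounder:

```r
## Within each stratum of the confounder the rule raises P(Win),
## but the rule fires mostly in the low-win stratum, so the
## marginal table makes it look harmful.
set.seed(7)
n <- 10000
confounder <- rbinom(n, 1, 0.5)
rule <- rbinom(n, 1, ifelse(confounder == 1, 0.9, 0.1))
win  <- rbinom(n, 1, ifelse(confounder == 1, 0.2, 0.6) + 0.1 * rule)

tapply(win, rule, mean)                    # marginal: rule looks harmful
tapply(win, list(rule, confounder), mean)  # conditional: rule helps in both strata
```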

jjchry commented 3 years ago

Thank you, boennecd. Closing this thread.

marjoleinF commented 3 years ago

Thanks for reporting this issue!

> Another thing I noticed is that if the dependent variable is an ordered factor, an error is raised saying that the response variable is not a factor. This was a bit confusing for me; I debugged and found that the issue is the ordering of the factor. The model works fine when the ordering is removed.

Fixed in the current development version. Ordered factors as response variables will be treated as unordered factors (as pre() has no functionality for ordered factors), and a warning is issued about this.