ja-thomas / OMLbots


Data conversion for xgboost #24

Closed PhilippPro closed 7 years ago

PhilippPro commented 7 years ago

xgboost only accepts numerical features, so we should decide how to convert factor variables. I have now implemented an automatic conversion to numeric: the factor is ordered by its levels and then turned into a numerical variable. Alternatively, we could convert each factor into several binary features, but the feature space can get big for factors with many levels.
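A small sketch of the two options described above, in Python for illustration only (the bot itself is R code; the toy data and function names here are made up):

```python
def ordinal_encode(values):
    """Map each category to its rank among the sorted unique levels
    (the 'order the levels, make it numeric' option)."""
    levels = sorted(set(values))
    index = {lvl: i for i, lvl in enumerate(levels)}
    return [index[v] for v in values]

def one_hot_encode(values):
    """Expand a categorical column into one binary column per level
    (the 'several binary features' option)."""
    levels = sorted(set(values))
    return [[int(v == lvl) for lvl in levels] for v in values]

colors = ["red", "green", "blue", "green"]
print(ordinal_encode(colors))   # [2, 1, 0, 1]
print(one_hot_encode(colors))   # 4 rows x 3 binary columns
```

The one-hot version is where the feature-space blowup comes from: one new column per level, per factor.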

ja-thomas commented 7 years ago

We need the second and third solution as wrappers. Otherwise we overfit.

Also, this is a more general issue and not just relevant to xgboost.

ja-thomas commented 7 years ago

Oh, I just realized this wasn't in the mlr repo but in the OMLbot.

DanielKuehn87 commented 7 years ago

I think we should have a look at the datasets we will run and preprocess them, then use the preprocessed study_14 data for all learners. What I don't want to do is something like: `if (learner == "xyz")` then use this and that preprocessing.

PhilippPro commented 7 years ago

I see this differently. To me it would be very "unfair" if all variables were converted to numeric, because some learners can actually handle factors, and handle them better than the transformed numeric versions. I think we should rather think about a good method to transform the variables.

berndbischl commented 7 years ago

I agree with @PhilippPro. You cannot convert everything into a format that xgboost likes.

You need to add wrappers to such algorithms.

DanielKuehn87 commented 7 years ago

I don't want to transform the data only for xgboost, but rather work on numeric datasets without NA values in general. We could either write a generic wrapper for this within R or create a transformed study_14 dataset. I think this is necessary so we can use the database for comparing learners later on. Afaik, ranger uses data.matrix to convert factors to numerics. If we now write a wrapper using model.matrix to convert factors for xgboost, we can't really compare the two models, because they use different transformations for the factors (and xgboost might perform better only because of this). We could also reduce the overhead for learners that apply such transformations if we used a transformed study_14.
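A sketch of the "preprocess once, feed everyone the same matrix" idea, again in Python purely for illustration (function and data are hypothetical; in R this would roughly correspond to applying one fixed data.matrix-style integer coding up front):

```python
def preprocess_once(rows, cat_cols):
    """Apply one fixed encoding (integer codes per categorical column)
    to the whole dataset, so every learner trains on the identical
    numeric matrix instead of each learner encoding factors its own way."""
    # Build a per-column level -> code map from the full dataset.
    enc = {c: {lvl: i for i, lvl in enumerate(sorted({r[c] for r in rows}))}
           for c in cat_cols}
    return [[enc[c][v] if c in enc else v for c, v in enumerate(r)]
            for r in rows]

data = [["red", 1.5], ["blue", 0.3], ["red", 2.0]]
print(preprocess_once(data, cat_cols={0}))  # [[1, 1.5], [0, 0.3], [1, 2.0]]
```

The point of the argument above: if one learner effectively sees integer codes and another sees dummy columns, performance differences may just reflect the encoding, not the algorithm.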

PhilippPro commented 7 years ago

I don't think ranger converts factors to numerics. I think it splits factor variables by trying out every available subset of the levels. I talked about this with Marvin recently. That's why it also cannot handle too many levels in factor variables (around 50 max). We can talk about that on Thursday. ;)
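The "~50 levels max" limit follows from simple combinatorics: splitting k unordered levels into two non-empty groups allows 2^(k-1) - 1 candidate splits, which explodes quickly. A quick check:

```python
def n_binary_splits(k):
    """Number of ways to split k unordered factor levels into two
    non-empty groups at a tree node: 2**(k-1) - 1."""
    return 2 ** (k - 1) - 1

print(n_binary_splits(3))    # 3
print(n_binary_splits(10))   # 511
print(n_binary_splits(50))   # 562949953421311 -- clearly infeasible to enumerate
```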

DanielKuehn87 commented 7 years ago

https://www.r-bloggers.com/on-ranger-respect-unordered-factors/ I'm not sure if this has changed in the current version, though.

PhilippPro commented 7 years ago

Oh ok, I didn't know that. We have to be careful here; maybe we should expose it as a new hyperparameter or just set it to a different value.