jpmml / r2pmml

R library for converting R models to PMML
GNU Affero General Public License v3.0

Variable disappearance in xgboost model #56

Closed zakirovde closed 4 years ago

zakirovde commented 5 years ago

Hi.

I'm trying to convert my xgboost model to PMML format. The conversion goes OK, but instead of 26 variables I get only 14. The most interesting part is that they're the first 14 variables from the list. E.g., if I drop any variable from that top 14, the 15th variable takes its place. I'm totally confused. Could someone suggest a solution? Or maybe it's a bug?

#target_bin is the binary target variable
#leave is the list of variable indices: leave <- which(names(train) %in% c('target_bin', 'var1', ...'var26'))
#then the model itself
xgb_bl1 <- xgboost(data = data.matrix(train[,leave] %>% select(-target_bin)), 
                  label = data.matrix(train[,leave]$target_bin), 
                  eta = .15,
                  max_depth = 5, 
                  nround=300, 
                  subsample = 0.65,
                  colsample_bytree = 0.35,
                  objective = "binary:logistic"
                  )
leave.fmap = genFMap(train[,leave] %>% select(-target_bin))
r2pmml(xgb_bl1, "xgb_virtu.pmml", fmap = leave.fmap, response_name = "target_bin", response_levels = c("0", "1"))
vruusmann commented 5 years ago

The conversion goes OK, but instead of 26 variables I get only 14

TLDR: If you try to use the PMML document for scoring, do you get correct predictions or not?

It is a conversion feature (not a bug), that the PMML document only contains information about features that are truly needed for scoring. All no-op features are automatically excluded.
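To see this for yourself, you can inspect which features the booster actually uses for splitting. A minimal sketch, assuming xgb_bl1 is the model trained above; features absent from the importance table carry no splits and are the ones the converter drops:

```r
library("xgboost")

# Features never chosen for a split contribute nothing to scoring;
# the PMML converter omits them. The importance table lists only the
# features that appear in at least one split of the booster.
imp <- xgb.importance(model = xgb_bl1)
print(imp$Feature)
```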

zakirovde commented 5 years ago

The conversion goes OK, but instead of 26 variables I get only 14

TLDR: If you try to use the PMML document for scoring, do you get correct predictions or not?

The thing is that the PMML model doesn't contain the 14 most important variables, but simply the first 14 variables from the list, like var1, var2, var3, ..., var14. So I get wrong predictions. Is there a way to turn off that feature, so the PMML model contains all the variables I want it to have?

vruusmann commented 5 years ago

I don't like how you're constructing a feature matrix object here:

data = data.matrix(train[,leave] %>% select(-target_bin))

Define a proper data.frame object in one place (not once for the xgboost() function call and again for the genFMap() function call; they might give different results), and create a proper DMatrix object based on it using the r2pmml::genDMatrix() function.

I suspect that your data.frame objects are not consistent, and that the manual data.matrix() call reorders data columns one more time.

Use the syntax provided in the README file (one central data.frame, then feeding it to the r2pmml::genDMatrix() and r2pmml::genFMap() functions). This should work. Once you've verified this claim locally, only then start making your hacks.
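The recommended workflow can be sketched as follows. This is an illustration based on the snippets in this thread, not verbatim README code; it assumes the dplyr package for select(), and the df_y/df_X argument names used elsewhere in this discussion:

```r
library("dplyr")
library("xgboost")
library("r2pmml")

# One central data.frame of features; the label is kept separately
X <- train[, leave] %>% select(-target_bin)
y <- as.numeric(train$target_bin) - 1

# Derive BOTH the feature map and the DMatrix from the same data.frame,
# so column order and factor encodings cannot drift apart
fmap <- genFMap(X)
dtrain <- genDMatrix(df_y = y, df_X = X)

model <- xgboost(data = dtrain, eta = .15, max_depth = 5,
                 nrounds = 300, subsample = 0.65,
                 colsample_bytree = 0.35,
                 objective = "binary:logistic")

r2pmml(model, "xgb_virtu.pmml", fmap = fmap,
       response_name = "target_bin", response_levels = c("0", "1"))
```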

zakirovde commented 5 years ago

I don't like how you're constructing a feature matrix object here:

This genDMatrix thing works!

label <- as.numeric(train$target_bin)-1
data <- as.matrix(train[,leave])
mode(data) <- 'double'
dtrain <- genDMatrix(df_y = label, df_X = data)
xgb_bl1 <- xgboost(data=dtrain,
                  eta = .15,
                  max_depth = 5, 
                  nround=300, 
                  subsample = 0.65,
                  colsample_bytree = 0.35,
                  objective = "binary:logistic"
                  )

The PMML model contains all 26 variables, but somehow its probability result differs from the one I get using the predict function: predict(xgb_bl1, data.matrix(test[,leave])). I'm not sure that this is the correct use of predict, so I even tried this:

dtest <- genDMatrix(df_X = test[,leave], df_y = NULL)
#if I don't declare df_y, then I get an error
predict(xgb_bl1, dtest)

But I still get a different output. Maybe I pass the test dataset to the PMML model in a wrong way? I just send it in JSON format. But still, comparing to predict in both the genDMatrix and source forms, I get different results.

vruusmann commented 5 years ago

I suspect that the use of the data.matrix() function is the problem here - perhaps it's reordering columns based on some internal logic? This leads to a situation where the ordering of columns is not consistent between train and test/predict runs.

This is how my integration tests are generated: https://github.com/jpmml/jpmml-r/blob/master/src/test/R/xgboost.R

The above code suggests that the predict function should work fine with a DMatrix that contains only feature columns. Perhaps you need to spell out the name of the newdata attribute?

predict(xgb_bl1, newdata = dtest)
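As an aside, a well-documented hazard of data.matrix() on mixed data is factor coercion: each factor column is silently replaced by its internal integer level codes. A minimal sketch (hypothetical column names):

```r
# data.matrix() replaces factor columns with their integer level codes
df <- data.frame(x = c(1.5, 2.5),
                 city = factor(c("London", "New York")))
data.matrix(df)
#        x city
# [1,] 1.5    1
# [2,] 2.5    2
```

The model then sees "city" as a single continuous number, whereas the genDMatrix()/genFMap() workflow expands it into one binary column per level; the two encodings cannot produce matching predictions.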
vruusmann commented 5 years ago

The r2pmml::genDMatrix() function is not particularly efficient nor elegant, but based on my experience it works much more reliably than an in-memory data.frame -> matrix -> DMatrix conversion workflow.

Perhaps this function should be rewritten in Java to make it scale for bigger datasets (IIRC, the current R implementation didn't scale beyond 10k data rows).

zakirovde commented 5 years ago

The r2pmml::genDMatrix() function is not particularly efficient nor elegant, but based on my experience it works much more reliably than an in-memory data.frame -> matrix -> DMatrix conversion workflow.

But it's hard to understand the r2pmml::genDMatrix() function's behaviour, because, as I get it, we can't see a table view of a dataset converted by this function. And the strangest thing is that when I create such a dataset:

dtrain <- genDMatrix(df_y = label, df_X = train[,leave])

leave is the list of 26 variables, so we should get a DMatrix of 26 variables and one target variable. After that I train an xgboost model on that dataset and try to check importances:

xgb.importance(colnames(train[,leave]), model=xgb_bl1)

I get an error:

Error in View : feature_names has less elements than there are features used in the model

Just for a test I ran this code:

xgb.importance(colnames(train[,1:200]), model=xgb_bl1)

And then I got a list of the 131 most important variables! Where did it get such a number, though I passed only 27 variables (including one target) to the DMatrix train dataset?

And am I right that when I want to pass a test dataset to the final PMML model, I should first convert it using genDMatrix, and only after that pass it to the model?

BTW, I didn't see any difference with or without the newdata argument.

vruusmann commented 5 years ago

so we should get DMatrix of 26 variables and one target variable.

Keep your features in one data.frame object, and the label column in another. IIRC, the xgboost() function allows you to specify X and y attributes separately; if so, pass X as a DMatrix (generated using the r2pmml::genDMatrix() function), and y as a suitable vector object type.

There's no need to append y to the feature matrix.

Where did it get such number, though I passed only 27 variables (including one target) to DMatrix train dataset?

It's because your dataset contains some categorical features. The r2pmml::genDMatrix() function expands a single categorical feature column to multiple binary feature columns (one for each category level). If you observe the feature map definition as produced by the r2pmml::genFMap() function, then you will find a data.frame with 131 columns also.

TLDR: The XGBoost package specifies a very difficult data input/output interface. My genFMap() and genDMatrix() utility functions provide a slow but correct way of working with it. Unless you understand the internals of the XGBoost package very well, do not attempt any shortcuts (such as using data.matrix instead of DMatrix). Also, shortcuts that are valid with continuous features stop working when there are categorical features in the dataset.
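The one-binary-column-per-level expansion can be observed directly from the feature map. A minimal sketch with made-up column names; the exact layout of the object returned by genFMap() may differ, but the row count should reflect the expansion:

```r
library("r2pmml")

# One continuous column plus one factor column with two levels
X <- data.frame(
  age  = c(23, 57, 41),
  city = factor(c("London", "New York", "London"))
)

fmap <- genFMap(X)
# Expect one entry for "age" plus one indicator entry per "city" level,
# i.e. 1 + 2 = 3 feature map rows rather than 2
print(fmap)
```

This is exactly why a 27-column data.frame with several multi-level factors can legitimately expand to 131 booster features.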

vruusmann commented 5 years ago

Another point - the conversion to PMML is not broken. It simply points out that you were using the XGBoost package incorrectly (because your dataset is a mix of continuous and categorical features).

zakirovde commented 5 years ago

Keep your features in one data.frame object, and the label column in another. IIRC, the xgboost() function allows you to specify X and y attributes separately; if so, pass X as a DMatrix (generated using the r2pmml::genDMatrix() function), and y as a suitable vector object type.

There's no need to append y to the feature matrix.

Do you mean something like this?

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
label <- as.numeric(train$target_bin)-1
xgb_bl1 <- xgboost(data = data, label=label,
                  eta = .15,
                  max_depth = 5, 
                  nround=300, 
                  subsample = 0.65,
                  colsample_bytree = 0.35,
                  objective = "binary:logistic"
                  )
dtest <- genDMatrix(df_X = test[,leave], df_y = NULL)
predict(xgb_bl1, newdata=dtest)
vruusmann commented 5 years ago

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave]) xgb_bl1 <- xgboost(data = data, label=label, ..)

Should be xgboost(data = dtrain, label = label, ..)

zakirovde commented 5 years ago

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave]) xgb_bl1 <- xgboost(data = data, label=label, ..)

Should be xgboost(data = dtrain, label = label, ..)

Sure, just a misspelling. Unfortunately, in this case xgboost ignores the label:

Warning message:
In xgb.get.DMatrix(data, label, missing, weight) :
  xgboost: label will be ignored.
vruusmann commented 5 years ago

Unfortunately, in this case xgboost ignores label:

So what? You're supplying a label directly to the xgboost() function using the label attribute.

zakirovde commented 5 years ago

Unfortunately, in this case xgboost ignores label:

So what? You're supplying a label directly to the xgboost() function using the label attribute.

Yes, I declare the label in the xgboost call, but still get that warning:

dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
xgb_bl1 <- xgboost(data = dtrain, label=train$target_bin,
.....
zakirovde commented 5 years ago

Hi, Villu.

After a couple of tests I've just found out that if I drop all the factor variables, leaving only numeric ones, then I get the same prediction results from both the R and PMML models. But if I use factor vars, then the results become different. Do you know if I should prepare categorical variables in some special way, or is there a mistake in the r2pmml package's work? My factor variables are like ones and zeros, and more complicated ones, like city number (e.g., 1 is London, 2 is New York, etc.).

vruusmann commented 5 years ago

Do you know if I should prepare categorical variables in some special way, or is there a mistake in the r2pmml package's work?

Categorical variables should use R's factor data type. Both r2pmml::genFMap() and r2pmml::genDMatrix() then use this information to generate proper feature map and DMatrix objects.
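A quick way to verify this is to check the class of every column before generating the feature map; character columns would need an explicit conversion. A sketch with a hypothetical column name:

```r
# Inspect the data type of each feature column
sapply(train[, leave], class)

# Convert any character column to a factor BEFORE calling
# genFMap()/genDMatrix(), so that its levels are recorded
train$city <- as.factor(train$city)  # "city" is a hypothetical column name
```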

Do your categorical columns use the factor data type? Or do they use character?

Did you bother to look into my R-XGBoost to PMML integration tests (linked above)? They use categorical features, and their results are fully reproducible between R-XGBoost and (J)PMML.

zakirovde commented 5 years ago

Yes, they are factors. And for sure I've read your test examples. The thing is, when I use, e.g., a factor variable which can only equal 0 or 1, then I get a wrong answer from the PMML model. But when I turn it into a numeric one, train$var <- as.numeric(train$var)-1, then I get the correct answer. That's why I'm so confused.

vruusmann commented 5 years ago

The thing is, when I use, e.g., a factor variable which can only equal 0 or 1, then I get a wrong answer from the PMML model.

Now that's an interesting claim - need to check this behaviour myself.

I do intend to write a small technical article about "Converting R-XGBoost to PMML" fairly soon. Will use this issue and all its comments as a reference material.

zakirovde commented 5 years ago

So, for now, I'll leave only numerics and 0/1 factors. Hope the factor behaviour gets sorted out soon. Thanks a lot for your brilliant packages and quick replies!