Closed zakirovde closed 4 years ago
Conversion goes OK, but instead of 26 variables I get only 14
TLDR: If you try to use the PMML document for scoring, do you get correct predictions or not?
It is a feature of the conversion (not a bug) that the PMML document only contains information about features that are truly needed for scoring. All no-op features are automatically excluded.
The thing is that the PMML model does not contain the 14 most important variables, but simply the first 14 variables from the list, like var1, var2, var3, ..., var14. So I get wrong predictions. Is there a way to turn off that feature, so the PMML model could contain all the variables I want it to have?
I don't like how you're constructing a feature matrix object here:
data = data.matrix(train[,leave] %>% select(-target_bin))
Define a proper `data.frame` object in one place (not once for the `xgboost()` function call, and another time for the `genFMap()` function call - they might give different results), and create a proper `DMatrix` object based on it using the `r2pmml::genDMatrix()` function.
I suspect that your `data.frame` objects are not consistent, and that doing a manual `data.matrix` conversion reorders data columns one more time.
Use the syntax provided in the README file (one central `data.frame`, then feed it to the `r2pmml::genDMatrix()` and `r2pmml::genFMap()` functions). This should work. Once you've verified this claim locally, only then start making your hacks.
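A minimal sketch of that README-style workflow (the data and column names here are invented for illustration; the `df_y`/`df_X` argument names follow the calls used elsewhere in this thread):

```r
# One central data.frame holds all features; the label lives in its own vector.
# NOTE: toy data, not the poster's actual dataset.
df <- data.frame(
  x1 = c(1.0, 2.5, 3.0, 4.5),
  x2 = factor(c("a", "b", "a", "b")),
  stringsAsFactors = TRUE
)
label <- c(0, 1, 0, 1)

if (requireNamespace("r2pmml", quietly = TRUE) &&
    requireNamespace("xgboost", quietly = TRUE)) {
  # Feature map and DMatrix are generated from the SAME data.frame,
  # so column order and factor encodings stay consistent.
  fmap <- r2pmml::genFMap(df)
  dtrain <- r2pmml::genDMatrix(df_y = label, df_X = df)
  model <- xgboost::xgboost(data = dtrain, nrounds = 10,
                            objective = "binary:logistic")
}
```

The point of the sketch is that `genFMap()` and `genDMatrix()` both see the same object, so there is no second ad hoc conversion that could reorder or re-encode columns.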
I don't like how you're constructing a feature matrix object here:
This genDMatrix thing works!
label <- as.numeric(train$target_bin)-1
data <- as.matrix(train[,leave])
mode(data) <- 'double'
dtrain <- genDMatrix(df_y = label, df_X = data)
xgb_bl1 <- xgboost(data=dtrain,
eta = .15,
max_depth = 5,
nround=300,
subsample = 0.65,
colsample_bytree = 0.35,
objective = "binary:logistic"
)
The PMML model contains all 26 variables, but somehow its probability results differ from the ones I get using the `predict` function:
predict(xgb_bl1, data.matrix(test[,leave]))
I'm not sure that this is the correct use of `predict`, so I tried even this:
dtest <- genDMatrix(df_X = test[,leave], df_y = NULL)
#if I don't declare df_y, then I get an error
predict(xgb_bl1, dtest)
But I still get a different output. Maybe I pass the test dataset to the PMML model in a wrong way? I just send it in JSON format. But still, comparing against `predict` with both the `genDMatrix` form and the source form, I get different results.
I suspect that the use of the `data.matrix()` function is the problem here - perhaps it's reordering columns based on some internal logic? This leads to a situation where the ordering of columns is not consistent between train and test/predict runs.
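One concrete way this can bite, independent of column ordering: `data.matrix()` silently replaces factor columns by their internal integer level codes, and those codes depend on the level set of each particular data.frame. A base-R sketch (toy data, not the poster's):

```r
# data.matrix() replaces factor columns by their internal level codes.
train_df <- data.frame(city = factor(c("London", "NewYork", "London")))
test_df  <- data.frame(city = factor(c("NewYork", "NewYork")))

data.matrix(train_df)  # "London" -> 1, "NewYork" -> 2
data.matrix(test_df)   # "NewYork" -> 1, because test_df has only one level!
```

So the same category can get a different numeric code between the train and test runs, which is exactly the kind of silent inconsistency described above.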
This is how my integration tests are generated: https://github.com/jpmml/jpmml-r/blob/master/src/test/R/xgboost.R
The above code suggests that the `predict` function should work fine with a `DMatrix` that contains only feature columns. Perhaps you need to spell out the name of the `newdata` attribute?
predict(xgb_bl1, newdata = dtest)
The `r2pmml::genDMatrix()` function is not particularly efficient nor elegant, but based on my experience it works much more reliably than an in-memory `data.frame` -> `matrix` -> `DMatrix` conversion workflow.
Perhaps this function should be rewritten in Java to make it scale for bigger datasets (IIRC, the current R implementation didn't scale beyond 10k data rows).
But it's hard to understand the `r2pmml::genDMatrix()` function's behaviour, because, as I get it, we can't see a table view of a dataset converted by this function. And the strangest thing is that when I create such a dataset:
dtrain <- genDMatrix(df_y = label, df_X = train[,leave])
`leave` is the list of 26 variables, so we should get a DMatrix of 26 variables and one target variable.
After that I train an xgboost model on that dataset and try to check feature importances:
xgb.importance(colnames(train[,leave]), model=xgb_bl1)
I get an error: Error in View : feature_names has less elements than there are features used in the model
Just for a test I run this code:
xgb.importance(colnames(train[,1:200]), model=xgb_bl1)
And then I get a list of the 131 most important variables! Where did it get such a number, when I passed only 27 variables (including one target) to the DMatrix train dataset?
And am I right that when I want to pass a test dataset to the final PMML model, I should first convert it using `genDMatrix`, and only after that pass it to the model? BTW, I didn't see any difference with or without `newdata`.
so we should get DMatrix of 26 variables and one target variable.
Keep your features in one `data.frame` object, and the label column in another. IIRC, the `xgboost()` function allows you to specify `X` and `y` attributes separately; if so, pass `X` as a `DMatrix` (generated using the `r2pmml::genDMatrix()` function), and `y` as a suitable vector object type. There's no need to append `y` to the feature matrix.
Where did it get such a number, when I passed only 27 variables (including one target) to the DMatrix train dataset?
It's because your dataset contains some categorical features. The `r2pmml::genDMatrix()` function expands a single categorical feature column into multiple binary feature columns (one for each category level). If you inspect the feature map definition produced by the `r2pmml::genFMap()` function, you will find a `data.frame` with 131 columns as well.
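The arithmetic can be checked without r2pmml: one-hot expansion yields one column per factor level, plus one column per continuous feature. A base-R sketch with toy data (the poster's 26 columns presumably expand to 131 the same way):

```r
# Each factor expands to nlevels() binary columns; numerics stay as one column.
df <- data.frame(
  age  = c(30, 40, 50),
  city = factor(c("London", "NewYork", "Paris")),
  flag = factor(c("0", "1", "0"))
)
expanded_cols <- sum(sapply(df, function(col) {
  if (is.factor(col)) nlevels(col) else 1L
}))
expanded_cols  # 1 + 3 + 2 = 6 columns after expansion
```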
TLDR: The XGBoost package specifies a very difficult data input/output interface. My `genFMap()` and `genDMatrix()` utility functions provide a slow but correct way of working with it. Unless you understand the internals of the XGBoost package very well, do not attempt any shortcuts (such as using `data.matrix` instead of `DMatrix`). Also, shortcuts that are valid with continuous features stop working when there are categorical features in the dataset.
Another point - the conversion to PMML is not broken. It simply points out that you were using the XGBoost package incorrectly (because your dataset is a mix of continuous and categorical features).
Do you mean something like this?
dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
label <- as.numeric(train$target_bin)-1
xgb_bl1 <- xgboost(data = data, label=label,
eta = .15,
max_depth = 5,
nround=300,
subsample = 0.65,
colsample_bytree = 0.35,
objective = "binary:logistic"
)
dtest <- genDMatrix(df_X = test[,leave], df_y = NULL)
predict(xgb_bl1, newdata=dtest)
dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
xgb_bl1 <- xgboost(data = data, label = label, ..)
Should be xgboost(data = dtrain, label = label, ..)
Sure, just misspelled. Unfortunately, in this case xgboost ignores the label:
Warning message:
In xgb.get.DMatrix(data, label, missing, weight) :
xgboost: label will be ignored.
Unfortunately, in this case xgboost ignores label:
So what? You're supplying the label directly to the `xgboost()` function using the `label` attribute.
Yes, I declare the label in the xgboost call, but still get an error:
dtrain <- genDMatrix(df_y = NULL, df_X = train[,leave])
xgb_bl1 <- xgboost(data = dtrain, label=train$target_bin,
.....
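For reference, that warning goes away when the label travels inside the DMatrix itself rather than via the `label=` argument - either by passing `df_y` to `genDMatrix()` as earlier in this thread, or by attaching it afterwards with xgboost's `setinfo()`. A sketch with toy data (assuming the xgboost package is installed):

```r
label <- c(0, 1, 0, 1)  # toy labels, not the poster's target_bin

if (requireNamespace("xgboost", quietly = TRUE)) {
  # Build a toy DMatrix and attach the label to it directly;
  # xgboost(data = dtrain, ...) then needs no label= argument.
  X <- matrix(rnorm(8), nrow = 4)
  dtrain <- xgboost::xgb.DMatrix(X)
  xgboost::setinfo(dtrain, "label", label)
}
```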
Hi, Villu.
After a couple of tests I've just found out that if I drop all the factor variables, leaving only the numeric ones, then I get the same prediction results in both the R and PMML models. But if I use factor variables, then the results become different.
Do you know if I should prepare categorical variables in some special way, or is there a mistake in the `r2pmml` package's work?
My factor variables range from simple ones and zeros to more complicated ones, like a city number (e.g., 1 is London, 2 is New York, etc.).
Do you know if I should prepare categorical variables in some special way, or is there a mistake in the r2pmml package's work?
Categorical variables should use R's `factor` data type. Both `r2pmml::genFMap()` and `r2pmml::genDMatrix()` then use this information to generate proper feature map and `DMatrix` objects.
Do your categorical columns use the `factor` data type, or do they use `character`?
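A quick base-R way to audit that (column names invented for the example):

```r
# List each column's class; character columns need an explicit conversion.
df <- data.frame(age = c(30, 40), city = c("London", "NewYork"),
                 stringsAsFactors = FALSE)
sapply(df, class)  # age: "numeric", city: "character"

# Convert character columns to factors before genFMap()/genDMatrix():
df[] <- lapply(df, function(col) if (is.character(col)) factor(col) else col)
sapply(df, class)  # city is now "factor"
```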
Did you bother to look into my R-XGBoost to PMML integration tests (linked above)? They use categorical features, and their results are fully reproducible between R-XGBoost and (J)PMML.
Yes, they are factor ones. And for sure I've read your test examples.
The thing is, when I use, e.g., a factor variable which can only equal 0 or 1, I get a wrong answer from the PMML model. But when I turn it into a numeric with `train$var <- as.numeric(train$var)-1`, then I get the correct answer. That's why I'm so confused.
The thing is when I use, e.g., factor variable which can be equal only 0 or 1, then I get wrong answer in PMML model.
Now that's an interesting claim - need to check this behaviour myself.
I do intend to write a small technical article about "Converting R-XGBoost to PMML" fairly soon. Will use this issue and all its comments as a reference material.
So, for now, I'll leave only numerics and 0/1 factors. Hope to solve the factor behaviour soon. Thanks a lot for your brilliant packages and quick replies!
Hi.
I'm trying to convert my xgboost model to the PMML format. The conversion goes OK, but instead of 26 variables I get only 14. The most interesting part is that they're the first 14 variables from the list. E.g., if I drop any variable from that top 14, the next, 15th variable takes its place. I'm totally confused. Could someone suggest a solution? Or maybe it's a bug?