One can look at the caret implementation of xgboost via
caret::getModelInfo("xgbTree", FALSE)[[1]]
I do not see any offset being used explicitly. One idea, though, might be to define your own xgboost method that accepts an additional parameter:
my_xgbTree <- caret::getModelInfo("xgbTree", FALSE)[[1]]
my_xgbTree$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  # same code as caret::getModelInfo("xgbTree", FALSE)[[1]]$fit,
  # but with these additional lines:
  theDots <- list(...)
  offset  <- theDots$offset
  if (!is.null(offset))
    xgboost::setinfo(x, "base_margin", offset)
}
offset is a parameter that you supply to train(), so it will be included in theDots:
train(..., method = my_xgbTree, offset = ...)
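For concreteness, here is a rough sketch of what the full modified fit function and the corresponding train() call might look like. This is a simplified stand-in for caret's real xgbTree fit code (which also handles classification, class probabilities, and case weights); the Poisson objective and the names X, y, and exposure are assumptions for a claim-frequency setting.

```r
library(caret)
library(xgboost)

my_xgbTree <- getModelInfo("xgbTree", FALSE)[[1]]
my_xgbTree$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  theDots <- list(...)
  # Build the training matrix and inject the offset as a base margin
  dtrain <- xgb.DMatrix(as.matrix(x), label = y)
  if (!is.null(theDots$offset))
    setinfo(dtrain, "base_margin", theDots$offset)
  # `param` holds caret's xgbTree tuning parameters for this candidate model
  xgb.train(
    params = list(eta              = param$eta,
                  max_depth        = param$max_depth,
                  gamma            = param$gamma,
                  colsample_bytree = param$colsample_bytree,
                  min_child_weight = param$min_child_weight,
                  subsample        = param$subsample,
                  objective        = "count:poisson"),  # assumption: frequency model
    data    = dtrain,
    nrounds = param$nrounds
  )
}

# Hypothetical data: X a predictor matrix, y claim counts, exposure > 0.
# Extra train() arguments are passed through to fit() and land in theDots.
# Caveat (see the follow-up below): during resampling caret subsets x and y
# but not this offset vector, so the rows will not line up.
fit <- train(x = X, y = y,
             method    = my_xgbTree,
             trControl = trainControl(method = "cv", number = 5),
             tuneLength = 3,
             offset    = log(exposure))
```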
I'd suggest trying the method on simulated data first, so that you know whether or not it is working.
Thanks, it took me longer than I care to admit, but I got it :)
After trying this out myself, I realized that this is not right because the offset is not aligned correctly (the rows are permuted by the resampling). I did not find any solution other than using weights ("wts") to handle the offset. Unfortunately, there seems to be no better option in caret :(
I suggest checking your results against a simple Poisson GLM, which requires no tuning. If train() and glm() indeed give the same parameters, I'd be happy to see your solution :)
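Here is a minimal sketch of that sanity check, assuming a data frame dat with a count response numclaims, an exposure column, and predictors x1 and x2 (all hypothetical names). For a Poisson model with log link, putting log(exposure) in the offset of the count model yields the same coefficients as modelling the rate numclaims/exposure with case weights equal to exposure, which is essentially the "wts" workaround described above.

```r
# Count model with an explicit offset
glm_offset <- glm(numclaims ~ x1 + x2 + offset(log(exposure)),
                  family = poisson, data = dat)

# Equivalent rate model using case weights instead of an offset
# (glm() warns about non-integer responses here, but the estimates are valid)
glm_rate <- glm(numclaims / exposure ~ x1 + x2,
                family = poisson, weights = exposure, data = dat)

all.equal(coef(glm_offset), coef(glm_rate))  # should be TRUE
```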
The newer modeling system that we're developing would be able to handle this case. recipes would allow you to carry the offset variable throughout the resampling process. This would make a good case study for that.
I'm at a conference right now, but I'll try to use your example to demonstrate how this works in a vignette. The new system doesn't have parameter tuning implemented yet, though...
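As a rough sketch of that idea (using the same hypothetical dat, numclaims, and exposure names as above; the role label is an arbitrary string, and downstream code would still need to pull the column out itself):

```r
library(recipes)

# Giving the exposure column a custom, non-predictor role keeps it attached
# to the data, so it travels through resampling row-for-row instead of
# being treated as a feature.
rec <- recipe(numclaims ~ ., data = dat) %>%
  update_role(exposure, new_role = "offset")

summary(rec)  # exposure is now listed with role "offset"
```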
Out of curiosity, why use a Poisson assumption when the data are binary? Did you mean to use numclaims?
Also, can you explain what the purpose of myConstraint is?
Hi Max, Two good questions :)
I should have used numclaims indeed, my bad. I made that mistake because of the way another dataset I use is constructed.
The monotone_constraints parameter does not help in this example because there is no reason for the frequency of crashes to increase with vehicle value. It would be very useful if there were a "miles driven per week" variable instead, or if we were modelling the severity (dollar value) of the claims (see the sketch below for how such a constraint is passed).
best, Simon
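For readers unfamiliar with the parameter, here is a small hypothetical sketch of how such a constraint could be passed to xgboost. dtrain is assumed to be an existing xgb.DMatrix whose feature matrix has columns (veh_value, miles_per_week), with miles_per_week being the made-up variable mentioned above; the constraint string has one entry per column, where 1 forces a non-decreasing effect, -1 a non-increasing one, and 0 leaves the feature unconstrained.

```r
# "(0,1)" means: veh_value unconstrained, fitted frequency non-decreasing
# in miles_per_week (entries follow the column order of the feature matrix)
params <- list(objective            = "count:poisson",
               monotone_constraints = "(0,1)")

bst <- xgboost::xgb.train(params = params, data = dtrain, nrounds = 100)
```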
Hi Max, I'm wondering if you ended up doing that vignette. Cheers
Nope. Not yet. Just created an issue though. tidymodels/recipes#210
Hi!
I am piggybacking on this issue (https://github.com/topepo/caret/issues/507), which asked about both offsets and early_stopping; the answer was that one would need to create a custom method.
I am re-opening this question because it appears that offsets ("base_margin") are used by caret when we feed caret::train an "xgb.DMatrix" object; at least, the predictions are strongly affected by the offsets.
However, it appears that the offsets are not used when saving the out-of-fold predictions generated by k-fold cross-validation.
If that is correct, is there a way to extract them? Maybe by multiplying the result by offset / mean(offset)?
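In case it helps, one way to get at those out-of-fold predictions is the pred slot of the fitted train object. This sketch assumes train() was run with trainControl(savePredictions = "final") and that offset_vec holds the offsets in original row order; the rescaling is only the guess suggested above, not a verified fix.

```r
oof <- fit$pred                    # held-out predictions saved by caret
oof <- oof[order(oof$rowIndex), ]  # rowIndex maps back to the training rows

# Speculative correction along the lines suggested above
oof$pred_rescaled <- oof$pred * offset_vec[oof$rowIndex] / mean(offset_vec)
```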
Here is some example code, largely inspired by the code in issue #507.