kogalur / randomForestSRC

minimal object for prediction, or how to remove sensitive xvar data #91

Closed rkb965 closed 3 years ago

rkb965 commented 3 years ago

Hi, thanks for writing this! Huge fan of the package.

I would like to create the minimal object needed for prediction. I want to share an rfsrc object so that someone else can use it for prediction, but I am dealing with sensitive data that cannot be shared, so I need to strip out all of the data values that were used for training. Is this possible?

library(randomForestSRC)
library(dplyr)

data(pbc, package = "randomForestSRC")

# using all information
train <- sample(1:nrow(pbc), round(nrow(pbc) * .7))

obj <- rfsrc(Surv(days, status) ~ .,
             data = pbc[train, ])

pred.hasy <- predict(obj, pbc[-train,])
head(pred.hasy$yvar)
head(pred.hasy$survival)

pbc.test <- pbc[-train,]
pbc.test <- pbc.test %>% select(-days, -status)
pred.noy <- predict(obj, pbc.test)
head(pred.noy$yvar)
head(pred.noy$survival)

# this works but still seems to have xvar data in obj.trim$forest
obj.trim <- obj
obj.trim$xvar <- NULL
pred.trim <- predict(obj.trim, pbc.test)
head(pred.trim$survival)

# this does not work: predict() errors once forest$xvar is removed
obj.trim2 <- obj.trim
obj.trim2$forest$xvar <- NULL
pred.trim2 <- predict(obj.trim2, pbc.test)

The error message seems to indicate that the xvar is used to get information on factor levels. Is there some way to pass a single (synthetic) row of data there that contains the relevant levels? Are training data values stored anywhere else? Is there a more minimal version of the object that would contain enough information for prediction?

EDIT: I also want to do this to make the file as small as possible for sharing. Ideally I'd like to share the object via GitHub, but the resulting file is currently far too large.
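
For anyone wanting to see where the size comes from, here is a rough sketch using only base R (object.size, saveRDS, file.info). The list `fit` is a hypothetical stand-in so the snippet runs on its own; its component names and contents are made up for illustration and are not the real layout of an rfsrc object. With the example above you would pass `obj` instead.

```r
# Hypothetical stand-in for a fitted model object (NOT the real rfsrc layout).
# With the example above, replace `fit` with `obj`.
fit <- list(
  xvar   = data.frame(matrix(rnorm(5000), ncol = 10)),  # 500 x 10 numeric values
  forest = list(nodes = matrix(0L, nrow = 100, ncol = 5)),
  call   = quote(rfsrc(Surv(days, status) ~ ., data = pbc))
)

# In-memory size in bytes of each top-level component, largest first.
sizes <- sort(sapply(fit, function(x) as.numeric(object.size(x))),
              decreasing = TRUE)
print(sizes)

# Write the object with the strongest built-in compression and
# report the resulting file size in bytes.
path <- tempfile(fileext = ".rds")
saveRDS(fit, path, compress = "xz")
file_bytes <- file.info(path)$size
print(file_bytes)
```

This only reports sizes; it does not say which components predict() actually needs, which is the open question here.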

Really grateful for your time. Thank you for any help!

kogalur commented 3 years ago

We are closing this issue and have reopened issue #52 because this is essentially the same as that thread. I'll post my comments there.