Closed (sibipx closed this issue 7 months ago)
I am adding a concrete example based on simulated data with a binary outcome (to keep it simple):
library(randomForestSRC)
set.seed(2023)
n_patients <- 5000
n_tree <- 1000
pat_length <- sample.int(n=6, size=n_patients, replace = TRUE)
pat_id <- lapply(1:length(pat_length), function(x) rep(x, pat_length[x]))
pat_day <- lapply(1:length(pat_length), function(x) cumsum(rep(1, length(pat_id[[x]]))))
pat_id <- unlist(pat_id)
pat_day <- unlist(pat_day)
X1 <- rnorm(length(pat_id), 0, 1)
X2 <- rnorm(length(pat_id), 0, 1)
bin_outcome <- unlist(lapply(1:length(pat_length),
                             function(x) rep(sample(c("event", "no_event"), 1, prob = c(0.1, 0.9)), pat_length[x])))
bin_outcome <- as.factor(bin_outcome)
df <- data.frame(pat_id, pat_day, X1, X2, bin_outcome)
head(df, 20)
# create patient-level in-bag indicators (same approach as used with ranger)
create_inbag <- function(adm_ids, sample_fraction, num_trees) {
  adm_ids_unique <- unique(adm_ids)
  n_id <- length(adm_ids_unique)
  inbags <- list()
  for (i in seq_len(num_trees)) {
    # sample whole patients, then flag all of their rows as in-bag
    s <- sample(adm_ids_unique, round(n_id * sample_fraction), replace = FALSE)
    inbags[[i]] <- as.integer(adm_ids %in% s)
  }
  return(inbags)
}
inbags <- create_inbag(df$pat_id, 0.5, n_tree)
# turn the list into a matrix: one row per observation, one column per tree
inbags <- matrix(unlist(inbags), nrow = length(inbags[[1]]))
RF_obj <- rfsrc(bin_outcome ~ ., df,
                ntree = n_tree,
                # specify subsamples
                bootstrap = "by.user",
                samp = inbags,
                importance = "none")
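Why patient-level sampling collides with the constant sample size requirement can be seen by counting the in-bag rows per tree. The following is a minimal sketch on a smaller simulated data set (200 patients and 25 trees are arbitrary choices for illustration, not the settings above). It uses base R only.

```r
set.seed(2023)

# Simulate encounters of 1 to 6 days for 200 patients (illustrative sizes)
pat_length <- sample.int(n = 6, size = 200, replace = TRUE)
pat_id <- rep(seq_along(pat_length), times = pat_length)

# For each tree, sample half of the patients and flag all their rows as in-bag
n_tree <- 25
inbag <- sapply(seq_len(n_tree), function(i) {
  s <- sample(unique(pat_id), 100)
  as.integer(pat_id %in% s)
})

# Number of in-bag rows per tree: the number of patients is constant,
# but the number of rows is not, because encounter lengths differ
sizes <- colSums(inbag)
range(sizes)
```

Each column selects exactly 100 patients, yet the row counts in `sizes` vary from tree to tree.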
I have meanwhile explored 3 possible ways to work around the issue for model tuning (preferably based on OOB error).
I have chosen to continue with the 3rd solution: "crop" the in-bag samples to the minimum size. I would appreciate any thoughts, either on other ways to work around the issue or on making any of the solutions faster in terms of runtime.
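One possible reading of the "crop" idea, sketched in base R: find the smallest in-bag size across trees and randomly switch off surplus in-bag rows in every other tree. The helper name `crop_inbag` is hypothetical, and this is not necessarily the implementation used in this thread.

```r
# Hypothetical helper: trim every column of a 0/1 in-bag matrix
# (rows = observations, columns = trees) to the smallest column sum
crop_inbag <- function(inbag) {
  target <- min(colSums(inbag))
  for (j in seq_len(ncol(inbag))) {
    in_rows <- which(inbag[, j] > 0)
    n_drop <- length(in_rows) - target
    if (n_drop > 0) {
      # index via sample.int() so a single in-bag row is handled correctly
      drop <- in_rows[sample.int(length(in_rows), n_drop)]
      inbag[drop, j] <- 0L
    }
  }
  inbag
}

# Toy data: 4 patients with 3, 1, 2 and 4 rows, and 3 trees
set.seed(1)
ids <- rep(1:4, times = c(3, 1, 2, 4))
inbag <- cbind(
  as.integer(ids %in% c(1, 2)),
  as.integer(ids %in% c(3, 4)),
  as.integer(ids %in% c(1, 4))
)
cropped <- crop_inbag(inbag)
colSums(inbag)    # in-bag sizes before cropping
colSums(cropped)  # in-bag sizes after cropping
```

Because the surplus rows are dropped at random, a cropped tree no longer keeps every observation of an in-bag patient in-bag. Whether that matters depends on how the OOB error is used for tuning.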
I have to mention that I have a large sample size (more than 150,000 observations and more than 200 features) but few events of interest. I aim to compare different models (binary, multinomial, survival, competing risks) in prediction settings. I run everything on HPC, but it still takes a while. That's why I chose the fastest solution.
Thanks!
I think your third option is the best. Unfortunately, the requirement that the sample size be constant is hard coded on the C-side. It will be impossible to work around or remove this limitation without significant modification to the code base. We don't really have this on our list of pressing coding campaigns.
With a little disappointment in my heart, I thank you for the quick answer. I will continue with option 3 and close the issue. Also, thank you for a package with so much functionality for the user!
Is there any possibility to skip the check on sampsize for user-specified samples?
I have data for multiple patient encounters of different lengths. I want a full patient encounter to fall either in-bag or out-of-bag, because this is how predictions will be made in reality (predictions will be made for new patient encounters). Moreover, if observations from the same patient fall both in-bag and out-of-bag, it will encourage the forest to "memorize" the patient (based on some important baseline features). I therefore want to define the in-bag and out-of-bag samples based on patient encounters. For example:
tree 1
  in-bag: P1 (30 observations), P3 (10 observations), ...
  out-of-bag: P2 (25 observations), ...
tree 2
  in-bag: P2 (25 observations), P3 (10 observations), ...
  out-of-bag: P1 (30 observations), ...
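The requirement described above (a patient is either entirely in-bag or entirely out-of-bag in each tree) can be checked directly on an in-bag matrix. This is a small sketch in base R. The function name `is_grouped` is hypothetical, and the matrix layout (rows = observations, columns = trees) is an assumption.

```r
# Hypothetical check: TRUE when, in every tree, each id has all of its
# rows in-bag or all of its rows out-of-bag
is_grouped <- function(ids, inbag) {
  all(apply(inbag, 2, function(col) {
    # share of in-bag rows per id; must be exactly 0 or 1
    frac <- tapply(col > 0, ids, mean)
    all(frac %in% c(0, 1))
  }))
}

# Toy data: 3 patients with 2, 3 and 1 rows
ids <- c(1, 1, 2, 2, 2, 3)
good <- cbind(c(1, 1, 0, 0, 0, 1),   # tree 1: P1 and P3 in-bag
              c(0, 0, 1, 1, 1, 0))   # tree 2: P2 in-bag
bad <- cbind(c(1, 0, 0, 0, 0, 1))    # P1 is split between in-bag and OOB

is_grouped(ids, good)
is_grouped(ids, bad)
```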
I used this kind of custom sampling with ranger and it worked just fine, but what I really need is to build competing risks models, for which I want to use rfsrc. (I need to make dynamic predictions for every patient day: day 1, day 2, day 3, ... and for different horizons in the future: next 7 days, next 14 days, ... in a competing events framework.)
I could try to overwrite the function code in some way (I didn't try it yet), but I am wondering if it will create problems further down the code...
Any advice is useful. Thanks!