kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0

Possibility to skip the sampsize check for by.user samples #402

Closed sibipx closed 7 months ago

sibipx commented 7 months ago

Is there any possibility to skip the check on sampsize for user-specified samples?

    ## ntree should be coherent with the sample provided
    ntree <- ncol(samp)
    sampsize <- colSums(samp)
    if (sum(sampsize == sampsize[1]) != ntree) {
      stop("sampsize must be identical for each tree")
    }

I have data from multiple patient encounters, and the encounters have different lengths. I want a full patient encounter to fall either in-bag or out-of-bag, because this is how predictions will be made in reality (predictions will be made for new patient encounters). Moreover, if observations from the same patient fall both in-bag and out-of-bag, the forest will be encouraged to "memorize" the patient (based on some important baseline features). I therefore want to define the in-bag and out-of-bag samples based on patient encounters, for example:

tree 1
    in-bag: P1 (30 observations), P3 (10 observations), ...
    out-of-bag: P2 (25 observations), ...
tree 2
    in-bag: P2 (25 observations), P3 (10 observations), ...
    out-of-bag: P1 (30 observations), ...

I used this kind of custom sampling with ranger and it worked just fine, but what I really need is to build competing risks models, for which I want to use rfsrc. (I need to make dynamic predictions for every patient day: day 1, day 2, day 3, ... AND for different horizons in the future: next 7 days, next 14 days, ... in a competing events framework.)

I could try to overwrite the function code in some way (I haven't tried it yet), but I am wondering whether it would create problems further down the code...

Any advice is useful. Thanks!

sibipx commented 7 months ago

I am adding a concrete example based on simulated data and a binary outcome (to keep it simple):

library(randomForestSRC)
set.seed(2023)

n_patients <- 5000
n_tree <- 1000
# each patient encounter lasts between 1 and 6 days
pat_length <- sample.int(n = 6, size = n_patients, replace = TRUE)

pat_id <- lapply(1:length(pat_length), function(x) rep(x, pat_length[x]))
pat_day <- lapply(1:length(pat_length), function(x) cumsum(rep(1, length(pat_id[[x]]))))
pat_id <- unlist(pat_id)
pat_day <- unlist(pat_day)
X1 <- rnorm(length(pat_id), 0, 1)
X2 <- rnorm(length(pat_id), 0, 1)
# the outcome is constant within an encounter (~10% event rate)
bin_outcome <- unlist(lapply(1:length(pat_length), 
                             function(x) rep(sample(c("event", "no_event"), 1, prob = c(0.1, 0.9)), pat_length[x])))
bin_outcome <- as.factor(bin_outcome)

df <- data.frame(pat_id, pat_day, X1, X2, bin_outcome)

head(df, 20)

# create per-tree in-bag indicators (same format I used for ranger):
# a whole patient encounter is either fully in-bag or fully out-of-bag
create_inbag <- function(adm_ids, sample_fraction, num_trees){

  adm_ids_unique <- unique(adm_ids)
  n_id <- length(adm_ids_unique)

  inbags <- list()
  for (i in 1:num_trees){
    s <- sample(adm_ids_unique, round(n_id * sample_fraction), replace = FALSE)
    inbags[[i]] <- as.integer(adm_ids %in% s)
  }

  return(inbags)
}

inbags <- create_inbag(df$pat_id, 0.5, n_tree)
# convert the list to an n_obs x n_tree matrix, as rfsrc expects for samp
inbags <- matrix(unlist(inbags), nrow=length(inbags[[1]]))

RF_obj <- rfsrc(bin_outcome ~ ., df,
                ntree = n_tree,
                # specify subsamples
                bootstrap = "by.user",
                samp = inbags,
                importance = "none")
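
Running this reproduces the check quoted at the top of the issue, because the per-tree in-bag sizes (the column sums of inbags) differ:

    # per-tree in-bag sizes are not all equal, so rfsrc() stops with
    # "sampsize must be identical for each tree"
    table(colSums(inbags))

For comparison, a sketch of how the same sampling plugs into ranger, which accepts inbag as a list with one vector of in-bag counts per tree (so unequal in-bag sizes are fine); this continues from the create_inbag() output above:

    library(ranger)
    inbags_list <- create_inbag(df$pat_id, 0.5, n_tree)
    RF_ranger <- ranger(bin_outcome ~ ., data = df,
                        num.trees = n_tree,
                        inbag = inbags_list,  # one 0/1 count vector per tree
                        keep.inbag = TRUE)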

sibipx commented 7 months ago

In the meantime, I have explored 3 possible ways to work around the issue for model tuning (preferably based on the OOB error).

  1. Build 1000 individual trees, each on one of the unequal-size "inbags", keep the OOB predictions, average the OOB predictions to get a final OOB prediction, and calculate my metrics on it. I used foreach to parallelize over the 1000 trees (see the sketch after this list).
    • the most elegant solution
    • very slow
  2. Use cross-validation with custom folds for tuning instead of OOB, parallelized over the folds
    • I "waste" some sample size (I have a big sample size but a rare event, so in terms of the number of events in the sample I am not so "rich")
    • faster than building individual trees but still slower than OOB tuning
  3. "Crop" the "inbags" to the minimum size of all inbags by randomly sampling out some observations
    • least elegant but follows the package implementation most closely
    • the observations sampled out will belong to patients in the inbag, so I still have a low chance of OOB bias (but I will explore the bias)
    • fastest
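
A sketch of option 1, assuming the df and inbags matrix from the example above, and assuming that rfsrc's predicted.oob returns NA for rows that were in-bag for the single tree (a plain loop here; foreach would parallelize it):

    # option 1 (sketch): grow one tree per in-bag column and average the
    # per-tree OOB event probabilities
    oob_pred <- matrix(NA_real_, nrow = nrow(df), ncol = n_tree)
    for (i in 1:n_tree) {
      tree_i <- rfsrc(bin_outcome ~ ., df,
                      ntree = 1,
                      bootstrap = "by.user",
                      samp = inbags[, i, drop = FALSE],
                      importance = "none")
      oob_pred[, i] <- tree_i$predicted.oob[, "event"]
    }
    # average over the trees for which each row was out-of-bag
    oob_event_prob <- rowMeans(oob_pred, na.rm = TRUE)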

I have chosen to continue with the 3rd solution: "crop" the "inbags" to the minimum size. I would appreciate it if you'd share any thoughts, either on other ways to work around the issue or on making any solution faster in terms of runtime.
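
For reference, a minimal sketch of the cropping, continuing from the inbags matrix above (the random dropping scheme is just one way to do it):

    # option 3 (sketch): crop every tree's in-bag to the minimum in-bag
    # size by randomly removing observations from the larger in-bags
    crop_inbags <- function(inbags) {
      min_size <- min(colSums(inbags))
      for (i in 1:ncol(inbags)) {
        in_idx <- which(inbags[, i] == 1)
        n_drop <- length(in_idx) - min_size
        if (n_drop > 0) inbags[sample(in_idx, n_drop), i] <- 0
      }
      inbags
    }

    inbags_cropped <- crop_inbags(inbags)
    # all trees now have the same in-bag size, so the sampsize check passes
    stopifnot(length(unique(colSums(inbags_cropped))) == 1)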

I have to mention that I have a large sample size (> 150,000 observations, > 200 features) but few events of interest. I aim to compare different models (binary, multinomial, survival, competing risks) in prediction settings. I run everything on HPC, but still, it takes a while. That's why I chose the fastest solution.

Thanks!

kogalur commented 7 months ago

I think your third option is the best. Unfortunately, the requirement that the sample size be constant is hard coded on the C-side. It will be impossible to work around or remove this limitation without significant modification to the code base. We don't really have this on our list of pressing coding campaigns.

sibipx commented 7 months ago

With a little disappointment in my heart, I thank you for the quick answer. I will continue with option 3 and close the issue. Also, thank you for a package with so much functionality for the user!