slds-lmu / paper_2023_survival_benchmark

Benchmark for Burk et al. (2024)
https://projects.lukasburk.de/survival_benchmark/
GNU General Public License v3.0

A Large-Scale Neutral Comparison Study of Survival Models on Low-Dimensional Data

L. Burk, J. Zobolas, B. Bischl, A. Bender, M. N. Wright, and R. Sonabend, “A Large-Scale Neutral Comparison Study of Survival Models on Low-Dimensional Data.” arXiv, Jun. 06, 2024. doi: 10.48550/arXiv.2406.04098.

:warning: A note on versioning: This repository is under active development. To view its state at the time the arXiv preprint was submitted, please browse this tag on GitHub.

Setup

The benchmark is conducted using R and the mlr3 framework. The following files are necessary to set up the benchmark:

Reproducibility

Because the BenchmarkResult (bmr) objects produced by aggregating the batchtools registry are very large, this repository contains only the processed result files (./results/registry_beartooth/) required to produce the main results of the paper.

Results

Results are available online at projects.lukasburk.de/survival_benchmark/

The site is generated from the Quarto site in ./site/.

Datasets

The datasets used in the benchmark are stored, after minor modifications, in ./datasets/ and are also uploaded to OpenML. The dataset names, source packages, and OpenML dataset IDs are stored in ./dataset_table.[csv|rds].
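As a minimal sketch of how such a table can map a dataset name to its OpenML ID, the snippet below uses a small stand-in data frame holding the six rows that the example output further down prints from the real table. The helper `get_dataset_id()` is hypothetical and not part of this repository; only the column names `dataset` and `dataset_id` come from the table itself.

```r
# Stand-in for dataset_table.[csv|rds]: six rows copied from the example output below
dataset_tbl = data.frame(
  dataset = c("gbsg", "metabric", "support", "colrec", "rdata", "aids.id"),
  dataset_id = c(46131L, 46142L, 46144L, 46145L, 46146L, 46130L)
)

# Hypothetical helper (not part of the repository):
# return the OpenML dataset ID for exactly one matching dataset name
get_dataset_id = function(name, tbl = dataset_tbl) {
  id = tbl$dataset_id[tbl$dataset == name]
  if (length(id) != 1L) {
    stop("Expected exactly one match for dataset '", name, "', found ", length(id))
  }
  id
}

get_dataset_id("colrec")
#> [1] 46145
```

Failing loudly on zero or multiple matches avoids silently passing an empty or ambiguous ID on to a download call.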

Here is a short example of how to download the datasets from OpenML using {mlr3oml}:

# Get dataset from openml
library(mlr3oml)
library(mlr3proba) # To create survival tasks via as_task_surv

# The 'qs' package is required for caching the downloaded data
if (requireNamespace("qs", quietly = TRUE)) {
  options(mlr3oml.cache = TRUE)
}

# Get the table of datasets & their OpenML IDs
dataset_tbl = readRDS(here::here("dataset_table.rds"))
head(dataset_tbl[, c("dataset", "dataset_id")])
#>    dataset dataset_id
#> 1     gbsg      46131
#> 2 metabric      46142
#> 3  support      46144
#> 4   colrec      46145
#> 5    rdata      46146
#> 6  aids.id      46130

# Get an individual dataset in the OMLData class
colrec_odt = mlr3oml::odt(46145)

# Download each dataset and convert its OMLData object to a TaskSurv,
# collecting the results in a list of mlr3 TaskSurv objects
task_list = lapply(dataset_tbl$dataset_id, function(id) {
  dat = mlr3oml::odt(id)

  task = mlr3proba::as_task_surv(
    mlr3::as_data_backend(dat),
    target = "time", event = "status", id = dat$name
  )
  # Use the event indicator as a stratification variable for resampling
  task$set_col_roles("status", add_to = "stratum")
  Sys.sleep(0.1) # Small timeout to not hammer the OML server
  task
})

task_list[[1]]
#> <TaskSurv:gbsg> (2232 x 9)
#> * Target: time, status
#> * Properties: strata
#> * Features (7):
#>   - int (7): x0, x1, x2, x3, x4, x5, x6
#> * Strata: status
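
The `set_col_roles("status", add_to = "stratum")` call above marks the event indicator as a stratification variable, so that mlr3 resamplings draw their splits within each status group. The following base-R sketch only illustrates that idea on made-up toy data; it is not mlr3's implementation and does not reflect the resampling setup used in the benchmark. The function `assign_folds()` and the `toy` data frame are hypothetical.

```r
# Toy data (made up): 12 observations, 4 events (status = 1), 8 censored (status = 0)
toy = data.frame(time = 1:12, status = rep(c(1, 0, 0), 4))

# Hypothetical helper: assign fold labels separately within each status group,
# so every fold receives a similar share of events and censored observations
assign_folds = function(status, folds = 2) {
  fold = integer(length(status))
  for (s in unique(status)) {
    idx = which(status == s)
    labels = rep_len(seq_len(folds), length(idx))
    # Shuffle the labels by index to avoid the sample(n) pitfall for length-1 input
    fold[idx] = labels[sample.int(length(labels))]
  }
  fold
}

set.seed(1)
toy$fold = assign_folds(toy$status, folds = 2)

# Cross-tabulate folds against status: each fold holds 4 censored and 2 events
table(fold = toy$fold, status = toy$status)
```

Without stratification, a random split of a small or heavily censored dataset could leave a fold with very few events, which makes survival metrics on that fold unstable.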