loft-br / xgboost-survival-embeddings

Improving XGBoost survival analysis with embeddings and debiased estimators
https://loft-br.github.io/xgboost-survival-embeddings/
Apache License 2.0

Ability to pass separate training data sets in for xgboost model vs survival model #56

Open crew102 opened 2 years ago

crew102 commented 2 years ago

I'm wondering if you've considered allowing the user to pass in separate training sets for the xgboost model and the survival model?

For example, in XGBSEStackedWeibull, the current state is this:

  1. Train xgboost on X_train, y_train
  2. Predict back on X_train using model from (1), resulting in risk scores
  3. Train Weibull AFT model with risk scores from (2) and y_train

I'm proposing this:

  1. Train xgboost on X_train, y_train
  2. Predict risk scores of X_train_2 using model from (1)
  3. Train Weibull AFT model using risk scores from (2) and y_train_2

The rationale for using different datasets for the two models is that it reduces the chance of overfitting. I've found that the risk scores from step 2 of the current approach indicate a tighter relationship between risk score and y_train than actually exists, because we are predicting back on the same dataset the xgboost model was trained on and then relating those predictions to the original outcome variable, y_train.
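
The effect can be illustrated with a toy sketch that uses only the Python standard library. Everything in it is an assumption made for illustration: MemorizingModel is a hypothetical 1-nearest-neighbour stand-in for an overfit first-stage model (it is not xgboost), the data are synthetic, y is treated as an uncensored log survival time, and the second-stage Weibull AFT fit is not performed. The sketch only compares the score-to-outcome relationship that a second-stage model would see under the current flow versus the proposed one.

```python
import random


class MemorizingModel:
    """Hypothetical stand-in for an overfit first-stage model (1-nearest-neighbour)."""

    def fit(self, X, y):
        self._X, self._y = list(X), list(y)
        return self

    def predict(self, X):
        # Return the outcome of the closest training point for each input.
        return [
            self._y[min(range(len(self._X)), key=lambda i: abs(self._X[i] - x))]
            for x in X
        ]


def make_data(n, rng):
    """Synthetic data: one feature, outcome = weak linear signal + noise."""
    X = [rng.uniform(0, 1) for _ in range(n)]
    y = [-3.0 * x + rng.gauss(0, 1.0) for x in X]
    return X, y


def pearson(a, b):
    """Plain Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


rng = random.Random(0)
X_train, y_train = make_data(200, rng)      # used to fit the first stage
X_train_2, y_train_2 = make_data(200, rng)  # separate set for the second stage

stage1 = MemorizingModel().fit(X_train, y_train)

# Current flow: score the same rows the first stage was trained on.
scores_in = stage1.predict(X_train)
# Proposed flow: score rows the first stage has never seen.
scores_out = stage1.predict(X_train_2)

corr_in = pearson(scores_in, y_train)
corr_out = pearson(scores_out, y_train_2)

print("in-sample score/outcome correlation:", round(corr_in, 3))
print("held-out score/outcome correlation:", round(corr_out, 3))
```

In this toy setup the in-sample scores reproduce y_train almost exactly, so a second-stage model fit on them would learn an overly optimistic relationship, while the scores on X_train_2 reflect how the first stage behaves on unseen data. A real xgboost model overfits less severely than a 1-nearest-neighbour memorizer, so the gap would typically be smaller.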

Thanks for the awesome package

davivieirab commented 2 years ago

Thanks for the suggestion, @crew102. We are currently working on a way to replace the first-step xgboost model with a pre-trained one. Both the XGBSEDebiasedBCE and XGBSEStackedWeibull modules will be able to use this feature, which will cover your use case.