jpmml / jpmml-sklearn

Java library and command-line application for converting Scikit-Learn pipelines to PMML
GNU Affero General Public License v3.0

Support for Scikit-Survival models that is compatible with Sklearn? #174

Open WeijiaZhang24 opened 2 years ago

WeijiaZhang24 commented 2 years ago

Is it possible to export models trained using Scikit-Survival (sksurv)? This is the repo for sksurv: https://github.com/sebp/scikit-survival

sksurv contains a RandomSurvivalForest algorithm, which extends RandomForest to right-censored survival data. In a standard RandomForest, the regression target y is a number, but in survival data the labels take the form [time, event_indicator]. If event_indicator == 1, then time is the same as y (the event is observed); however, when event_indicator == 0, we only know that y > time (the event has not been observed up to the recorded time).
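
To make the label structure concrete, here is a minimal sketch in plain Python with no sksurv dependency. The `SurvivalLabel` class and the `describe` helper are hypothetical names made up for illustration, and the numbers are invented; as far as I know, sksurv itself keeps these labels in a NumPy structured array rather than in a class like this.

```python
from typing import NamedTuple


class SurvivalLabel(NamedTuple):
    """Hypothetical container for one right-censored label (not a sksurv type)."""
    time: float   # observed time
    event: bool   # True: event observed at `time`; False: censored at `time`


def describe(label: SurvivalLabel) -> str:
    """Spell out what the label tells us about the true event time y."""
    if label.event:
        return f"y == {label.time}"   # event observed, so y is known exactly
    return f"y > {label.time}"        # censored, so only a lower bound is known


# Invented example data: two observed events and one censored observation.
labels = [
    SurvivalLabel(72.0, True),
    SurvivalLabel(411.0, True),
    SurvivalLabel(25.0, False),
]
descriptions = [describe(label) for label in labels]
```

The third label is the case that a plain regression target cannot express: the model only learns that the event happened some time after 25.0.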

Any help would be appreciated!

vruusmann commented 2 years ago

> Is it possible to export models trained using Scikit-Survival (sksurv)?

I'm going to explore your earlier XGBoost example a bit in order to gain a better understanding about the state-of-the-art in survival analysis.

The fundamental problem here is that "survival" appears to be a different endpoint than "regression".

The PMML specification does not provide a dedicated "survival" mining function type: https://dmg.org/pmml/v4-4-1/GeneralStructure.html#xsdType_MINING-FUNCTION

The obvious fix would be to define a new mining function type ourselves. I think it's safe to say that we cannot count on the Data Mining Group's help here, because they're largely non-operational (I'm still waiting for initial feedback on some feature requests that I posted to them over a year ago).

> sksurv contains a RandomSurvivalForest algorithm, which extends RandomForest to right-censored survival data.

The JPMML-SkLearn library already provides a PMML converter for the RandomForest class.

RandomSurvivalForest and RandomForest should use identical tree ensemble data structures. Therefore, it should be possible to build a PMML converter for RandomSurvivalForest by simply applying some post-processing to the RandomForest prediction.
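
As a rough idea of what such post-processing could look like, here is a sketch in plain Python. It rests on assumptions that are not confirmed in this thread: that each tree's leaf stores a cumulative hazard function (CHF) sampled on a shared time grid, that the ensemble averages those CHFs point by point, and that the scalar risk score is the sum of the averaged CHF. I believe this is close to how sksurv describes RandomSurvivalForest predictions, but treat it as an illustration. The function names and the numbers are made up.

```python
from typing import List, Sequence


def ensemble_chf(leaf_chfs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-tree cumulative hazard functions point by point.

    leaf_chfs[i][j] is the CHF value at the j-th time point, taken from the
    leaf that the sample reached in tree i (assumed layout).
    """
    n_trees = len(leaf_chfs)
    if n_trees == 0:
        raise ValueError("need at least one tree")
    n_times = len(leaf_chfs[0])
    if any(len(chf) != n_times for chf in leaf_chfs):
        raise ValueError("all trees must share the same time grid")
    return [sum(chf[j] for chf in leaf_chfs) / n_trees for j in range(n_times)]


def risk_score(leaf_chfs: Sequence[Sequence[float]]) -> float:
    """Collapse the averaged CHF into one number (assumed post-processing)."""
    return sum(ensemble_chf(leaf_chfs))


# Invented leaf values for one sample routed through three trees.
per_tree = [
    [0.0, 0.5, 1.0],
    [0.0, 0.25, 0.5],
    [0.0, 0.75, 1.5],
]
averaged = ensemble_chf(per_tree)
score = risk_score(per_tree)
```

The tree traversal itself is unchanged from a regular forest; only the leaf payload (a vector instead of a scalar) and the final aggregation differ.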

vruusmann commented 2 years ago

The PMML specification currently defines a "survival" endpoint for linear models (jump to the "Cox Regression Model Explanation and Examples" section): https://dmg.org/pmml/v4-4-1/GeneralRegression.html

This approach should be generalizable to other model types (e.g. decision tree ensembles).
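
For reference, the textbook Cox proportional hazards relationship is S(t | x) = exp(-H0(t) * exp(beta . x)), where H0(t) is the baseline cumulative hazard. The sketch below implements only that generic formula; it is not a transcription of the evaluation procedure in the PMML specification, and the function name, coefficients, and baseline value are invented for illustration.

```python
import math
from typing import Sequence


def cox_survival(baseline_cum_hazard: float,
                 coefficients: Sequence[float],
                 x: Sequence[float]) -> float:
    """Survival probability at a fixed time t under a Cox model.

    baseline_cum_hazard is H0(t) for the time of interest; the caller is
    assumed to have looked it up (e.g. from a baseline table).
    """
    if len(coefficients) != len(x):
        raise ValueError("coefficients and x must have the same length")
    linear_predictor = sum(b * v for b, v in zip(coefficients, x))
    return math.exp(-baseline_cum_hazard * math.exp(linear_predictor))


# Invented numbers: two features, baseline cumulative hazard 0.2 at time t.
beta = [0.5, -1.0]
s_reference = cox_survival(0.2, beta, [0.0, 0.0])   # linear predictor is 0
s_high_risk = cox_survival(0.2, beta, [2.0, 0.0])   # linear predictor is 1.0
```

Generalizing to tree ensembles would presumably mean replacing `linear_predictor` with a score produced by the ensemble, though that is speculation on how the extension might be designed.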

WeijiaZhang24 commented 2 years ago

I found that an older version of the R package "pmml" can export a Random Survival Forest constructed by an old version of the "randomForestSRC" package. I'm not sure why later versions of the pmml R package removed this function.

I can help with Python and R code related to survival analysis, but I'm not familiar with the PMML format... Here is working R code to reproduce this older converter. (The documentation for pmml 1.5.4 can be found at https://mran.microsoft.com/snapshot/2018-02-12/web/packages/pmml/pmml.pdf)

install.packages("remotes")
library("remotes")
install_version("randomForestSRC", "2.5.0")
install_version("pmml", "1.5.4")
library(pmml)
library(randomForestSRC)

data(veteran)
veteran.out <- rfsrc(Surv(time, status) ~ ., data = veteran, ntree = 5, forest = TRUE, membership = TRUE)
pmml.rfsrc(veteran.out)

vruusmann commented 2 years ago

> I found that an older version of the R package "pmml" can export a Random Survival Forest constructed by an old version of the "randomForestSRC" package.

This converter was using some proprietary super-hackish way of encoding the "survival" transformation.

Basically, it was a tool for enriching the standard randomForest object with some extra information. It served informational purposes only: these models could not be evaluated by other PMML engines.

I'm just saying that it might be worthwhile to take some time and design a proper and future-proof extension to the latest PMML standard.

As for RandomSurvivalForest, I believe that the JPMML software stack can already do 90% of what is required (pre-processing, decision tree ensemble data structure). We just need to design the missing 10%, which takes care of post-processing.