Missing data in input features

dpellow commented 1 year ago

I thought that random forests should be able to handle missing data, but when I train an RSF model with some feature values missing, it produces the following error:

Traceback (most recent call last):
  File "/home/davidpel/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/davidpel/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 405, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/home/davidpel/anaconda3/lib/python3.9/site-packages/sksurv/ensemble/forest.py", line 89, in fit
    X = self._validate_data(X, dtype=DTYPE, accept_sparse="csc", ensure_min_samples=2)
  File "/home/davidpel/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 546, in _validate_data
    X = check_array(X, input_name="X", **check_params)
  File "/home/davidpel/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 921, in check_array
    _assert_all_finite(
  File "/home/davidpel/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py", line 161, in _assert_all_finite
    raise ValueError(msg_err)
ValueError: Input X contains NaN.
RandomSurvivalForest does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

Does the RSF implementation not handle missing values in the input features? Do any of the other models support them?

Version info:

System:
    python: 3.9.13 (main, Aug 25 2022, 23:26:10)  [GCC 11.2.0]
   machine: Linux-3.10.0-514.2.2.el7.x86_64-x86_64-with-glibc2.17
Python dependencies:
      sklearn: 1.2.1
          pip: 22.3.1
   setuptools: 65.6.3
        numpy: 1.23.5
        scipy: 1.10.0
       Cython: None
       pandas: 1.5.3
   matplotlib: 3.7.0
       joblib: 1.1.1
threadpoolctl: 2.2.0
sksurv: 0.20.0
ModuleNotFoundError: No module named 'cvxopt'
ModuleNotFoundError: No module named 'cvxpy'
numexpr: 2.8.4
osqp: 0.6.2

dpellow commented 1 year ago

Note: this is enforced by a finiteness check in sklearn's _validate_data. If the model is actually meant to allow NaNs, you could pass force_all_finite="allow-nan" to the _validate_data call (line 89 of forest.py).

sebp commented 1 year ago

Theoretically, a random forest could handle missing values with surrogate splits; however, this isn't implemented in sksurv and, AFAIK, not in sklearn either. Currently, sksurv includes no model that can deal with missing values implicitly.

sebp / scikit-survival

Missing data in input features #344