loft-br / xgboost-survival-embeddings

Improving XGBoost survival analysis with embeddings and debiased estimators
https://loft-br.github.io/xgboost-survival-embeddings/
Apache License 2.0

xgboost 1.4.0+: ValueError: If using all scalar values, you must pass an index #31

Closed: jacobgqc closed this issue 3 years ago

jacobgqc commented 3 years ago

With xgboost 1.4.0 or 1.4.1, we now get an error: ValueError: If using all scalar values, you must pass an index

There is no error with 1.3.3.

With every release after 1.3.3, the ValueError is raised when calling XGBSEBootstrapEstimator.fit(). Tested with Python 3.7.2 and 3.8.6.

Trace:

C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\xgboost\core.py:101: UserWarning: ntree_limit is deprecated, use iteration_range or model slicing instead.
  warnings.warn(
Traceback (most recent call last):
  File "F:/git/gqc/pipe_breaks/script_runner.py", line 111, in <module>
    main()
  File "F:/git/gqc/pipe_breaks/script_runner.py", line 100, in main
    script_module.main()
  File "F:\git\gqc\pipe_breaks\algorithms\xgbse_gqc.py", line 271, in main
    do_extrapolation(X=X, X_valid=X_valid, X_train=X_train, y_train=y_train, main_ids=id_column)
  File "F:\git\gqc\pipe_breaks\algorithms\xgbse_gqc.py", line 122, in do_extrapolation
    bootstrap_estimator, mean, upper_ci, lower_ci = fit_predict_bootstrap_est(
  File "F:\git\gqc\pipe_breaks\algorithms\xgbse_gqc.py", line 208, in fit_predict_bootstrap_est
    bootstrap_estimator.fit(
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\xgbse\_meta.py", line 57, in fit
    trained_model = self.base_estimator.fit(X_sample, y_sample, **kwargs)
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\xgbse\_kaplan_neighbors.py", line 407, in fit
    pd.DataFrame({"leaf": leaves})
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\frame.py", line 467, in __init__
    mgr = init_dict(data, index, columns, dtype=dtype)
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\internals\construction.py", line 283, in init_dict
    return arrays_to_mgr(arrays, data_names, index, columns, dtype=dtype)
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\internals\construction.py", line 78, in arrays_to_mgr
    index = extract_index(arrays)
  File "C:\Users\jacob\Envs\xgbse_gqc_38\lib\site-packages\pandas\core\internals\construction.py", line 387, in extract_index
    raise ValueError("If using all scalar values, you must pass an index")
ValueError: If using all scalar values, you must pass an index

Code that raises the error:

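import numpy as np
from xgbse import XGBSEBootstrapEstimator

TIME_BINS = np.arange(5, 540, 5)  # time grid passed to fit()


def fit_predict_bootstrap_est(base_model, n_estimators, X_train, y_train, X_valid):
    """Instantiate, fit, and predict a bootstrap_estimator."""
    bootstrap_estimator = XGBSEBootstrapEstimator(base_model, n_estimators=n_estimators)
    # The ValueError is raised inside this fit() call.
    bootstrap_estimator.fit(
        X_train,
        y_train,
        time_bins=TIME_BINS,
    )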

I'm unable to share the training data itself, but the types and shapes are as follows:

X_train = DataFrame: (2916, 11)
y_train = ndarray: (4916,)
TIME_BINS = np.arange(5, 540, 5)

Requirements:

astor==0.8.1
autograd==1.3
autograd-gamma==0.5.0
backcall==0.2.0
colorama==0.4.4
cycler==0.10.0
decorator==5.0.7
ecos==2.0.7.post1
formulaic==0.2.3
future==0.18.2
interface-meta==1.2.3
ipykernel==5.5.3
ipython==7.22.0
ipython-genutils==0.2.0
jedi==0.18.0
joblib==1.0.1
jupyter-client==6.1.12
jupyter-core==4.7.1
kiwisolver==1.3.1
lifelines==0.25.11
matplotlib==3.3.0
numexpr==2.7.3
numpy==1.20.2
osqp==0.6.2.post0
pandas==1.1.0
parso==0.8.2
pickleshare==0.7.5
Pillow==8.2.0
prompt-toolkit==3.0.18
Pygments==2.8.1
pyparsing==2.4.7
python-dateutil==2.8.1
pytz==2021.1
pywin32==300
pyzmq==22.0.3
qdldl==0.1.5.post0
scikit-learn==0.24.1
scikit-survival==0.15.0.post0
scipy==1.6.2
six==1.15.0
threadpoolctl==2.1.0
toml==0.10.2
tornado==6.1
traitlets==5.0.5
wcwidth==0.2.5
wrapt==1.12.1
xgboost==1.3.3  # xgboost==1.4.1
xgbse==0.2.1

trivialfis commented 3 years ago

Hi, sorry for the change. The output shape of the predict function was changed for consistency. Previously it was (n_samples,), and now it is (n_samples, 1). We didn't anticipate this being a problem. The quickest fix is to call np.reshape and drop the last dimension. I'm not sure whether I should change it for all calls to the predict function, or whether we should revert the change in output shape.

@hcho3 .
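For illustration, the reshape workaround mentioned above could look roughly like the following. This is only a sketch using NumPy on made-up arrays; the helper name drop_trailing_unit_axis is hypothetical and is not part of xgboost or xgbse.

```python
import numpy as np


def drop_trailing_unit_axis(pred):
    """Return `pred` as 1-D when its shape is (n_samples, 1); otherwise unchanged.

    Hypothetical helper: it only illustrates the np.reshape workaround and is
    not taken from xgboost or xgbse.
    """
    pred = np.asarray(pred)
    if pred.ndim == 2 and pred.shape[1] == 1:
        # Collapse (n_samples, 1) to (n_samples,).
        return pred.reshape(-1)
    return pred


# Made-up stand-ins for the two output shapes described in the comment above.
old_style = np.array([3, 5, 5, 2])    # shape (4,)
new_style = old_style.reshape(-1, 1)  # shape (4, 1)

flat_old = drop_trailing_unit_axis(old_style)
flat_new = drop_trailing_unit_axis(new_style)
print(flat_old.shape, flat_new.shape)
```

Applying such a helper to the prediction output before building a DataFrame from it would keep downstream code working with either shape, since an already 1-D array passes through untouched.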

trivialfis commented 3 years ago

See https://github.com/dmlc/xgboost/pull/6889 . In general, I recommend using the strict_shape parameter with xgboost 1.4.x.

GabrielGimenez commented 3 years ago

Thanks @jacobgqc for your report and thank you very much @trivialfis for the help in finding the cause of the issue. We'll proceed with just the reshape fix and look into using strict_shape for the next version.

trivialfis commented 3 years ago

Thanks! I opened a PR in xgboost to revert the change; see the link above. If it's merged, we won't need any change in this project.

trivialfis commented 3 years ago

See https://github.com/dmlc/xgboost/issues/6920 .

If everything goes well I should prepare the release next week.

trivialfis commented 3 years ago

Hi, sorry for the long delay. 1.4.2 is out today.

GabrielGimenez commented 3 years ago

Thanks @trivialfis for the communication and the help with this issue. It's fixed in xgboost 1.4.2.
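For anyone pinned to an older environment, a small guard along these lines could flag the two releases reported in this thread (1.4.0 and 1.4.1). It is a sketch only: the function names and the AFFECTED set are hypothetical, and the set simply encodes what was reported above.

```python
from itertools import takewhile

# xgboost releases reported in this thread as raising the ValueError.
AFFECTED = {(1, 4, 0), (1, 4, 1)}


def parse_version(version):
    """Turn a string such as '1.4.1' into a tuple of three ints, e.g. (1, 4, 1).

    Only the leading digits of each field are used; missing fields become 0.
    """
    parts = []
    for field in version.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, field))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_affected(version):
    """Return True if `version` is one of the releases listed in AFFECTED."""
    return parse_version(version) in AFFECTED


for v in ("1.3.3", "1.4.0", "1.4.1", "1.4.2"):
    print(v, is_affected(v))
```

In practice the string passed in would be xgboost.__version__, and an affected version would call for either upgrading to 1.4.2 or applying the reshape workaround.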