nyanp / nyaggle

Code for Kaggle and Offline Competitions
MIT License

cross_validate doesn't work with LightGBM v4.0.0 #112

Open yuta100101 opened 1 year ago

yuta100101 commented 1 year ago

Thanks for publishing such a useful tool!

A few days ago, LightGBM 4.0.0 was released.
In this release, the early_stopping_rounds argument of fit() was removed.

As a result, functions that call cross_validate(), such as run_experiment(), no longer work. (Other functions may be broken too; I haven't investigated yet.)

Of course, there is no problem with versions up to 3.3.5.

pytest log

```
(nyaggle) yuta100101:~/nyaggle(master =)$ pytest tests/validation/test_cross_validate.py::test_cv_lgbm
========================================== test session starts ===========================================
platform linux -- Python 3.9.17, pytest-7.4.0, pluggy-1.2.0
rootdir: /home/yuta100101/practice/nyaggle
collected 1 item

tests/validation/test_cross_validate.py F                                                          [100%]

================================================ FAILURES ================================================
______________________________________________ test_cv_lgbm ______________________________________________

    def test_cv_lgbm():
        X, y = make_classification(n_samples=1024, n_features=20, class_sep=0.98, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

        models = [LGBMClassifier(n_estimators=300) for _ in range(5)]

>       pred_oof, pred_test, scores, importance = cross_validate(models, X_train, y_train, X_test, cv=5, eval_func=roc_auc_score, fit_params={'early_stopping_rounds': 200})

tests/validation/test_cross_validate.py:52:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

estimator = [LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300), LGBMClassifier(n_estimators=300)]
X_train = 0 1 2 3 4 5 6 7 8 ... 11 12...
... -0.109782 -0.412230 1.707714 -0.240937 -0.276747 0.481276 -0.278111 1.304773 -0.139538

[512 rows x 20 columns]
y = 0      0
1      0
2      0
3      1
4      0
      ..
507    0
508    1
509    0
510    1
511    0
Name: target, Length: 512, dtype: int64
X_test = 0 1 2 3 4 5 6 7 8 ... 11 12...
... -2.598922 -0.351561 0.233836 -1.873634 -1.089221 0.373956 -0.520939 -0.489945 2.452996

[512 rows x 20 columns]
cv = KFold(n_splits=5, random_state=0, shuffle=True), groups = None, eval_func = , logger =
on_each_fold = None, fit_params = {'early_stopping_rounds': 200}, importance_type = 'gain', early_stopping = True, type_of_target = 'binary'

    def cross_validate(estimator: Union[BaseEstimator, List[BaseEstimator]],
                       X_train: Union[pd.DataFrame, np.ndarray],
                       y: Union[pd.Series, np.ndarray],
                       X_test: Union[pd.DataFrame, np.ndarray] = None,
                       cv: Optional[Union[int, Iterable, BaseCrossValidator]] = None,
                       groups: Optional[pd.Series] = None,
                       eval_func: Optional[Callable] = None,
                       logger: Optional[Logger] = None,
                       on_each_fold: Optional[Callable[[int, BaseEstimator, pd.DataFrame, pd.Series], None]] = None,
                       fit_params: Optional[Union[Dict[str, Any], Callable]] = None,
                       importance_type: str = 'gain',
                       early_stopping: bool = True,
                       type_of_target: str = 'auto') -> CVResult:
        """
        Evaluate metrics by cross-validation. It also records out-of-fold prediction and test prediction.

        Args:
            estimator:
                The object to be used in cross-validation. For list inputs, ``estimator[i]`` is trained on i-th fold.
            X_train:
                Training data
            y:
                Target
            X_test:
                Test data (Optional). If specified, prediction on the test data is performed using ensemble of models.
            cv:
                int, cross-validation generator or an iterable which determines the cross-validation splitting strategy.

                - None, to use the default ``KFold(5, random_state=0, shuffle=True)``,
                - integer, to specify the number of folds in a ``(Stratified)KFold``,
                - CV splitter (the instance of ``BaseCrossValidator``),
                - An iterable yielding (train, test) splits as arrays of indices.
            groups:
                Group labels for the samples. Only used in conjunction with a “Group” cv instance (e.g., ``GroupKFold``).
            eval_func:
                Function used for logging and returning scores
            logger:
                logger
            on_each_fold:
                called for each fold with (idx_fold, model, X_fold, y_fold)
            fit_params:
                Parameters passed to the fit method of the estimator
            importance_type:
                The type of feature importance to be used to calculate result.
                Used only in ``LGBMClassifier`` and ``LGBMRegressor``.
            early_stopping:
                If ``True``, ``eval_set`` will be added to ``fit_params`` for each fold.
                ``early_stopping_rounds = 100`` will also be appended to fit_params if it does not already have one.
            type_of_target:
                The type of target variable. If ``auto``, type is inferred by ``sklearn.utils.multiclass.type_of_target``.
                Otherwise, ``binary``, ``continuous``, or ``multiclass`` are supported.
        Returns:
            Namedtuple with following members

            * oof_prediction (numpy array, shape (len(X_train),)):
                The predicted value on put-of-Fold validation data.
            * test_prediction (numpy array, hape (len(X_test),)):
                The predicted value on test data. ``None`` if X_test is ``None``.
            * scores (list of float, shape (nfolds+1,)):
                ``scores[i]`` denotes validation score in i-th fold.
                ``scores[-1]`` is the overall score. `None` if eval is not specified.
            * importance (list of pandas DataFrame, shape (nfolds,)):
                ``importance[i]`` denotes feature importance in i-th fold model.
                If the estimator is not GBDT, empty array is returned.

        Example:
            >>> from sklearn.datasets import make_regression
            >>> from sklearn.linear_model import Ridge
            >>> from sklearn.metrics import mean_squared_error
            >>> from nyaggle.validation import cross_validate

            >>> X, y = make_regression(n_samples=8)
            >>> model = Ridge(alpha=1.0)
            >>> pred_oof, pred_test, scores, _ = \
            >>>     cross_validate(model,
            >>>                    X_train=X[:3, :],
            >>>                    y=y[:3],
            >>>                    X_test=X[3:, :],
            >>>                    cv=3,
            >>>                    eval_func=mean_squared_error)
            >>> print(pred_oof)
            [-101.1123267 , 26.79300693, 17.72635528]
            >>> print(pred_test)
            [-10.65095894 -12.18909059 -23.09906427 -17.68360714 -20.08218267]
            >>> print(scores)
            [71912.80290003832, 15236.680239881942, 15472.822033121925, 34207.43505768073]
        """
        cv = check_cv(cv, y)
        n_output_cols = 1
        if type_of_target == 'auto':
            type_of_target = multiclass.type_of_target(y)
        if type_of_target == 'multiclass':
            n_output_cols = y.nunique(dropna=True)

        if isinstance(estimator, list):
            assert len(estimator) == cv.get_n_splits(), "Number of estimators should be same to nfolds."

        X_train = convert_input(X_train)
        y = convert_input_vector(y, X_train.index)
        if X_test is not None:
            X_test = convert_input(X_test)

        if not isinstance(estimator, list):
            estimator = [estimator] * cv.get_n_splits()

        assert len(estimator) == cv.get_n_splits()

        if logger is None:
            logger = getLogger(__name__)

        def _predict(model: BaseEstimator, x: pd.DataFrame, _type_of_target: str):
            if _type_of_target in ('binary', 'multiclass'):
                if hasattr(model, "predict_proba"):
                    proba = model.predict_proba(x)
                elif hasattr(model, "decision_function"):
                    warnings.warn('Since {} does not have predict_proba method, '
                                  'decision_function is used for the prediction instead.'.format(type(model)))
                    proba = model.decision_function(x)
                else:
                    raise RuntimeError('Estimator in classification problem should have '
                                       'either predict_proba or decision_function')
                if proba.ndim == 1:
                    return proba
                else:
                    return proba[:, 1] if proba.shape[1] == 2 else proba
            else:
                return model.predict(x)

        oof = np.zeros((len(X_train), n_output_cols)) if n_output_cols > 1 else np.zeros(len(X_train))
        evaluated = np.full(len(X_train), False)
        test = None
        if X_test is not None:
            test = np.zeros((len(X_test), n_output_cols)) if n_output_cols > 1 else np.zeros(len(X_test))

        scores = []
        eta_all = []
        importance = []

        for n, (train_idx, valid_idx) in enumerate(cv.split(X_train, y, groups)):
            start_time = time.time()

            train_x, train_y = X_train.iloc[train_idx], y.iloc[train_idx]
            valid_x, valid_y = X_train.iloc[valid_idx], y.iloc[valid_idx]

            if fit_params is None:
                fit_params_fold = {}
            elif callable(fit_params):
                fit_params_fold = fit_params(n, train_idx, valid_idx)
            else:
                fit_params_fold = copy.copy(fit_params)

            if is_gbdt_instance(estimator[n], ('lgbm', 'cat', 'xgb')):
                if early_stopping:
                    if 'eval_set' not in fit_params_fold:
                        fit_params_fold['eval_set'] = [(valid_x, valid_y)]
                    if 'early_stopping_rounds' not in fit_params_fold:
                        fit_params_fold['early_stopping_rounds'] = 100

>               estimator[n].fit(train_x, train_y, **fit_params_fold)
E               TypeError: fit() got an unexpected keyword argument 'early_stopping_rounds'

nyaggle/validation/cross_validate.py:177: TypeError
======================================== short test summary info =========================================
FAILED tests/validation/test_cross_validate.py::test_cv_lgbm - TypeError: fit() got an unexpected keyword argument 'early_stopping_rounds'
=========================================== 1 failed in 1.90s ============================================
```

nyanp commented 1 year ago

@yuta100101 Thank you for reporting! It should be replaced with the callback API.
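
For illustration, here is a minimal sketch of what moving to the callback API could look like. It is not the fix that was merged into nyaggle; `migrate_early_stopping` and `fit_lgbm_fold` are hypothetical helper names. The only LightGBM call used is `lightgbm.early_stopping`, which builds the early-stopping callback and is imported lazily so the first helper has no hard dependency.

```python
from typing import Any, Callable, Dict, Optional


def migrate_early_stopping(
    fit_params: Optional[Dict[str, Any]],
    make_callback: Callable[[int], Any],
) -> Dict[str, Any]:
    """Return a copy of fit_params with early_stopping_rounds turned into a callback.

    Hypothetical helper: make_callback is expected to be lightgbm.early_stopping,
    but any factory taking the number of rounds works.
    """
    params = dict(fit_params or {})
    rounds = params.pop("early_stopping_rounds", None)
    if rounds is not None:
        # Build a new list so the caller's callbacks list is not mutated.
        existing = list(params.get("callbacks") or [])
        params["callbacks"] = existing + [make_callback(rounds)]
    return params


def fit_lgbm_fold(model, train_x, train_y, valid_x, valid_y, fit_params=None):
    """Fit one fold without passing early_stopping_rounds to fit().

    Assumes lightgbm is installed and that model is an LGBMClassifier or LGBMRegressor.
    """
    import lightgbm as lgb  # lazy import: only needed when a fold is actually fitted

    params = migrate_early_stopping(fit_params, lgb.early_stopping)
    # Early stopping needs a validation set, as in the existing cross_validate() logic.
    params.setdefault("eval_set", [(valid_x, valid_y)])
    return model.fit(train_x, train_y, **params)
```

With this shape, a caller could keep passing `fit_params={'early_stopping_rounds': 200}` as in the failing test, and the value would reach LightGBM as a callback instead of a keyword argument.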

wakame1367 commented 1 year ago

As a temporary measure, I have set a version constraint on the installation of LightGBM. The version has been limited to LightGBM<4.0.0. I plan to address the main fix for this bug in a separate pull request.
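
The constraint itself is just the requirement specifier `lightgbm<4.0.0`. As a rough sketch of the same check at runtime (the function names are hypothetical and not part of nyaggle), one could compare the major version of the installed package:

```python
def is_supported_lightgbm(version: str) -> bool:
    """Return True when the version string satisfies the temporary pin lightgbm<4.0.0."""
    # Only the major component matters for this pin.
    major = int(version.split(".")[0])
    return major < 4


def installed_lightgbm_is_supported() -> bool:
    """Check the installed lightgbm distribution (raises if it is not installed)."""
    from importlib.metadata import version  # standard library, Python 3.8+

    return is_supported_lightgbm(version("lightgbm"))
```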

wakame1367 commented 1 year ago

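Here is an article that may be helpful in resolving this issue. Qiita - "The early_stopping behavior of LightGBM has changed, so I looked into how to use it" (original title: LightGBMのearly_stoppingの仕様が変わったので、使用法を調べてみた)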

yuta100101 commented 1 year ago

Not only cross_validate() but also find_best_lgbm_parameter() is affected, so it might be better to modify this library after Optuna's support for LightGBM 4.0.0 (probably the PRs shown below) has been released.

yuta100101 commented 1 year ago

Sorry, I should have been more specific: find_best_lgbm_parameter() is affected by the removal of the fobj argument from train().
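
To make the fobj change concrete: as far as I understand the 4.0.0 API, a custom objective is now passed as a callable under `params["objective"]` rather than through the `fobj` argument. The sketch below uses hypothetical helper names and only illustrates the call shape; the actual fix for find_best_lgbm_parameter() would have to happen in Optuna's LightGBM integration.

```python
from typing import Any, Callable, Dict, Optional


def params_with_custom_objective(
    params: Dict[str, Any],
    fobj: Optional[Callable] = None,
) -> Dict[str, Any]:
    """Return a copy of params with the custom objective stored under 'objective'.

    Hypothetical helper: mirrors the old train(params, ..., fobj=fobj) call shape.
    """
    new_params = dict(params)
    if fobj is not None:
        # LightGBM 4.x takes the callable here instead of via the fobj argument.
        new_params["objective"] = fobj
    return new_params


def train_with_custom_objective(params, train_set, fobj=None, **kwargs):
    """Call lightgbm.train() without the removed fobj argument (requires lightgbm)."""
    import lightgbm as lgb  # lazy import so the helper above stays dependency-free

    return lgb.train(params_with_custom_objective(params, fobj), train_set, **kwargs)
```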