google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
Apache License 2.0
470 stars, 49 forks

Cross validation does not support regression task. #115

Closed: decmca closed this issue 2 months ago

decmca commented 2 months ago

Cross-validation does not support regression tasks. See the code and error below. The same issue occurs with Random Forest. Removing the tuner makes no difference.

tuner = ydf.RandomSearchTuner(num_trials=50, automatic_search_space=True)

learner = ydf.GradientBoostedTreesLearner(label="orders", task=ydf.Task.REGRESSION, tuner=tuner)

evaluation = learner.cross_validation(train_df2, folds=10)

[WARNING 24-07-02 18:33:57.6357 BST] "goss_alpha" set but "sampling_method" not equal to "GOSS".
[WARNING 24-07-02 18:33:57.6357 BST] "goss_beta" set but "sampling_method" not equal to "GOSS".
[WARNING 24-07-02 18:33:57.6357 BST] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".

ValueError                                Traceback (most recent call last)
Cell In[36], line 8
      1 tuner = ydf.RandomSearchTuner(num_trials=50, automatic_search_space=True)
      3 learner = ydf.GradientBoostedTreesLearner(label="orders",
      4     task=ydf.Task.REGRESSION,
      5     tuner=tuner
      6 )
----> 8 evaluation = learner.cross_validation(train_df2, folds=10)

File ~/miniconda3/envs/ydf/lib/python3.11/site-packages/ydf/learner/, in GenericLearner.cross_validation(self, ds, folds, bootstrapping, parallel_evaluations)
    417 learner = self._get_learner()
    419 with log.cc_log_context():
--> 420   evaluation_proto = learner.Evaluate(
    421       vertical_dataset._dataset,  # pylint: disable=protected-access
    422       fold_generator,
    423       evaluation_options,
    424       deployment_evaluation,
    425   )
    426 return metric.Evaluation(evaluation_proto)

ValueError: INVALID_ARGUMENT: Classification requires a categorical label.

rstz commented 2 months ago

Hi, thank you for reporting. I wasn't immediately able to repro this on the adult dataset:

import ydf
import pandas as pd
ds_path = ""
dataset = pd.read_csv(f"{ds_path}/adult.csv")

learner = ydf.RandomForestLearner(label="age", task=ydf.Task.REGRESSION)
evaluation = learner.cross_validation(dataset, folds=5)

This trains a regression model and works fine in colab. Can you please provide a repro?

decmca commented 2 months ago

Hi, I think this is likely a MacBook issue. I ran your code and got the same error.

Will just use Colab, thanks.


ValueError                                Traceback (most recent call last)
Cell In[22], line 7
      4 dataset = pd.read_csv(f"{ds_path}/adult.csv")
      6 learner = ydf.RandomForestLearner(label="age", task=ydf.Task.REGRESSION)
----> 7 evaluation = learner.cross_validation(dataset, folds=5)

File ~/miniconda3/envs/ydf/lib/python3.11/site-packages/ydf/learner/, in GenericLearner.cross_validation(self, ds, folds, bootstrapping, parallel_evaluations)
    417 learner = self._get_learner()
    419 with log.cc_log_context():
--> 420   evaluation_proto = learner.Evaluate(
    421       vertical_dataset._dataset,  # pylint: disable=protected-access
    422       fold_generator,
    423       evaluation_options,
    424       deployment_evaluation,
    425   )
    426 return metric.Evaluation(evaluation_proto)

ValueError: INVALID_ARGUMENT: Classification requires a categorical label.

rstz commented 2 months ago

Thanks for getting back to me. A new Mac release is scheduled for this week, so hopefully it will fix this on Mac as well.
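
For anyone who hits the same error before a fixed Mac build is available, one possible stopgap is to run the folds by hand instead of calling cross_validation. The sketch below is not part of YDF: kfold_indices and manual_cross_validation are hypothetical helper names, it uses only the Python standard library, and it has not been tested against the affected Mac build.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple


def kfold_indices(n: int, folds: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return (train_idx, test_idx) pairs that partition range(n) into `folds` folds."""
    if not 2 <= folds <= n:
        raise ValueError("folds must be between 2 and n")
    order = list(range(n))
    random.Random(seed).shuffle(order)  # deterministic shuffle for a given seed
    splits = []
    for k in range(folds):
        test_idx = order[k::folds]  # every folds-th row, starting at offset k
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        splits.append((train_idx, test_idx))
    return splits


def manual_cross_validation(
    n_rows: int,
    folds: int,
    fit_and_score: Callable[[List[int], List[int]], float],
    seed: int = 0,
) -> float:
    """Call fit_and_score(train_idx, test_idx) once per fold and average the scores."""
    scores = [fit_and_score(tr, te) for tr, te in kfold_indices(n_rows, folds, seed)]
    return mean(scores)
```

With a pandas DataFrame df, fit_and_score could train on df.iloc[train_idx] with learner.train(...) and return model.evaluate(df.iloc[test_idx]).rmse, assuming those calls behave as described in the YDF documentation on your platform. Note that averaging per-fold scores is not necessarily the same number that cross_validation would report.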