google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

Cross validation does not support regression task. #115

decmca closed this issue 2 months ago

decmca commented 2 months ago

Cross-validation fails for regression tasks. See the code and error below. The same issue occurs with Random Forest. Removing the tuner makes no difference.

tuner = ydf.RandomSearchTuner(num_trials=50, automatic_search_space=True)
learner = ydf.GradientBoostedTreesLearner(label="orders", task=ydf.Task.REGRESSION, tuner=tuner)
evaluation = learner.cross_validation(train_df2, folds=10)

[WARNING 24-07-02 18:33:57.6357 BST gradient_boosted_trees.cc:1840] "goss_alpha" set but "sampling_method" not equal to "GOSS".
[WARNING 24-07-02 18:33:57.6357 BST gradient_boosted_trees.cc:1851] "goss_beta" set but "sampling_method" not equal to "GOSS".
[WARNING 24-07-02 18:33:57.6357 BST gradient_boosted_trees.cc:1865] "selective_gradient_boosting_ratio" set but "sampling_method" not equal to "SELGB".

ValueError                                Traceback (most recent call last)
Cell In[36], line 8
      1 tuner = ydf.RandomSearchTuner(num_trials=50, automatic_search_space=True)
      3 learner = ydf.GradientBoostedTreesLearner(label="orders",
      4                                           task=ydf.Task.REGRESSION,
      5                                           tuner=tuner
      6                                           )
----> 8 evaluation = learner.cross_validation(train_df2, folds=10)

File ~/miniconda3/envs/ydf/lib/python3.11/site-packages/ydf/learner/generic_learner.py:420, in GenericLearner.cross_validation(self, ds, folds, bootstrapping, parallel_evaluations)
    417 learner = self._get_learner()
    419 with log.cc_log_context():
--> 420   evaluation_proto = learner.Evaluate(
    421       vertical_dataset._dataset,  # pylint: disable=protected-access
    422       fold_generator,
    423       evaluation_options,
    424       deployment_evaluation,
    425   )
    426 return metric.Evaluation(evaluation_proto)

ValueError: INVALID_ARGUMENT: Classification requires a categorical label.

rstz commented 2 months ago

Hi, thank you for reporting. I wasn't immediately able to repro this on the adult dataset:

import ydf
import pandas as pd
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
dataset = pd.read_csv(f"{ds_path}/adult.csv")

learner = ydf.RandomForestLearner(label="age", task=ydf.Task.REGRESSION)
evaluation = learner.cross_validation(dataset, folds=5)

This trains a regression model and works fine in Colab. Can you please provide a repro?

decmca commented 2 months ago

Hi, I think this is likely a MacBook issue. I ran your code and got the same error.

Will just use Colab, thanks.

Error:


ValueError                                Traceback (most recent call last)
Cell In[22], line 7
      4 dataset = pd.read_csv(f"{ds_path}/adult.csv")
      6 learner = ydf.RandomForestLearner(label="age", task=ydf.Task.REGRESSION)
----> 7 evaluation = learner.cross_validation(dataset, folds=5)

File ~/miniconda3/envs/ydf/lib/python3.11/site-packages/ydf/learner/generic_learner.py:420, in GenericLearner.cross_validation(self, ds, folds, bootstrapping, parallel_evaluations)
    417 learner = self._get_learner()
    419 with log.cc_log_context():
--> 420   evaluation_proto = learner.Evaluate(
    421       vertical_dataset._dataset,  # pylint: disable=protected-access
    422       fold_generator,
    423       evaluation_options,
    424       deployment_evaluation,
    425   )
    426 return metric.Evaluation(evaluation_proto)

ValueError: INVALID_ARGUMENT: Classification requires a categorical label.
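
Because the same script succeeds in Colab but fails locally, the platform and installed package version are probably the most useful details to attach to a report like this. A minimal sketch for collecting them with the standard library only is shown here. The helper name `environment_report` is hypothetical, and the sketch assumes the package was installed from the `ydf` pip distribution.

```python
import platform
from importlib import metadata


def environment_report(package: str = "ydf") -> dict:
    """Collect OS, CPU architecture, Python version and a package version.

    `environment_report` is a hypothetical helper, not part of ydf.
    The package version is None when the distribution is not installed.
    """
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        version = None
    return {
        "system": platform.system(),    # e.g. "Darwin" on macOS, "Linux" in Colab
        "machine": platform.machine(),  # CPU architecture string
        "python": platform.python_version(),
        "package": package,
        "package_version": version,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Running this in both environments and comparing the output would show whether the two installs differ in version or only in platform.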

rstz commented 2 months ago

Thanks for getting back to me. A new Mac release is scheduled for this week, so hopefully it will fix this on Mac as well.
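
Until a fixed build is installed, one possible stopgap is to run the folds by hand instead of calling `learner.cross_validation`. The sketch below is an assumption-laden illustration, not the library's own cross-validation: `kfold_indices`, `manual_cross_validation`, and `ydf_rmse` are hypothetical helpers, the split is a plain shuffled k-fold, and `ydf_rmse` assumes a pandas DataFrame plus the `train`, `evaluate`, and `rmse` members of the ydf Python API. Whether per-fold training and evaluation avoid the error on the affected Mac build has not been verified in this thread.

```python
import random
from typing import Callable, List, Tuple


def kfold_indices(n_rows: int, folds: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return (train_idx, test_idx) pairs for a shuffled k-fold split."""
    if folds < 2 or folds > n_rows:
        raise ValueError("folds must be between 2 and n_rows")
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    splits = []
    for k in range(folds):
        test_idx = order[k::folds]  # every folds-th row, starting at offset k
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        splits.append((train_idx, test_idx))
    return splits


def manual_cross_validation(
    n_rows: int,
    folds: int,
    fit_and_score: Callable[[List[int], List[int]], float],
) -> List[float]:
    """Call fit_and_score(train_idx, test_idx) once per fold and collect the scores."""
    return [fit_and_score(train, test) for train, test in kfold_indices(n_rows, folds)]


def ydf_rmse(dataset, label: str, train_idx: List[int], test_idx: List[int]) -> float:
    """Train a regression Random Forest on one fold and return its test RMSE.

    Assumes `dataset` is a pandas DataFrame. ydf is imported lazily so the
    splitting helpers above can be used without it.
    """
    import ydf

    learner = ydf.RandomForestLearner(label=label, task=ydf.Task.REGRESSION)
    model = learner.train(dataset.iloc[train_idx])
    return model.evaluate(dataset.iloc[test_idx]).rmse
```

With the adult dataset loaded as `dataset`, a call such as `manual_cross_validation(len(dataset), 5, lambda tr, te: ydf_rmse(dataset, "age", tr, te))` would return one RMSE per fold. Unlike the built-in method, this returns a list of per-fold scores rather than a single aggregated evaluation.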