google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

RandomSearchTuner with automatic search space ignores MaxDepth #98

Open TonyCongqianWang opened 3 weeks ago

TonyCongqianWang commented 3 weeks ago

I have a problem. When I use the RandomSearchTuner and fix the learner's num_trees, everything works as expected: every trial uses that num_trees. When I fix max_depth, however, it is simply ignored. When I use tuner.choice("max_depth", [1, 2, 3]), max_depth is respected during the trials, but I get the following error at the end:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../ydf/learner/specialized_learners.py", line 1548, in train
    return super().train(ds, valid)
  File "/.../ydf/learner/generic_learner.py", line 190, in train
    return self._train_from_dataset(ds, valid)
  File "/.../ydf/learner/generic_learner.py", line 241, in _train_from_dataset
    cc_model = learner.Train(**train_args)
ValueError: INVALID_ARGUMENT: The param "max_depth" is defined multiple times.
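For illustration only (this is not YDF code), the error is consistent with a hyper-parameter search space rejecting a name that is already registered, e.g. once by an automatic search space and once by a manual tuner.choice() call:

```python
# Illustrative sketch of why a duplicated hyper-parameter raises.
# The names and values here are hypothetical; only the error message
# mirrors the traceback above.
def add_param(space, name, values):
    """Register a hyper-parameter; reject duplicate definitions."""
    if name in space:
        raise ValueError(f'The param "{name}" is defined multiple times.')
    space[name] = values

space = {}
add_param(space, "max_depth", [16, 32])        # e.g. from an automatic search space
try:
    add_param(space, "max_depth", [1, 2, 3])   # e.g. a manual choice for the same name
except ValueError as e:
    print(e)  # The param "max_depth" is defined multiple times.
```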
achoum commented 2 weeks ago

Hi Tony,

Sorry to hear about your issue. Can you share a snippet of the training code to help me figure out the failing setup?

Also, make sure you don't have automatic_search_space=True, which will already define some of the hyper-parameter search space.

TonyCongqianWang commented 2 weeks ago

Hi,

I did use automatic_search_space=True, so that is probably the cause. It would be nice, though, if you could manually override some of the choices.

Alternatively, do you have any ideas how I can improve the results with max_depth=1 and high-dimensional categorical features? So far I only get constant models. I am trying to use your library to replicate the Viola-Jones face detection algorithm.

achoum commented 1 week ago

YDF does not count max_depth the same way as other libraries. It is something I regret, but it is too late to change :) In other words, if you want stumps, you need to set max_depth=2 instead of max_depth=1.
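One way to picture this convention (a pure-Python sketch, not YDF's implementation): if the root node itself counts as level 1, then a single leaf already has depth 1, and a stump (one split plus two leaves) has depth 2. Under that counting, max_depth=1 only permits a single leaf, i.e. a constant model.

```python
# Hypothetical illustration of a depth convention where the root
# counts as level 1. Under it, max_depth=1 allows only one leaf
# (a constant model) and a decision stump requires max_depth=2.
class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def depth(node):
    """Length in nodes of the longest root-to-leaf path."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))

leaf = Node()                 # a single leaf: constant prediction
stump = Node(Node(), Node())  # one split with two leaves

print(depth(leaf))   # 1
print(depth(stump))  # 2
```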

To train a Viola-Jones-like model with fixed thresholds, make sure to feed BOOLEAN features instead of CATEGORICAL ones. A BOOLEAN feature is essentially a CATEGORICAL feature with only two possible values, and it is faster to train on.
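As a sketch of that preprocessing step (plain Python, independent of YDF's API), a high-cardinality categorical column can be expanded into per-category boolean indicator columns before training:

```python
# Sketch: expand one categorical column into boolean indicator
# features, one column per category. Column names are made up
# for illustration.
def one_hot(values):
    categories = sorted(set(values))
    return {f"is_{c}": [v == c for v in values] for c in categories}

features = one_hot(["red", "green", "red", "blue"])
print(features["is_red"])  # [True, False, True, False]
```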

You could also feed NUMERICAL features and let YDF figure out the thresholds.

Note also that the Viola-Jones algorithm is a boosting algorithm (like AdaBoost), which is a bit different from a gradient boosting algorithm. However, I would expect gradient boosting to give similar results. If you get results (good or bad), don't hesitate to share them. I would be very interested.

In this kind of situation, where there are many correlated numerical features, oblique forests sometimes give excellent results. It would also be interesting to try. For example:

GradientBoostedTreesLearner(split_axis="SPARSE_OBLIQUE", sparse_oblique_num_projections_exponent=1.5, sparse_oblique_max_num_projections=500, ...)
TonyCongqianWang commented 1 week ago

Thanks for your reply! That explains a lot. I was confused about why my models all turned out to be constant, but it turns out max_depth=1 forces them to be constant! Maybe a warning should be issued when people set max_depth=1? There should also be some hint in the documentation about this different interpretation.