google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

`INVALID_ARGUMENT: Too much categorical conditions` - how many is too many? #118

Open AlexanderLavelle opened 1 month ago

AlexanderLavelle commented 1 month ago

When I try to evaluate my model or make predictions on the val set, I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[22], line 2
      1 # Evaluate a model (e.g. roc, accuracy, confusion matrix, confidence intervals)
----> 2 model.evaluate(val)

File ~/miniforge-pypy3/envs/notebook/lib/python3.10/site-packages/ydf/model/generic_model.py:434, in GenericModel.evaluate(self, data, bootstrapping, weighted)
    423     raise ValueError(
    424         "bootstrapping argument should be boolean or an integer greater"
    425         " than 100 as bootstrapping will not yield useful results. Got"
    426         f" {bootstrapping!r} instead"
    427     )
    429   options_proto = metric_pb2.EvaluationOptions(
    430       bootstrapping_samples=bootstrapping_samples,
    431       task=self.task()._to_proto_type(),  # pylint: disable=protected-access
    432   )
--> 434   evaluation_proto = self._model.Evaluate(
    435       ds._dataset, options_proto, weighted=weighted
    436   )  # pylint: disable=protected-access
    437 return metric.Evaluation(evaluation_proto)

ValueError: INVALID_ARGUMENT: Too much categorical conditions.

Here is the DataSpec from .describe()

Number of records: 10354318
Number of columns: 13

Number of columns by type:
    CATEGORICAL: 12 (92.3077%)
    NUMERICAL: 1 (7.69231%)

Columns:

CATEGORICAL: 12 (92.3077%)
    0: "Col1" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"0" 9080765 (87.7003%) dtype:DTYPE_INT64
    2: "Col2" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"1" 5605346 (54.1353%) dtype:DTYPE_BYTES
    3: "Col3" CATEGORICAL has-dict vocab-size:67 zero-ood-items most-frequent:"24" 734340 (7.09211%) dtype:DTYPE_BYTES
    4: "Col4" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"1" 10333860 (99.8024%) dtype:DTYPE_BYTES
    5: "Col5" CATEGORICAL has-dict vocab-size:54 num-oods:1 (9.65781e-06%) most-frequent:"28.0" 3105582 (29.9931%) dtype:DTYPE_BYTES
    6: "Col6" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"0" 5560031 (53.6977%) dtype:DTYPE_BYTES
    7: "Col7" CATEGORICAL has-dict vocab-size:4 zero-ood-items most-frequent:"1-2 Year" 5384018 (51.9978%) dtype:DTYPE_BYTES
    8: "Col8" CATEGORICAL has-dict vocab-size:3 zero-ood-items most-frequent:"Yes" 5204863 (50.2676%) dtype:DTYPE_BYTES
    9: "Col9" CATEGORICAL has-dict vocab-size:2001 num-oods:5846153 (56.461%) most-frequent:"<OOD>" 5846153 (56.461%) dtype:DTYPE_BYTES
    10: "Col10" CATEGORICAL has-dict vocab-size:148 num-oods:11 (0.000106236%) most-frequent:"152.0" 3750149 (36.2182%) dtype:DTYPE_BYTES
    11: "Col11" CATEGORICAL has-dict vocab-size:291 zero-ood-items most-frequent:"187" 88301 (0.852794%) dtype:DTYPE_BYTES
    12: "Col12" CATEGORICAL has-dict vocab-size:18 zero-ood-items most-frequent:"0.0" 2062871 (19.9228%) dtype:DTYPE_BYTES

NUMERICAL: 1 (7.69231%)
    1: "weights" NUMERICAL mean:4.06513 min:4.06513 max:4.06513 sd:nan dtype:DTYPE_FLOAT64

Terminology:
    nas: Number of non-available (i.e. missing) values.
    ood: Out of dictionary.
    manually-defined: Attribute whose type is manually defined by the user, i.e., the type was not automatically inferred.
    tokenized: The attribute value is obtained through tokenization.
    has-dict: The attribute is attached to a string dictionary e.g. a categorical attribute stored as a string.
    vocab-size: Number of unique values.

There are probably a lot of categories, but I would have thought that with categorical sets it would be fine?

rstz commented 1 month ago

This doesn't look right; I don't see how this error would be triggered for this data spec. Can you please provide a bit more information about your dataset and hyperparameters?

AlexanderLavelle commented 1 month ago

This is just an ordinary Kaggle tabular dataset. Perhaps it's going OOM on predict?

model = (
    ydf
    .GradientBoostedTreesLearner(
        label="Response", 
        weights='weights',
        max_depth=100,
        num_trees=1_000, 
    )
    .train(train)
)

RAM is nearly full, but swap still has > 60 GB free.

AlexanderLavelle commented 1 month ago

With max_depth lowered to 25, I was able to get a prediction. Still curious why a tree depth of 100 is so negatively impactful?

rstz commented 1 month ago

I believe the issue is that the trees are very, very deep (especially for GBTs). For inference, YDF transforms the model to use a buffer that contains all the categorical splits (i.e. splits of the form "featureA in [featureAvalue1, featureAvalue4, ...]"). This buffer can have at most std::numeric_limits::max() entries. Each split occupies 100 entries (1 per feature), so you're limited to about 43 million categorical splits in this case. That sounds like a lot, but if your trees have max_depth 100, you will have a lot of splits.
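For a rough sense of scale, here is the arithmetic behind that figure (the 32-bit width of the buffer index is an assumption inferred from the 43 million number, not something stated in this thread):

# Back-of-envelope check of the limit described above. Assumes a 32-bit
# buffer index, which is consistent with the ~43 million figure.
buffer_capacity = 2**32 - 1           # maximum entries in the categorical-split buffer (assumed)
entries_per_split = 100               # "each split occupies 100 entries" for this model
max_categorical_splits = buffer_capacity // entries_per_split
print(f"{max_categorical_splits:,}")  # 42,949,672 -> the "43 million" mentioned above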

In C++, YDF has support for an inference engine that does not have this limitation (probably, haven't tried it). However, this engine is much slower than what we expose in PYDF. It sounds like exposing the slow engine might be a useful solution for some less common models such as the one you built - I'll try to prioritize this.

AlexanderLavelle commented 1 month ago

@rstz Based on your comment, I am curious how many people would also be looking to build trees this deep. I have a different use case with a vocabulary of up to 1M tokenized values per feature set.

I suppose the BQML frontend allows only 50k. Even at 50k features, what sort of depth and width would you expect to be required? The data shape would be roughly 20-30M rows, 50k columns, and a binary outcome.

It might be enough to just give a warning when max_depth > 50 or so, instead of opening a feature request?

On the note of feature requests, I am curious if YDF can support multilabel outcomes (not mutually exclusive outcomes, shared tree space)?

rstz commented 1 month ago

As always in ML the answer will depend on your data, but I'll give some more-or-less educated guesses.

The theory of boosting suggests using mostly small trees (see e.g. Intro to Statistical Learning Chapter 8.2) to avoid overfitting with individual trees. In practice, we've seen YDF's default of 6 or values in the range 2-10 perform well. Small trees have the added advantage that the model is much smaller and inference can be much faster. I'd be interested in how model quality changes with max_depth in your use case.

As an aside, note that we have seen GBTs perform better when ignoring the max_depth parameter altogether. Instead set growing_strategy=BEST_FIRST_GLOBAL and tune the max_num_nodes hyperparameter to control the size of the tree.
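A minimal sketch of that configuration, reusing the learner arguments from the snippet earlier in this thread (the max_num_nodes value below is only a placeholder to tune):

import ydf

model = (
    ydf.GradientBoostedTreesLearner(
        label="Response",
        weights="weights",
        num_trees=1_000,
        growing_strategy="BEST_FIRST_GLOBAL",  # grow the globally best leaves instead of depth-first
        max_num_nodes=64,                      # control tree size with this instead of max_depth
    )
    .train(train)
)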

Having 50k or more features often happens when using one-hot encoding on categorical features. One-hot encoding is not recommended when using decision forests. Instead, categorical features should be fed directly. This allows the tree to perform splits on multiple categories at once (e.g. if featureA in [val1, val4, val9], go left. Go right otherwise), which improves prediction quality in nearly all cases. When feeding text, consider using categorical set features [1] or pretrained embeddings.
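As an illustration (toy data, not the dataset from this issue), a raw string column can be passed as-is and YDF will pick up the CATEGORICAL semantic on its own:

import pandas as pd
import ydf

# Toy data: the categorical column stays as raw strings; no one-hot encoding.
df = pd.DataFrame({
    "vehicle_age": ["1-2 Year", "< 1 Year", "> 2 Years", "1-2 Year"] * 50,
    "label": [0, 1, 0, 1] * 50,
})

# The learner can then split on several category values at once,
# e.g. "vehicle_age in ['< 1 Year', '> 2 Years']".
model = ydf.GradientBoostedTreesLearner(label="label", num_trees=10).train(df)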

Re: Multi-label outcomes - can you please open a separate issue for this? I think Yggdrasil might have a solution for this, but it's probably not yet exposed in the Python API.

[1] Categorical Sets are unfortunately broken in the Python API until our pending fix to #113 has landed. The fix and a tutorial will be included in the next release.

AlexanderLavelle commented 1 month ago

> The theory of boosting suggests using mostly small trees (see e.g. [Intro to Statistical Learning Chapter 8.2](https://hastie.su.domains/ISLP/ISLP_website.pdf.download.html)) to avoid overfitting with individual trees. In practice, we've seen YDF's [default of 6](https://ydf.readthedocs.io/en/latest/hyperparameters/#max_depth) or values in the range 2-10 perform well. Small trees have the added advantage that the model is much smaller and inference can be much faster. I'd be interested in how model quality changes with max_depth in your use case.

Yep, this sounds about right. There is definitely something to be said about stats, but I think this library is one of the first to elegantly enable such wide (particularly highly sparse) data sets. In that sense I am very curious going forward about how tens to hundreds of thousands of features affect desirable tree depths.

Ohhh sure growing strategy definitely makes the MOST sense.

Yes, I have kept features relatively sparse by avoiding OHE. Unfortunate that categorical sets aren't working in the Python API -- this is an incredibly important feature of ydf!

rstz commented 3 weeks ago

Categorical Sets are now working in 0.7.0 - we plan to publish guides in the near future.
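Until the guides are published, here is a minimal sketch of what a categorical-set feature could look like; the column declaration below (ydf.Feature with ydf.Semantic.CATEGORICAL_SET) and the toy data are assumptions for illustration, not taken from this thread:

import pandas as pd
import ydf

# Toy data: each example carries a variable-length set of tokens.
df = pd.DataFrame({
    "tokens": [["red", "suv"], ["blue", "sedan", "used"], ["red", "used"], ["suv"]] * 50,
    "label": [0, 1, 0, 1] * 50,
})

# Assumed way to declare the semantic explicitly; see the upcoming guides
# for the officially supported usage in 0.7.0+.
model = ydf.GradientBoostedTreesLearner(
    label="label",
    features=[ydf.Feature("tokens", ydf.Semantic.CATEGORICAL_SET)],
    num_trees=10,
).train(df)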