google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

Prediction is extremely slow for categorical decision stumps #109

Open TonyCongqianWang opened 3 months ago

TonyCongqianWang commented 3 months ago

I have categorical features with 256 possible values each. I train a small forest of ~10 non-oblique decision stumps, so each tree has only one decision node based on exactly one feature.

The model trains somewhat slowly, but what surprises me is how slow the prediction method runs. I don't know the exact implementation, but it seems that each tree node is represented as a list, which is extremely slow. Predicting just a few samples takes multiple seconds, which is much slower than I would expect even from a naive implementation. If I fetch the feature values of the samples by hand from the dataframe and then check whether they are contained in the tree node's positive set, it is faster than the predict method.
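
Roughly, the manual check I mean looks like this (an illustrative sketch with placeholder values, not the actual YDF internals):

import numpy as np

def manual_stump_predict(feature_values, positive_set, pos_value, neg_value):
    # One set-membership test per sample: return the positive leaf value if the
    # categorical value is in the node's positive set, the negative leaf otherwise.
    in_positive = np.isin(feature_values, list(positive_set))
    return np.where(in_positive, pos_value, neg_value)

# e.g., one stump applied to a dataframe column (names and values are made up):
# preds = manual_stump_predict(df["some_feature"].to_numpy(), {227, 195, 193}, -0.45, -1.16)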

rstz commented 2 months ago

Hi, apologies for the late reply. I will try to reproduce this - it very much sounds like a bug somewhere. If you can provide any more details about the hyperparameters you're using, I'd be happy to know them.

TonyCongqianWang commented 2 months ago

Hyperparameters: {'max_depth': 2, 'num_trees': 100, 'shrinkage': 0.8, 'subsample': 0.6, 'use_hessian_gain': 'true', 'categorical_algorithm': 'CART'}. More importantly, all features are set to categorical, features=[ydf.Feature(feat, ydf.Semantic.CATEGORICAL) for feat in train_ds.columns[1:]], with 256 possible values each. I would still expect a tree stump to be evaluated in less than a microsecond. Instead, it takes very long to predict just a couple thousand samples. If you want, I can send you a saved model.
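
For reference, my setup is roughly equivalent to the following sketch (assuming the first column of train_ds is the label; the column names are placeholders):

import ydf

hyperparameters = {
    "max_depth": 2,
    "num_trees": 100,
    "shrinkage": 0.8,
    "subsample": 0.6,
    "use_hessian_gain": True,
    "categorical_algorithm": "CART",
}

# Every column except the label is forced to the CATEGORICAL semantic.
features = [ydf.Feature(feat, ydf.Semantic.CATEGORICAL) for feat in train_ds.columns[1:]]

model = ydf.GradientBoostedTreesLearner(
    label=train_ds.columns[0],
    features=features,
    **hyperparameters,
).train(train_ds)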

TonyCongqianWang commented 2 months ago

One tree looks like this:


    "feature_4256" is in [BITMAP] {<OOD>, 227, 195, 193, 135, 131, 129, 243, 251, 199, ...[83 left]} [s:7967.18 n:79499 np:54350 miss:1] ; pred:-0.774946
        ├─(pos)─ pred:-0.450081
        └─(neg)─ pred:-1.16035

But even with just 13 trees, prediction can take quite long.

rstz commented 2 months ago

Hi,

I cannot reproduce this yet in Colab, so let's try to dig deeper here. I'm training on 10K rows with 5000 features and predicting on 10K rows, with 100 trees and no validation data / early stopping.

import ydf
import numpy as np

# Some configuration
NUM_TRAINING_ROWS = 10000
NUM_INFERENCE_ROWS = 10000
NUM_FEATURES = 5000
NUM_CATEGORIES = 256

train_ds = {f"feature_{i}": np.random.choice(np.arange(NUM_CATEGORIES), NUM_TRAINING_ROWS) for i in range(NUM_FEATURES)}
train_ds['label'] = np.random.randint(2, size=NUM_TRAINING_ROWS)
inference_ds = {f"feature_{i}": np.random.choice(np.arange(NUM_CATEGORIES), NUM_INFERENCE_ROWS) for i in range(NUM_FEATURES)}

model = ydf.GradientBoostedTreesLearner(label="label",
                                        features=[(f"feature_{i}",ydf.Semantic.CATEGORICAL) for i in range(NUM_FEATURES)],
                                        min_vocab_frequency=1,
                                        max_depth= 2,
                                        num_trees=100,
                                        shrinkage=0.8,
                                        subsample=0.6,
                                        use_hessian_gain=True,
                                        categorical_algorithm="CART",
                                        validation_ratio=0.0).train(train_ds)

Training the model takes about half a minute (note that the reported time is misleading because it excludes dataset loading):

Train model on 10000 examples
Model trained in 0:00:17.776869

and the forest consists of 100 stumps. Prediction time is

%timeit model.predict(inference_ds)
# 19.9 s ± 890 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The tree seems to look like yours (e.g. model.print_tree(5)):

'feature_1820' in ['<OOD>', '49', '166', '48', '223', '41', '142', '32', '210', '246', '107', '234', '247', '42', '99', '127', '149', '156', '255', '61', '75', '11', '115', '12', '128', '184', '228', '235', '241', '103', '106', '159', '179', '140', '153', '16', '163', '178', '215', '217', '243', '51', '79', '109', '132', '146', '15', '194', '218', '22', '248', '40', '5', '87', '10', '173', '201', '21', '225', '237', '43', '44', '47', '65', '100', '122', '139', '141', '150', '151', '176', '18', '187', '19', '213', '224', '229', '254', '29', '36', '58', '59', '63', '81', '129', '167', '192', '203', '208', '238', '26', '46', '85', '1', '110', '120', '134', '175', '195', '196', '197', '216', '231', '25', '35', '39', '7', '77', '84', '96', '123', '124', '131', '135', '145', '154', '155'] [score=22.527 missing=True]
    ├─(pos)─ value=-0.078142
    └─(neg)─ value=0.11653

A significant amount of time is spent converting the datasets into YDF's internal format. Note that inference_ds is ~400 MB, so this isn't amazing, but it is mostly OK IMO.

We can strip out this conversion by creating a VerticalDataset (YDF's internal format) separately:

%timeit inference_vds = ydf.create_vertical_dataset(inference_ds, columns=[(f"feature_{i}", ydf.Semantic.CATEGORICAL) for i in range(NUM_FEATURES)],  min_vocab_frequency=1)
# 20.2 s ± 655 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Prediction on the VerticalDataset takes less time as expected:

%timeit model.predict(inference_vds)
# 3.07 s ± 290 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Do you have an idea where my experiment diverges from your setup?

achoum commented 2 months ago

In the example you give, we can see that the features are treated as categorical features instead of numerical features.

For instance, "feature_4256" is in [BITMAP] {<OOD>, 227, 195, 193, ... is a condition on a categorical feature. You can verify this by looking at the "Dataspec" tab in model.describe().

In your case, if a column were treated numerically, each numerical split would have at most 255 possible conditions (e.g. value >= threshold). However, because the column is treated categorically, each split can choose among 2^255 - 1 ~= 5.7896045e+76 possible conditions :) While YDF is very efficient at scanning those conditions, it is still slower than for numerical features :)

Make sure to use ydf.Feature(feat, ydf.Semantic.NUMERICAL) instead of ydf.Feature(feat, ydf.Semantic.CATEGORICAL). Or, equivalently, don't set the feature semantic at all :)

Bonus: When training a model with discretize_numerical_columns=True (defaults to False), each numerical feature is discretized into num_discretized_numerical_bins bins (default 255). Discretized numerical features are different from categorical features (there are still only 255 possible conditions). If the number of unique values is less than or equal to num_discretized_numerical_bins (as in your case), training with discretize_numerical_columns=True produces the same results as training without discretization, but it is faster and consumes less memory (especially for large datasets).
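
For example, a sketch of the suggested setup, reusing the names from the repro above (parameter values are only illustrative, and I am assuming the same synthetic train_ds):

import ydf

# Treat the integer-encoded columns as NUMERICAL instead of CATEGORICAL.
model = ydf.GradientBoostedTreesLearner(
    label="label",
    features=[(f"feature_{i}", ydf.Semantic.NUMERICAL) for i in range(NUM_FEATURES)],
    max_depth=2,
    num_trees=100,
    # Optional: pre-discretize the numerical features (at most 255 bins each).
    discretize_numerical_columns=True,
    num_discretized_numerical_bins=255,
).train(train_ds)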

TonyCongqianWang commented 2 months ago

Unfortunately, I do need the features to be categorical. I can understand that the training takes a long time, and I have no problem with that. The problem (probably a bug) is that model.predict also takes a very long time when it should be trivial to evaluate these trees quickly! (They are the same kind of trees that the cv2 cascade classifier uses, and those are evaluated in a fraction of a microsecond.)

TonyCongqianWang commented 2 months ago

@rstz Thank you so much for your efforts! I think your results do match up with mine, but I would have expected the prediction to be much, much faster. After all, it is the same kind of model OpenCV uses for its cascade classifier in its object detection pyramid. One 200x200 image is evaluated in a few milliseconds, which involves about the same order of magnitude of evaluations as 10k samples. That is about 10,000 times faster than 20 s, which is why I considered this time "very long". I didn't know about ydf.create_vertical_dataset, so I will use it in the future, since I don't really process the dataframe during or after training. But even then it is quite slow.
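
For reference, here is roughly what I plan to do, based on your snippet above (a sketch, reusing your variable names):

# Convert the data to YDF's internal format once, then reuse it for predictions.
inference_vds = ydf.create_vertical_dataset(
    inference_ds,
    columns=[(f"feature_{i}", ydf.Semantic.CATEGORICAL) for i in range(NUM_FEATURES)],
    min_vocab_frequency=1,
)
predictions = model.predict(inference_vds)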