AlexanderLavelle opened 1 month ago
This doesn't look right, I don't see how this would be triggered for this data spec. Can you please provide a bit more information about your dataset and hyperparameters?
This is just an ordinary Kaggle tabular dataset. Perhaps it's going OOM on predict?
```python
model = (
    ydf
    .GradientBoostedTreesLearner(
        label="Response",
        weights="weights",
        max_depth=100,
        num_trees=1_000,
    )
    .train(train)
)
```
RAM is nearly full, but swap still has > 60 GB free.
With max_depth lowered to 25, I was able to get a prediction. I'm still curious why a max_depth of 100 is so negatively impactful?
I believe the issue is that the trees are very, very deep (especially for GBTs). For inference, ydf transforms the model to use a buffer that contains all the categorical splits (i.e. splits of the form `featureA in [featureAvalue1, featureAvalue4, ...]`). This buffer can have at most std::numeric_limits
In C++, YDF has support for an inference engine that does not have this limitation (probably; I haven't tried it). However, this engine is much slower than the one we expose in PYDF. It sounds like exposing the slow engine might be a useful solution for some less common models such as the one you built - I'll try to prioritize this.
@rstz Based on your comment, I am curious how many people would also be looking to build trees this deep. I have a different use case with a vocabulary of up to 1M tokenized features in the feature set.
I suppose the BQML frontend allows only 50k. Even at 50k features, what sort of depth and width would you expect to be required? The data shape would be roughly 20-30M rows, 50k columns, and a binary outcome.
It might be suitable to just emit a warning when max_depth > 50 or something, instead of opening a feature request?
On the note of feature requests, I am curious whether YDF can support multi-label outcomes (not mutually exclusive outcomes, sharing tree space)?
As always in ML, the answer depends on your data, but I'll give some more-or-less educated guesses.
The theory of boosting suggests using mostly small trees (see e.g. [Intro to Statistical Learning Chapter 8.2](https://hastie.su.domains/ISLP/ISLP_website.pdf.download.html)) to avoid overfitting with individual trees. In practice, we've seen YDF's [default of 6](https://ydf.readthedocs.io/en/latest/hyperparameters/#max_depth) or values in the range 2-10 perform well. Small trees have the added advantage that the model is much smaller and inference can be much faster. I'd be interested how model quality changes with max_depth in your use case.
As an aside, note that we have seen GBTs perform better when ignoring the max_depth parameter altogether: instead, set `growing_strategy=BEST_FIRST_GLOBAL` and tune the `max_num_nodes` hyperparameter to control the size of the trees.
Having 50k or more features often happens when using one-hot encoding on categorical features. One-hot encoding is not recommended when using decision forests. Instead, categorical features should be fed directly. This allows the tree to perform splits on multiple categories at once (e.g. `if featureA in [val1, val4, val9], go left; otherwise, go right`), which improves prediction quality in nearly all cases. When feeding text, consider using categorical set features [1] or pretrained embeddings.
Re: multi-label outcomes - can you please open a separate issue for this? I think Yggdrasil might have a solution for this, but it's probably not yet exposed in the Python API.
[1] Categorical Sets are unfortunately broken in the Python API until our pending fix to #113 has landed. The fix and a tutorial will be included in the next release.
> The theory of boosting suggests using mostly small trees (see e.g. [Intro to Statistical Learning Chapter 8.2](https://hastie.su.domains/ISLP/ISLP_website.pdf.download.html)) to avoid overfitting with individual trees. In practice, we've seen YDF's [default of 6](https://ydf.readthedocs.io/en/latest/hyperparameters/#max_depth) or values in the range 2-10 perform well. Small trees have the added advantage that the model is much smaller and inference can be much faster. I'd be interested how model quality changes with max_depth in your use case.
Yep, this sounds about right. There is definitely something to be said for the statistics, but I think this library is one of the first to elegantly enable such wide (particularly highly sparse) datasets. In that sense, I am very curious going forward about how tens to hundreds of thousands of features affect desirable tree depths.
Ohhh sure growing strategy definitely makes the MOST sense.
Yes, I have kept features relatively sparse by avoiding OHE. It's unfortunate that categorical sets aren't working in the Python API -- this is an incredibly important feature of ydf!
Categorical Sets are now working in 0.7.0 - we plan to publish guides in the near future.
When I try to evaluate my model or make predictions on the val set, I get the following error:
Here is the DataSpec from .describe()
There are probably a lot of categories, but I would have thought that with categorical sets it would be fine?