google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

Enable Multi-Label Prediction in Python API #120

Open AlexanderLavelle opened 3 months ago

AlexanderLavelle commented 3 months ago

I am curious if YDF can support multilabel outcomes (not mutually exclusive outcomes, shared tree space)?

This follows a comment in #118 and the response from @rstz.

Since XGBoost offers this feature, it would be interesting to see here as well.

achoum commented 3 months ago

Thanks for the suggestion. This would be a valuable feature.

It's worth noting that YDF can emulate the "classical" multi-label approach with trees, where you train an independent model for each label. While this requires manually iterating over the labels, it's essentially the same method used by SKLearn, by XGBoost (by default), and by TensorFlow Decision Forests.

On the plus side, doing this manually gives considerable flexibility. For example, you can train each label on different subsets of data (useful when some labels have missing values), or even use one label's prediction as an input feature for another (e.g., for self-supervised learning).

In the end, to aggregate these models into a single "block", a user can use TensorFlow, JAX, or define a custom aggregation function and pickle it (available since YDF 0.7).
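For illustration, here is a minimal sketch of the per-label loop and a simple aggregation step. It is not YDF code: `MeanLearner`, `train_per_label`, and `predict_multilabel` are hypothetical names, and the toy learner only mimics the `learner.train(ds)` / `model.predict(ds)` shape so the example runs without dependencies.

```python
from typing import Callable, Dict, List, Optional, Sequence

Dataset = Dict[str, Sequence]  # column name -> column values


class MeanLearner:
    """Hypothetical stand-in learner: predicts the label's positive rate."""

    def __init__(self, label: str, features: List[str]):
        self.label = label
        self.features = features  # ignored by this toy learner
        self.rate: Optional[float] = None

    def train(self, ds: Dataset) -> "MeanLearner":
        # Each label can use its own subset of rows: skip missing labels.
        values = [v for v in ds[self.label] if v is not None]
        self.rate = sum(values) / len(values)
        return self

    def predict(self, ds: Dataset) -> List[float]:
        num_rows = len(next(iter(ds.values())))
        return [self.rate] * num_rows


def train_per_label(
    ds: Dataset,
    features: List[str],
    labels: List[str],
    make_learner: Callable[[str, List[str]], MeanLearner],
) -> Dict[str, MeanLearner]:
    """Trains one independent model per label column."""
    return {label: make_learner(label, features).train(ds) for label in labels}


def predict_multilabel(models: Dict[str, MeanLearner], ds: Dataset) -> List[List[float]]:
    """Aggregates per-label predictions into one row of scores per example."""
    per_label = [models[label].predict(ds) for label in models]
    return [list(row) for row in zip(*per_label)]


data = {
    "x": [0.1, 0.4, 0.8, 0.9],
    "is_red": [1, 0, 1, 1],
    "is_round": [0, 0, 1, None],  # label missing for the last row
}
models = train_per_label(data, ["x"], ["is_red", "is_round"], MeanLearner)
scores = predict_multilabel(models, data)  # 4 rows, one score per label
```

With YDF, `make_learner` could plausibly be something like `lambda label, features: ydf.GradientBoostedTreesLearner(label=label, features=features)`, assuming you filter out rows with a missing label yourself before training. Passing an explicit feature list keeps the other label columns from leaking in as inputs.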

Vector-leaf trees are a different approach that is not currently available in YDF. Here, each tree makes a prediction for all the labels at the same time. This is a feature we have looked at, but we have not yet found a justification for deploying it (tagging this issue as a "feature request"). For example, we suspect this might be a way to train compact models that can output dense embeddings.