tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.
Apache License 2.0

How do tf decision forests handle categorical variables? #87

Closed Cheril311 closed 2 years ago

Cheril311 commented 2 years ago

This is more of a question than an issue. I see that TensorFlow Decision Forests can handle categorical features on its own, but I could not find the encoding strategy it uses to consume them. Can anyone help me understand this?

achoum commented 2 years ago

Hi,

The categorical_algorithm argument controls the consumption of categorical features (for example, see the Random Forest API or the hyper-parameters page).

Quoting:

categorical_algorithm:

How to learn splits on categorical attributes.

  1. CART (default): CART algorithm. Find categorical splits of the form "value \in mask". The solution is exact for binary classification, regression and ranking. It is approximated for multi-class classification. This is a good first algorithm to use. In case of overfitting (very small dataset, large dictionary), the "random" algorithm is a good alternative.

An example of such condition: attribute_1 in ["cat", "lion", "tiger"]

  2. ONE_HOT: One-hot encoding. Find the optimal categorical split of the form "attribute == param". This method is similar to (but more efficient than) converting each possible categorical value into a boolean feature. It is available for comparison purposes and generally performs worse than the other alternatives.

An example of such condition: attribute_1 == "cat"

  3. RANDOM: Best split among a set of random candidates. Find a categorical split of the form "value \in mask" using random search. This solution can be seen as an approximation of the CART algorithm and is a strong alternative to it. This algorithm is inspired by section "5.1 Categorical Variables" of "Random Forest", 2001. Default: "CART".

An example of such condition: attribute_1 in ["cat", "lion", "tiger"]

A fourth algorithm is available for categorical-set features (i.e., features whose values are sets of categorical items). You can learn more with the categorical_set* arguments.

An example of such condition: attribute_2 intersect ["cat", "lion", "tiger"]
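For completeness, here is a minimal sketch of how this looks in practice. Only the categorical_algorithm argument comes from the discussion above; the toy dataset and its columns are hypothetical:

```python
# Minimal sketch: categorical string features are consumed natively by
# TF-DF, so no manual one-hot or label encoding is needed.
import pandas as pd
import tensorflow_decision_forests as tfdf

# Hypothetical toy dataset with a categorical feature "animal".
df = pd.DataFrame({
    "animal": ["cat", "lion", "tiger", "dog", "cat", "dog"],
    "weight": [4.0, 190.0, 220.0, 30.0, 5.0, 25.0],
    "label": [0, 1, 1, 0, 0, 0],
})
ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label="label")

# Select the split-search strategy for categorical attributes:
# "CART" (default), "ONE_HOT", or "RANDOM".
model = tfdf.keras.RandomForestModel(categorical_algorithm="RANDOM")
model.fit(ds)
```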

Cheril311 commented 2 years ago

Thanks a lot @achoum, sorry for the late reply