Decision path through node thresholds

tlapusan commented 2 years ago

Hi,

I would like to ask you about the decision node path in case the node has a categorical or numerical threshold.

What I observed is that when I have a categorical node and the threshold condition is met, the path goes to the left.

If the threshold is numerical and the condition is met, the path goes to the right.

Is this behavior intended?

rstz commented 2 years ago

Hi, this doesn't look correct to me. Do you have a repro for this?

tlapusan commented 2 years ago

Hi @rstz,

You can check this colab notebook: https://colab.research.google.com/drive/1XvsafToHzDQVR5BOOKEVCxsny_P1pRBU?usp=sharing

rstz commented 2 years ago

Thank you, I'll have a look

rstz commented 2 years ago

Minor update: This looks like a bug in the way integerized values are treated by TF-DF. In your example Sex_label == 1 corresponds to "male" in the dataset, but it corresponds to "female" when inspecting (and drawing) the graph. We are working on a fix.

Small aside: TF-DF supports string categories, so there is no need to convert strings to integers.

tlapusan commented 2 years ago

Good to know about string categories. I reused the data preprocessing step from other model libraries as is :)

Arnold1 commented 2 years ago

@rstz @tlapusan Is there any disadvantage to using TF-DF string categories compared to converting the string categories into numbers using Target Encoding or One-Hot Encoding? When, if ever, is Target Encoding or One-Hot Encoding useful with decision trees, and does it have any benefits? What happens if I have two similar websites, www.cnn.com and cnn.com: would TF-DF map them to the same string category? What happens if the number of categories is high?

How does TF-DF support string categories internally? How does Yggdrasil Random Forests handle a large number of high-cardinality categorical variables?

achoum commented 2 years ago

Is there any disadvantage to use TF-DF string categories compared to converting the string categories into an integer using Target Encoding or One-Hot Encoding?

Generally, using one-hot encoding with decision forests makes the model larger and less accurate than the other options. For this reason, it is not recommended.

From experience, and depending on the dataset, target encoding is complementary to CART or RANDOM categorical splits.
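For readers unfamiliar with target encoding, here is a minimal pure-Python sketch of the idea: each category is replaced by the mean target value observed for it. The function name `target_encode`, the `smoothing` parameter, and the toy data are illustrative assumptions, not part of TF-DF.

```python
from collections import defaultdict


def target_encode(categories, targets, smoothing=0.0):
    """Map each category to the mean of its targets.

    `smoothing` (hypothetical parameter) pulls the per-category mean
    toward the global mean, which helps for rarely seen categories.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }


# Toy data, invented for illustration only.
sites = ["cnn.com", "www.cnn.com", "cnn.com", "example.org", "example.org"]
clicked = [1, 1, 0, 0, 0]

encoding = target_encode(sites, clicked)
smoothed = target_encode(sites, clicked, smoothing=1.0)
```

The encoding should be computed on the training split only; otherwise the target leaks into the feature.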

what happens if I have two similar websites one has www.cnn.com and cnn.com - would TF-DF string category map to the same?

Both the CART and RANDOM splitters learn conditions of the type "attribute in mask". If the dataset supports it, they can learn splits such as "site in [www.cnn.com, cnn.com]". On the other hand, one-hot encoding can only check one categorical value at a time.
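The difference between the two condition shapes can be sketched in plain Python. This is only an illustration of the logic, not TF-DF code; the helper names and the example visits are assumptions.

```python
def in_mask_condition(value, mask):
    """Condition of the form "attribute in mask": true for any value in the set."""
    return value in mask


def one_hot_condition(value, category):
    """Condition on a one-hot column: true for exactly one category."""
    return value == category


visits = ["www.cnn.com", "cnn.com", "example.org"]

# A single "in mask" node can send both spellings of the site the same way.
mask = frozenset({"www.cnn.com", "cnn.com"})
in_mask = [in_mask_condition(v, mask) for v in visits]

# A one-hot node tests one value, so "www.cnn.com" is not matched here
# and a second node would be needed to group it with "cnn.com".
one_hot = [one_hot_condition(v, "cnn.com") for v in visits]
```

Note that the two strings are still distinct categories; they are only grouped if the training data supports putting them on the same side of a split.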

what happens if the number of categories are high?

There is a risk of overfitting. In this case, RANDOM categorical splits, target propagation, regularization or more advanced techniques are required.
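To give an intuition for what a random categorical split search could look like, here is a rough sketch: sample random masks over the vocabulary and keep the one with the best impurity reduction. This is an assumption-laden illustration (Gini impurity, binary labels, a 50% inclusion probability, hypothetical function names), not the actual implementation of the RANDOM splitter.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_score(values, labels, mask):
    """Impurity reduction obtained by the condition "value in mask"."""
    pos = [y for v, y in zip(values, labels) if v in mask]
    neg = [y for v, y in zip(values, labels) if v not in mask]
    n = len(labels)
    weighted = len(pos) / n * gini(pos) + len(neg) / n * gini(neg)
    return gini(labels) - weighted


def random_mask_search(values, labels, num_trials=200, seed=0):
    """Try `num_trials` random masks and return the best (mask, score)."""
    rng = random.Random(seed)
    vocab = sorted(set(values))
    best_mask, best_score = frozenset(), 0.0
    for _ in range(num_trials):
        # Each category is included in the candidate mask with probability 0.5.
        mask = frozenset(v for v in vocab if rng.random() < 0.5)
        score = split_score(values, labels, mask)
        if score > best_score:
            best_mask, best_score = mask, score
    return best_mask, best_score


# Toy data: categories "a" and "b" are positive, "c" and "d" are negative.
values = ["a", "b", "c", "d"] * 5
labels = [1, 1, 0, 0] * 5
best_mask, best_score = random_mask_search(values, labels)
```

Because only a limited number of candidate masks is evaluated, a search like this is less able to fit noise in a very large vocabulary than an exhaustive one.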

how does TF-DF support string categories internally?

It depends on the semantics of the attribute. Here is the list of the supported semantics. For a categorical attribute, as mentioned above, three splitter algorithms are available: CART, RANDOM and ONE_HOT. For a categorical set attribute (e.g. a bag of words), another algorithm is used.

tlapusan commented 2 years ago

Hi @rstz, do you know when this issue will be fixed?

tlapusan commented 2 years ago

@rstz Could you guide me on how to find the integer values that TF-DF associates with a categorical feature? I'm working on integrating TF-DF into the https://github.com/parrt/dtreeviz library for visualisations, and I need the integer values.

Thanks.

rstz commented 2 years ago

Hi, apologies for not responding, I missed that email.

We don't have a fix yet, but this is very high on our Todo list.

tlapusan commented 2 years ago

sounds good @rstz , thanks

rstz commented 1 year ago

I believe this is solved since dtreeviz support has now landed :)

Arnold1 commented 1 year ago

@rstz there is dtreeviz now?

rstz commented 1 year ago

https://www.tensorflow.org/decision_forests/tutorials/dtreeviz_colab is a tutorial for using dtreeviz with TF-DF

tensorflow / decision-forests

Decision path through node thresholds #111