tlapusan closed this issue 1 year ago
Hi, this doesn't look correct to me. Do you have a repro for this?
Hi @rstz,
You can check this colab notebook: https://colab.research.google.com/drive/1XvsafToHzDQVR5BOOKEVCxsny_P1pRBU?usp=sharing
Thank you, I'll have a look
Minor update: This looks like a bug in the way integerized values are treated by TF-DF. In your example Sex_label == 1 corresponds to "male" in the dataset, but it corresponds to "female" when inspecting (and drawing) the graph. We are working on a fix.
Small aside: TF-DF supports string categories, so there is no need to convert strings to integers.
Good to know about string categories. I had the data preprocessing step from other model libraries and used it as is :)
@rstz @tlapusan Is there any disadvantage to using TF-DF string categories compared to converting the string categories into numbers with target encoding or one-hot encoding? If so, when is target encoding or one-hot encoding useful with decision trees, and does it have any benefits? What happens if I have two similar websites, e.g. www.cnn.com and cnn.com: would TF-DF map them to the same category? What happens if the number of categories is high?
How does TF-DF support string categories internally? How do Yggdrasil random forests handle a large number of high-cardinality categorical variables?
Is there any disadvantage to using TF-DF string categories compared to converting the string categories into numbers with target encoding or one-hot encoding?
Generally, using one-hot encoding with decision forests makes the model larger and less accurate than the other options. For this reason, it is not recommended.
From experience, and depending on the dataset, target encoding is complementary to CART or RANDOM categorical splits.
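For readers unfamiliar with target encoding, here is a minimal sketch of the idea in plain Python. It is not TF-DF code: the function name `fit_target_encoding` and the toy data are made up for illustration, and it uses a plain per-category mean without the smoothing or cross-fitting a real pipeline would likely need.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets):
    """Map each category to the mean target observed for it (no smoothing)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}


# Hypothetical toy data: a site name and a binary label.
sites = ["cnn.com", "www.cnn.com", "cnn.com", "example.org", "example.org"]
clicked = [1, 1, 0, 0, 0]

encoding = fit_target_encoding(sites, clicked)
# Numeric column that could be given to the model next to the raw string column.
encoded_sites = [encoding[site] for site in sites]
```

Since the encoded column is numeric, it can sit alongside the original string feature rather than replace it, which is one way to read "complementary" above.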
What happens if I have two similar websites, e.g. www.cnn.com and cnn.com: would TF-DF string categories map them to the same value?
Both the CART and RANDOM splitters learn conditions of the type "attribute in mask". If this is supported by the dataset, they can learn splits such as "site in [www.cnn.com, cnn.com]". On the other hand, one-hot encoding can only check one categorical value at a time.
What happens if the number of categories is high?
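To make the difference concrete, here is a small plain-Python sketch of the two condition types. It only illustrates the logic described above and is not how TF-DF evaluates conditions internally; the helper names `in_mask_condition` and `one_hot_condition` are hypothetical.

```python
def in_mask_condition(value, mask):
    """Condition of the type "attribute in mask": true if value is in the set."""
    return value in mask


def one_hot_condition(value, category):
    """One-hot style condition: tests a single categorical value."""
    return value == category


sites = ["www.cnn.com", "cnn.com", "example.org"]

# A single "in mask" condition can group both spellings of the site.
mask = frozenset({"www.cnn.com", "cnn.com"})
mask_result = [in_mask_condition(site, mask) for site in sites]

# With one-hot style conditions, the same grouping needs two separate tests,
# which in a tree means two nodes instead of one.
one_hot_result = [
    one_hot_condition(site, "www.cnn.com") or one_hot_condition(site, "cnn.com")
    for site in sites
]
```

Note that the two spellings are still distinct categories. They end up on the same branch only if the training data supports grouping them.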
There is a risk of overfitting. In this case, RANDOM categorical splits, target propagation, regularization or more advanced techniques are required.
How does TF-DF support string categories internally?
It depends on the semantics of the attribute. Here is the list of the supported semantics. For a categorical attribute, as mentioned above, three splitter algorithms are available: CART, RANDOM and ONE_HOT. For a categorical set attribute (e.g. a bag of words), another algorithm is used.
Hi @rstz, do you know when this issue will be fixed?
@rstz Could you tell me how to find the integer values that TF-DF assigns to a categorical feature? I'm working on integrating TF-DF into the https://github.com/parrt/dtreeviz library for visualisations, and I need the integer values.
Thanks.
Hi, apologies for not responding, I missed that email.
We don't have a fix yet, but this is very high on our Todo list.
sounds good @rstz , thanks
I believe this is solved since dtreeviz support has now landed :)
@rstz there is dtreeviz now?
https://www.tensorflow.org/decision_forests/tutorials/dtreeviz_colab is a tutorial for using dtreeviz with TF-DF
Hi,
I would like to ask about the path taken at a decision node, depending on whether the node has a categorical or a numerical threshold.
What I observed is that when a node is categorical and the threshold condition is met, the path goes to the left.
If the threshold is numerical and the condition is met, the path goes to the right.
Is this behavior intended?