Is the number of train samples from tree nodes correct?

tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.

Apache License 2.0


Is the number of train samples from tree nodes correct? #105

Closed tlapusan closed 2 years ago

tlapusan commented 2 years ago

Hi,

I'm trying to integrate TF-DF into the dtreeviz library for tree visualisations.

While working on this, I noticed that the number of samples shown in the default visualisation (plot_model) does not match my own calculations.

This is the tree plot:

I then checked the number of samples in the child nodes of the root node.

You can see that the number of samples is different. For example, the left child node should contain 623 samples instead of 641. Is this because of some bagging sampling strategy?

Here is how the tree structure will look in dtreeviz, though we still have to fix a few small issues before it is ready for release.

Cheril311 commented 2 years ago

@tlapusan can you please tell us which hyperparameters and which tree-based model you used?

tlapusan commented 2 years ago

Indeed, I should have mentioned them at the beginning :) I'm using macOS (Monterey) and Python 3.8.2.

achoum commented 2 years ago

Hi @tlapusan,

The number of examples printed in the plot is the number of examples that reached a particular node during training. This is effectively the number of training examples for the tree. The training examples for a tree can be different from the training examples for a model.

For example, by default, when training a Random Forest, the training examples of each tree are bagged (sampled with replacement) from the original training dataset. The number of examples stays the same, but the examples themselves differ: some appear several times and others not at all.
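To illustrate why bagging changes per-node counts, here is a minimal sketch in plain Python. It does not use TF-DF. The toy dataset, the `threshold` split and the `bootstrap_sample` helper are all hypothetical and only mimic sampling with replacement. The bagged sample has the same size as the original dataset, but the number of examples falling on one side of a split will usually differ.

```python
import random


def bootstrap_sample(examples, seed):
    """Draw len(examples) items with replacement (hypothetical helper)."""
    rng = random.Random(seed)
    return [rng.choice(examples) for _ in examples]


# Hypothetical toy dataset: one numeric feature value per example.
examples = list(range(1000))
threshold = 600  # hypothetical split condition: feature < threshold

bagged = bootstrap_sample(examples, seed=0)

# Number of examples routed to the "left" child under the split.
full_left = sum(1 for x in examples if x < threshold)
bagged_left = sum(1 for x in bagged if x < threshold)

print("dataset size:", len(examples), "bagged size:", len(bagged))
print("left child count, full dataset:", full_left)
print("left child count, bagged sample:", bagged_left)
print("distinct examples in bagged sample:", len(set(bagged)))
```

Counts computed on the full dataset (as in the question) and counts stored in a tree trained on a bagged sample can therefore disagree without either being wrong.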

For Gradient Boosted Trees and CART, by default, some of the training examples are used for validation (and are therefore effectively removed from the training dataset), unless you provide a validation dataset in fit or disable early stopping.

In your example, you can try replacing your model definition by:

import tensorflow_decision_forests as tfdf
model = tfdf.keras.CartModel(validation_ratio=0.0)
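As a rough illustration of the effect of the validation ratio, the sketch below holds out a fraction of a toy dataset in plain Python. It is not TF-DF's actual splitting code: the `split_train_validation` helper, the per-example random draw and the ratio of 0.1 are assumptions made for the example. With a non-zero ratio, fewer examples reach the root than the dataset contains. With a ratio of 0.0, every example is used for training.

```python
import random


def split_train_validation(examples, validation_ratio, seed=0):
    """Hold out roughly `validation_ratio` of the examples (hypothetical helper)."""
    rng = random.Random(seed)
    train, valid = [], []
    for ex in examples:
        # Each example goes to validation with probability validation_ratio.
        if rng.random() < validation_ratio:
            valid.append(ex)
        else:
            train.append(ex)
    return train, valid


examples = list(range(1000))  # hypothetical toy dataset

# Assumed illustrative ratio: part of the data is held out.
train_10, valid_10 = split_train_validation(examples, validation_ratio=0.1)

# validation_ratio=0.0, as in the suggestion above: nothing is held out.
train_0, valid_0 = split_train_validation(examples, validation_ratio=0.0)

print("ratio 0.1 -> train:", len(train_10), "validation:", len(valid_10))
print("ratio 0.0 -> train:", len(train_0), "validation:", len(valid_0))
```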

PS: Great news about the dtreeviz integration. Please keep us posted!

tlapusan commented 2 years ago

Hi @achoum,

Thanks for your answer, it makes total sense. I also suspected that bagging could be the cause. Thanks for confirming this.

Sure, we will keep you up to date with dtreeviz integration :)

tlapusan commented 2 years ago

@achoum we are almost done with the dtreeviz integration for tfdf.keras.RandomForestModel. You can take a look here: https://github.com/tlapusan/dtreeviz/blob/support_for_TF-DF_trees_%23181/notebooks/dtreeviz_tensorflow_visualisations.ipynb

We have a small issue with categorical features, because TF-DF converts them internally to integers. You can find more details in #111.

Do you know how to find out which integer values TF-DF assigns to the values of a categorical feature? I need this information to make the dtreeviz methods work internally. Thanks!