Closed frankwwu closed 3 years ago
Hi Frankwwu
Thanks for the post. Several things might be in play. Here are some possible explanations and solutions:
It might be interesting to increase it for large datasets.
Make sure the type of each feature is as expected.
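As a rough illustration of the type check above, here is a minimal sketch in plain Python. It assumes the data is available as a list of dicts; the helper name column_types and the toy rows are hypothetical and not part of TF-DF. The idea is that a numerical column accidentally loaded as strings is typically treated as categorical, which can change both model quality and memory usage.

```python
from collections import defaultdict


def column_types(rows):
    """Map each column name to the set of Python type names seen in it."""
    seen = defaultdict(set)
    for row in rows:
        for name, value in row.items():
            seen[name].add(type(value).__name__)
    return dict(seen)


# Hypothetical toy rows: "age" was read as text in the second row by mistake.
rows = [
    {"age": 39, "income": 52000.0, "city": "Paris"},
    {"age": "41", "income": 61000.0, "city": "Lyon"},
]

types = column_types(rows)
# A column holding more than one Python type deserves a closer look.
suspicious = sorted(name for name, kinds in types.items() if len(kinds) > 1)
print(types)
print("Columns with mixed types:", suspicious)
```

If the data lives in a Pandas dataframe instead, inspecting its dtypes before converting it to a TensorFlow dataset serves the same purpose.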
This index can be disabled as follows:
# Access to advanced hyper-parameters.
# See .proto sections in https://github.com/google/yggdrasil-decision-forests/blob/main/documentation/learners.md for more details.
import tensorflow_decision_forests as tfdf
from wurlitzer import sys_pipes
from yggdrasil_decision_forests.learner.gradient_boosted_trees import gradient_boosted_trees_pb2
from yggdrasil_decision_forests.learner.decision_tree import decision_tree_pb2
# Disable the pre-sorting of numerical features.
# The config extension must match the learner being trained (gradient boosted trees here).
yggdrasil_training_config = tfdf.keras.core.YggdrasilTrainingConfig()
advanced_gbt_config = yggdrasil_training_config.Extensions[gradient_boosted_trees_pb2.gradient_boosted_trees_config]
advanced_gbt_config.decision_tree.internal.sorting_strategy = decision_tree_pb2.DecisionTreeTrainingConfig.Internal.SortingStrategy.IN_NODE
advanced_arguments = tfdf.keras.AdvancedArguments(yggdrasil_training_config=yggdrasil_training_config)
model = tfdf.keras.GradientBoostedTreesModel(num_trees=300, advanced_arguments=advanced_arguments)
with sys_pipes():
    model.fit(train_ds)
Note: I am setting a TODO to make it easier to disable the index construction.
Make sure the task argument of the model constructor is set appropriately.
In the meantime, a possible workaround is to train an ensemble of models where each model is trained on a small subset of the dataset.
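The ensemble workaround can be sketched as follows. This is only an illustration under stated assumptions: train_fn is a hypothetical callable standing in for whatever builds and fits a real model (for example a TF-DF model) on one subset, and the toy "model" below simply predicts the mean label of its subset. Predictions are averaged, which suits regression; for classification one would average predicted probabilities instead.

```python
import random
from statistics import mean


def split_into_subsets(examples, num_subsets, seed=0):
    """Shuffle the examples and deal them into num_subsets disjoint subsets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::num_subsets] for i in range(num_subsets)]


def train_ensemble(examples, num_subsets, train_fn):
    """Train one model per subset; train_fn only ever sees a single subset."""
    return [train_fn(subset) for subset in split_into_subsets(examples, num_subsets)]


def predict_ensemble(models, x):
    """Average the predictions of the individual models."""
    return mean(model(x) for model in models)


# Hypothetical stand-in for a real learner: predicts the mean label of its subset.
def train_mean_model(subset):
    label_mean = mean(label for _, label in subset)
    return lambda x: label_mean


examples = [(float(i), 2.0 * i) for i in range(100)]  # (feature, label) pairs
models = train_ensemble(examples, num_subsets=4, train_fn=train_mean_model)
print(len(models), predict_ensemble(models, 10.0))
```

To actually save memory, each subset would need to be loaded and trained on its own rather than sliced from a list that already sits fully in RAM, as it does in this toy example.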
TF-DF 0.1.6 introduces the parameter sorting_strategy to easily disable the creation of the index (and significantly reduce memory consumption).
The following code is equivalent to the code snippet given above with AdvancedArguments.
model = tfdf.keras.GradientBoostedTreesModel(num_trees=300, sorting_strategy="IN_NODE")
TensorFlow Decision Forests appears to be memory hungry. I compared it with PyCaret on Colab. TensorFlow Decision Forests crashed with the message “Your session crashed after using all available RAM.”, while PyCaret completed the work. Is there any feasible way to solve this problem?