tensorflow / decision-forests

A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras.
Apache License 2.0

Support tf.distribute strategies in TF-DF #44

sibyjackgrove closed this issue 1 year ago

sibyjackgrove commented 3 years ago

Which tf.distribute strategy would be most suitable to use with TF-DF if we were to run it on multiple nodes of an HPC cluster?

arvnds commented 3 years ago

Hi sibyjackgrove.

TF-DF does not yet support tf.distribute strategies. This is because the currently implemented decision forest algorithms are not distributed algorithms and require the entire dataset in memory on a single machine. If you use a multi-worker setup, the current algorithms will likely either crash or use only one of the machines; this is undefined behavior.
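For context, here is a minimal sketch of the single-machine workflow that is supported today (the file path and label column name are placeholders):

```python
import pandas as pd
import tensorflow_decision_forests as tfdf

# The entire training dataset is loaded into the memory of this one machine.
train_df = pd.read_csv("train.csv")  # hypothetical path
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_df, label="label")

# Trains on a single machine; no tf.distribute strategy is involved.
model = tfdf.keras.GradientBoostedTreesModel()
model.fit(train_ds)
```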

However, a distributed gradient boosted tree algorithm is in the works and will hopefully be available later this year. I have relabeled this issue, and we will update it when the appropriate release is pushed out.

Thanks! Arvind

achoum commented 3 years ago

Distributed training was published in the TF-DF 0.2.0 release. See the distributed training documentation for more details.

Note that the code is still experimental (the documentation is still being written), and that tf.distribute.experimental.ParameterServerStrategy is only compatible with a TF+TF-DF monolithic build. In other words, ParameterServerStrategy is not yet compatible with the PyPI TF-DF package. In the meantime, TF-DF distributed training is possible with the Yggdrasil Decision Forests GRPC Distribution Strategy.
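For illustration, a rough sketch of the ParameterServerStrategy setup (as noted above, this requires a monolithic TF+TF-DF build at the time of writing; the cluster configuration is assumed to be provided via TF_CONFIG):

```python
import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Assumes a parameter-server cluster is already described by the
# TF_CONFIG environment variable on each job.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    # Distributed counterpart of GradientBoostedTreesModel, added in TF-DF 0.2.0.
    model = tfdf.keras.DistributedGradientBoostedTreesModel()
```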

This bug is left open and will be closed when ParameterServerStrategy is fully supported.

Mathieu

rstz commented 1 year ago

Since this bug was opened, distributed training has been rewritten and is now stable. See https://github.com/tensorflow/decision-forests/blob/main/examples/distributed_training.py for an example. If there are still feature requests or issues with TF-DF distributed training, please open a new issue.
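For convenience, a condensed sketch in the spirit of that example (the dataset path, label column, and dataset format are assumptions; see the linked file for the full, authoritative version):

```python
import tensorflow as tf
import tensorflow_decision_forests as tfdf

# Cluster configuration is assumed to come from TF_CONFIG on each job.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
    model = tfdf.keras.DistributedGradientBoostedTreesModel()

# Workers read their shards of the dataset directly from disk, so the full
# dataset never has to fit in the memory of a single machine.
model.fit_on_dataset_path(
    train_path="/path/to/train-*.csv",  # hypothetical sharded dataset path
    label_key="label",                  # hypothetical label column
    dataset_format="csv",
)
print(model.summary())
```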