Hi sibyjackgrove.
TF-DF does not yet support tf.distribute strategies. This is because the currently implemented decision forest algorithms are not distributed algorithms and require the entire dataset in memory on a single machine. If you use a multi-worker setup, the current algorithms will likely either crash or use only one of the machines; this is undefined behavior.
However, a distributed gradient boosted tree algorithm is in the works and will hopefully be available later this year. I have relabeled this issue, and we will update it when the appropriate release is pushed out.
Thanks! Arvind
Distributed training was published in the TF-DF 0.2.0 release. See the distributed training documentation for more details.
Note that the code is still experimental (the documentation is still being written), and that tf.distribute.experimental.ParameterServerStrategy is only compatible with a monolithic TF+TF-DF build. In other words, ParameterServerStrategy is not yet compatible with the PyPI TF-DF package. In the meantime, TF-DF distributed training is possible with the Yggdrasil Decision Forests GRPC Distribution Strategy.
This bug is left open and will be closed when ParameterServerStrategy is fully supported.
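For reference, here is a minimal sketch of the ParameterServerStrategy setup on the TensorFlow side. The host names, ports, and task assignment are placeholders, not values from this thread; the cluster is described via TF_CONFIG (any other cluster resolver works as well).

```python
import json
import os

import tensorflow as tf

# Hypothetical cluster layout: one chief, two workers, one parameter server.
# Replace the addresses with the machines of your actual cluster.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["chief.example.com:2222"],
        "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
        "ps": ["ps0.example.com:2222"],
    },
    "task": {"type": "chief", "index": 0},
})

# ParameterServerStrategy reads the cluster definition from TF_CONFIG.
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)
```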
Mathieu
Since this bug was opened, distributed training has been rewritten and is now stable. See https://github.com/tensorflow/decision-forests/blob/main/examples/distributed_training.py for an example. If there are still feature requests or issues with TF-DF distributed training, please open a new issue.
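For context, a minimal sketch of the stable API, loosely following the linked example. The dataset path, label column, and cluster definition below are placeholder assumptions, not part of this thread.

```python
import tensorflow as tf
import tensorflow_decision_forests as tfdf

# The cluster is assumed to be described by TF_CONFIG (see the sketch above).
cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()
strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver)

with strategy.scope():
  # DistributedGradientBoostedTreesModel is the TF-DF model that supports
  # distributed training.
  model = tfdf.keras.DistributedGradientBoostedTreesModel()

# Distributed training reads the dataset directly from files (split into many
# shards so the workers can share the reading) instead of a tf.data pipeline.
model.fit_on_dataset_path(
    train_path="/path/to/train_shards/*.csv",  # hypothetical sharded dataset
    label_key="my_label",                      # hypothetical label column
    dataset_format="csv")

print(model.summary())
```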
Which tf.distribute strategy would be most suitable to use with tfdf if we were to use it on multiple nodes of an HPC cluster?