Closed mr-francois closed 11 months ago
Hi @mr-francois.
It looks like there is a deadlock when `tf$distribute$MirroredStrategy`
is used with an R generator to generate the training data batches. Fixing this will require some investigation.
Note however that a user-defined generator will generally be the bottleneck in the training pipeline. If you can define your training pipeline using {tfdatasets}, you'll see much greater performance.
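As a sketch of that suggestion, the pipeline could be expressed with {tfdatasets} so that batching and prefetching happen inside TensorFlow rather than in an R generator. The data, model, and hyperparameters below are placeholders, not taken from the issue:

```r
library(tensorflow)
library(keras)
library(tfdatasets)

# Toy data standing in for whatever the generator was producing (assumption)
x_train <- array(runif(1000 * 10), dim = c(1000, 10))
y_train <- sample(0:1, 1000, replace = TRUE)

# Input pipeline defined with {tfdatasets}: shuffling, batching, and
# prefetching run in TensorFlow's graph, off the R main thread.
ds <- tensor_slices_dataset(list(x_train, y_train)) %>%
  dataset_shuffle(buffer_size = 1000) %>%
  dataset_batch(32) %>%
  dataset_prefetch()

# Create the model inside the strategy scope so its variables are mirrored.
strategy <- tf$distribute$MirroredStrategy()
with(strategy$scope(), {
  model <- keras_model_sequential() %>%
    layer_dense(16, activation = "relu", input_shape = 10) %>%
    layer_dense(1, activation = "sigmoid")
  model %>% compile(optimizer = "adam", loss = "binary_crossentropy")
})

model %>% fit(ds, epochs = 2)
```

A `tf_dataset` can also be passed as `validation_data` to `fit()`, which sidesteps the generator-based validation step where the freeze was observed.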
Fixed on main now.
Whenever I try to train a model with multiple GPUs and a mirrored strategy, training freezes at the first validation step. If I don't use validation data, training freezes after the last epoch.
The equivalent code in Python runs without problems on the same machine and conda environment.
These are my current settings: