Hi
I am using MultiWorkerMirroredStrategy and tf.estimator.train_and_evaluate for distributed training with 3 epochs.
Please find below the information:
GPU: 4 x NVIDIA Tesla V100
Dataset: COCO
Model: EfficientDet-D5
Tensorflow: 2.4.0-gpu
Error when trying to train this model:
Bad status from CompleteGroupDistributed: Failed precondition: Device /job:worker/replica:0/task:1/device:GPU:0 current incarnation doesn't match with one in the group. This usually means this worker has restarted but the collective leader hasn't, or this worker connects to a wrong cluster.
I have changed a few lines in the main.py file.
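Since the error message says the worker "connects to a wrong cluster" or has restarted without the collective leader, one thing worth checking is that every worker exports an identical TF_CONFIG cluster spec, differing only in its task index. Below is a minimal sketch of what that looks like; the hostnames, ports, and the tf_config_for helper are hypothetical placeholders, not taken from my actual setup.

```python
import json
import os

# Hypothetical 2-worker cluster spec; hostnames and ports are placeholders.
# Every worker must share an IDENTICAL "cluster" section; only "task.index"
# differs. A mismatch here is one cause of the CompleteGroupDistributed
# "incarnation doesn't match" failure.
cluster = {"worker": ["host1:12345", "host2:12345"]}

def tf_config_for(task_index):
    # Worker 0 acts as the collective leader in MultiWorkerMirroredStrategy;
    # restarting a non-leader worker while the leader keeps running can also
    # trigger the same error, so all workers should be restarted together.
    return {"cluster": cluster, "task": {"type": "worker", "index": task_index}}

# On each machine, export TF_CONFIG before the TensorFlow process starts,
# e.g. on worker 0:
os.environ["TF_CONFIG"] = json.dumps(tf_config_for(0))
print(os.environ["TF_CONFIG"])
```

With tf.estimator, the strategy is then passed via tf.estimator.RunConfig(train_distribute=...), and TF_CONFIG must already be set when the process launches.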
FYI: Using only train mode