google / automl

Google Brain AutoML
Apache License 2.0

MultiWorkerMirroredStrategy for distributed training not working on GPUs #964

Open ankur47 opened 3 years ago

ankur47 commented 3 years ago

Hi, I am using MultiWorkerMirroredStrategy and tf.estimator.train_and_evaluate for distributed training with 3 epochs. Please find the setup information below:

GPU: 4 x NVIDIA Tesla V100
Dataset: COCO
Model: EfficientDet-D5
TensorFlow: 2.4.0-gpu

Error when trying to train this model: Bad status from CompleteGroupDistributed: Failed precondition: Device /job:worker/replica:0/task:1/device:GPU:0 current incarnation doesn't match with one in the group. This usually means this worker has restarted but the collective leader hasn't, or this worker connects to a wrong cluster.
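
For context, this error usually means a worker joined with a different cluster definition than the leader, or that the workers were not restarted together. Below is a minimal sketch of the per-worker setup MultiWorkerMirroredStrategy expects; the hostnames, ports, and two-worker layout are placeholders, not my actual cluster:

```python
import json
import os

import tensorflow as tf

# Placeholder cluster spec: it must list every worker and be identical on all of them.
cluster = {"worker": ["host1:12345", "host2:12345"]}

# Each worker sets only its own task index (0 is the collective leader on host1,
# 1 on host2). TF_CONFIG has to be set before the strategy is created.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": cluster,
    "task": {"type": "worker", "index": 0},
})

# All workers must be (re)started together so their collective "incarnations" match.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()
```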

I have changed a few lines in the main.py file:

[screenshots of the modified main.py lines]
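
For readers who cannot see the screenshots, the change is roughly of this shape: create the multi-worker strategy and pass it to the Estimator's RunConfig via train_distribute. This is only a sketch; model_fn, train_input_fn, eval_input_fn, the model_dir, and max_steps stand in for the repo's existing EfficientDet code and my real settings:

```python
import tensorflow as tf

# Sketch only: model_fn and the input_fns stand in for the existing EfficientDet ones.
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

run_config = tf.estimator.RunConfig(
    model_dir="/tmp/efficientdet-d5",  # placeholder model directory
    train_distribute=strategy,         # distribute training across the workers
)

estimator = tf.estimator.Estimator(
    model_fn=model_fn,                 # existing EfficientDet model_fn
    config=run_config,
)

train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn, max_steps=100000)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn)
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
```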

FYI: I am running in train mode only.

DirkFi commented 2 years ago

Same error here. Has your problem been solved?