baofff opened this issue 2 years ago
I trained on 8 A100s or 8 GeForce RTX 2080 Tis. On both sets of devices, training proceeds for a few steps and then gets stuck at a random step. Does anyone have the same issue?

Same.
Do you have any error messages? Have you checked GPU utilization when this happens?
I also had a similar issue during training. In my case, the problem was the lock file used by the custom torch ops. Because a previous training run crashed, the torch extension's lock file had not been deleted, so later runs kept waiting to use that op.
You should make sure such locks are cleaned up before training.
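For what it's worth, here is a minimal sketch of how those stale locks can be cleared, assuming the default cache location (`~/.cache/torch_extensions`, or whatever `TORCH_EXTENSIONS_DIR` points to) and that no other job is currently building an extension. The directory layout and the lock file name are assumptions based on my setup, so double-check the paths before deleting anything:

```python
import os
import pathlib

# Assumed default build cache for torch extensions; override via TORCH_EXTENSIONS_DIR.
cache_root = os.environ.get(
    "TORCH_EXTENSIONS_DIR",
    os.path.expanduser("~/.cache/torch_extensions"),
)

# Remove files named "lock" left behind by a crashed build of a custom op.
for lock in pathlib.Path(cache_root).rglob("lock"):
    print(f"Removing stale lock file: {lock}")
    lock.unlink()
```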