google-research / vdm


The training gets stuck at `self.p_train_step` at a random step. #3

Open · baofff opened this issue 2 years ago

baofff commented 2 years ago

I trained on 8 A100s and on 8 GeForce RTX 2080 Tis. On both sets of devices, training proceeds for a few steps and then gets stuck at a random step. Does anyone have the same issue?
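One way to narrow down where the hang happens: JAX dispatches device work asynchronously, so the Python-level step counter can run ahead of the devices, and the step reported as "stuck" can be misleading. Below is a minimal, self-contained sketch of the pattern, with a toy pmapped step standing in for the repo's real `p_train_step` (the names and shapes here are illustrative, not the repo's actual code):

```python
import time
import jax
import jax.numpy as jnp

# Toy pmapped step; a hypothetical stand-in for the repo's p_train_step.
@jax.pmap
def toy_train_step(state, batch):
    return state + batch.mean(), batch.mean()

n_dev = jax.local_device_count()
state = jnp.zeros((n_dev,))
batch = jnp.ones((n_dev, 128))

for step in range(100):
    state, metrics = toy_train_step(state, batch)
    # Force host-device sync so a hang surfaces at the exact step where a
    # device stalls, instead of many steps later at some blocking call.
    jax.block_until_ready(metrics)
    print(f"step {step} done at {time.strftime('%H:%M:%S')}", flush=True)
```

Checking GPU utilization at the stuck step (as asked in the next comment) then indicates whether the devices are busy in a collective or sitting idle.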

LuChengTHU commented 2 years ago

Same.

pjh4993 commented 2 years ago

> I trained on 8 A100s and on 8 GeForce RTX 2080 Tis. On both sets of devices, training proceeds for a few steps and then gets stuck at a random step. Does anyone have the same issue?

Do you have any error message? Have you checked GPU utilization when this happens?

I also had a similar issue during training. In my case, the problem was the lock file used by custom torch ops: a previous training run had crashed and the torch extension's lock file was never deleted, so later runs sat waiting to use that op.

You should check that such locks are cleaned up before training.
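For torch-based setups (the situation described above, not the JAX code in this repo), here is a minimal cleanup sketch, assuming the default extension build cache: `torch.utils.cpp_extension` builds under `~/.cache/torch_extensions` (or `TORCH_EXTENSIONS_DIR` if set) and guards each build directory with a file named `lock`:

```python
import os
from pathlib import Path

# Default build cache for torch C++/CUDA extensions; respect the
# TORCH_EXTENSIONS_DIR override if it is set.
cache_dir = Path(os.environ.get("TORCH_EXTENSIONS_DIR",
                                Path.home() / ".cache" / "torch_extensions"))

# Each extension build directory is guarded by a file literally named
# "lock"; a crashed run can leave it behind, and later runs then block
# waiting on it indefinitely.
for lock in cache_dir.rglob("lock"):
    print(f"removing stale lock: {lock}")
    lock.unlink()
```

Run this only when no other process is compiling an extension, since an active build legitimately holds the lock.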