Hey, I used the training command in ./train_1st_phase.sh with only one main change: I removed the distributed-training arguments, since I have just 1 GPU. Training runs fine for one epoch, but then it gets stuck in the test stage forever. (I know it's the testing stage because all 118 batches of the epoch have already completed.)