openai / guided-diffusion


Error when training with 1 GPU: RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. #29

Open jaketalyor32325 opened 2 years ago

jaketalyor32325 commented 2 years ago

Settings: Win10 Pro, Python 3.7.9, PyTorch 1.8.1+cu111, 1 GPU, GLOO backend, Jupyter notebook

The other scripts run fine: I can run classifier_sample.py and super_res_sample.py, but when I try to run classifier_train.py I get a runtime error.

```
...\torch\distributed\distributed_c10d.py in broadcast(tensor, src, group, async_op)
   1027             return work
   1028         else:
-> 1029             work.wait()
   1030
   1031
```

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
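For context, this error class is easy to reproduce outside of distributed training. The following is a minimal sketch (not the repo's code) of the autograd rule that the GLOO broadcast appears to be tripping: an in-place write into a leaf tensor that requires grad is forbidden unless it happens under `torch.no_grad()`.

```python
import torch

# A leaf tensor with requires_grad=True may not be mutated in place
# while autograd is tracking operations on it:
p = torch.zeros(3, requires_grad=True)
try:
    p += 1.0  # raises the same RuntimeError as in the traceback above
except RuntimeError as e:
    print(e)

# Wrapping the in-place write in no_grad() is the usual remedy:
with torch.no_grad():
    p += 1.0
print(p)  # tensor([1., 1., 1.], requires_grad=True)
```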

These are the arguments and training commands I used:

```
TRAIN_FLAGS="--iterations 300000 --anneal_lr True --batch_size 256 --lr 3e-4 --save_interval 10000 --weight_decay 0.05"
CLASSIFIER_FLAGS="--image_size 128 --classifier_attention_resolutions 32,16,8 --classifier_depth 2 --classifier_width 128 --classifier_pool attention --classifier_resblock_updown True --classifier_use_scale_shift_norm True"

%run scripts/classifier_train.py --data_dir r"G:\data_set\imagenette2-160\train" $TRAIN_FLAGS $CLASSIFIER_FLAGS
```

Thanks in advance for any comments and assistance.

DCNemesis commented 2 years ago

@jaketalyor32325 I was able to get training to run by commenting out dist_util.sync_params(self.model.parameters()) in train_util.py. I assume the parameters only really need to be synced across multiple GPUs.
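If you would rather keep the call than delete it, one option is to guard the broadcast behind a world-size check. This is a sketch, not tested against the repo; it assumes torch.distributed was already initialized (as dist_util.setup_dist() does) and that the line sits inside TrainLoop, where `self` and `dist_util` are in scope:

```python
import torch.distributed as dist

# Guarded alternative to commenting out the call entirely: only broadcast
# parameters when more than one rank participates, so a single-GPU run
# never reaches the GLOO broadcast that raises the error.
if dist.is_initialized() and dist.get_world_size() > 1:
    dist_util.sync_params(self.model.parameters())
```

With a single process the condition is false and the broadcast is skipped, while multi-GPU runs behave exactly as before.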

ONobody commented 1 year ago

@jaketalyor32325 Hello, did you need to make any adjustments to the dataset-handling code when training guided_classifier on your own dataset?