Open jaketalyor32325 opened 2 years ago
@jaketalyor32325 I was able to comment out dist_util.sync_params(self.model.parameters())
in train_util.py
and get the training to run. I assume the parameters only need to be synced across multiple GPUs.
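For anyone who would rather not delete the call outright, here is a minimal sketch of the same idea: guard the sync with a world-size check so single-GPU runs skip it entirely. The helper name `maybe_sync_params` is hypothetical; only `dist_util.sync_params` and `train_util.py` come from the original code.

```python
import torch
import torch.distributed as dist

def maybe_sync_params(params):
    """Broadcast parameters from rank 0, but only when there is
    actually more than one process to synchronize."""
    if not (dist.is_available() and dist.is_initialized()):
        return  # single-process run: nothing to sync
    if dist.get_world_size() <= 1:
        return
    # no_grad avoids in-place writes on leaf tensors that require grad
    with torch.no_grad():
        for p in params:
            dist.broadcast(p, 0)
```

The call to dist_util.sync_params(self.model.parameters()) in train_util.py could then be replaced with this guard instead of being commented out.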
@jaketalyor32325 Hello, did you need to make any adjustments to the dataset-loading part of the code when training the guided classifier on your own dataset?
Settings: Windows 10 Pro, Python 3.7.9, PyTorch 1.8.1+cu111, 1 GPU, GLOO backend, Jupyter notebook
I can run the other scripts fine, including classifier_sample.py and super_res_sample.py, but when I try to run classifier_train.py I get a runtime error.
...\torch\distributed\distributed_c10d.py in broadcast(tensor, src, group, async_op)
   1027             return work
   1028         else:
-> 1029             work.wait()
   1030
   1031

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
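The error arises because broadcast writes the received values into the parameter in place, and a leaf tensor with requires_grad=True rejects in-place writes outside of a no_grad context. A common workaround (an assumption on my part, not a confirmed fix for this repo) is to perform the broadcast under torch.no_grad():

```python
import torch
import torch.distributed as dist

def sync_params(params):
    # Broadcasting a leaf parameter directly can trigger:
    #   RuntimeError: a leaf Variable that requires grad is being used
    #   in an in-place operation.
    # Running the broadcast under no_grad writes the values without
    # touching autograd, which avoids the error on the GLOO backend.
    for p in params:
        with torch.no_grad():
            dist.broadcast(p, 0)
```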
These are the arguments and training commands I used:
TRAIN_FLAGS="--iterations 300000 --anneal_lr True --batch_size 256 --lr 3e-4 --save_interval 10000 --weight_decay 0.05"
CLASSIFIER_FLAGS="--image_size 128 --classifier_attention_resolutions 32,16,8 --classifier_depth 2 --classifier_width 128 --classifier_pool attention --classifier_resblock_updown True --classifier_use_scale_shift_norm True"
%run scripts/classifier_train.py --data_dir r"G:\data_set\imagenette2-160\train" $TRAIN_FLAGS $CLASSIFIER_FLAGS
Thanks for any comments and assistance in advance.