I use a command like the following to train my model:
python3 train_control_var.py \
  --batch_size 8 \
  --dataset_name Magicbrush \
  --data_dir /data2/caikaixin_m20/magicbrush \
  --gpus 4 \
  --output_dir /data2/ControlVAR-main/local_output \
  --multi_cond True \
  --config configs/train_mask_var_Magicbrush_d12.yaml
Although I have included the --gpus 4 argument, the training process only utilizes GPU 0. Is something missing in the train_control_var.py script for distributed training? Additionally, I noticed some logical differences between the HPU and non-HPU code paths. Could these differences be affecting multi-GPU training?
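For context, here is roughly what I would expect the multi-GPU path to look like if the script relies on torch.distributed / DistributedDataParallel. This is only my assumption, not the actual ControlVAR code; the model and training loop below are placeholders:

# Sketch of a typical torch.distributed / DDP setup (my assumption only;
# the model and training loop are placeholders, not ControlVAR code).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # When launched with torchrun, each spawned process receives LOCAL_RANK etc.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # Placeholder model standing in for the VAR model built from the config.
    model = torch.nn.Linear(16, 16).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # ... the DataLoader would need a DistributedSampler, then the usual training loop ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

With a setup like that, the launch would be something like:

torchrun --nproc_per_node=4 train_control_var.py ... (same arguments as above)

I am not sure whether passing --gpus 4 to a plain python3 launch is supposed to spawn the extra processes internally, or whether a launcher like torchrun is required here, which is why I am asking.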