lxa9867 / ControlVAR

This is the official implementation for ControlVAR.

About distributed training #8

Closed QiopeWallt closed 1 month ago

QiopeWallt commented 1 month ago

I use a command like the following to train my model:

python3 train_control_var.py \
    --batch_size 8 \
    --dataset_name Magicbrush \
    --data_dir /data2/caikaixin_m20/magicbrush \
    --gpus 4 \
    --output_dir /data2/ControlVAR-main/local_output \
    --multi_cond True \
    --config configs/train_mask_var_Magicbrush_d12.yaml

Although I have included the --gpus 4 argument, the training process only utilizes GPU 0. Is something missing in the train_control_var.py script for distributed training? Additionally, I noticed some logical differences between the HPU and non-HPU versions. Could these differences be affecting multi-GPU training?

lxa9867 commented 1 month ago

Hi,

We use MPI to set the distributed environment variables on our cluster. If you are running locally, you need to use torchrun to set those variables.
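For a single-node 4-GPU run, a launch along these lines should work. This is a minimal sketch, not a tested invocation from the repo: it assumes train_control_var.py reads the standard PyTorch distributed environment variables (RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT), which torchrun exports for each worker process, and it reuses the flags from the command above:

```bash
# Minimal sketch of a local single-node, 4-GPU launch.
# torchrun spawns 4 processes and exports RANK, LOCAL_RANK, WORLD_SIZE,
# MASTER_ADDR, and MASTER_PORT to each one; --standalone sets up the
# rendezvous on localhost so no address/port needs to be set by hand.
torchrun --standalone --nproc_per_node=4 train_control_var.py \
    --batch_size 8 \
    --dataset_name Magicbrush \
    --data_dir /data2/caikaixin_m20/magicbrush \
    --gpus 4 \
    --output_dir /data2/ControlVAR-main/local_output \
    --multi_cond True \
    --config configs/train_mask_var_Magicbrush_d12.yaml
```

On a cluster launched via MPI (e.g. mpirun), the equivalent rank and world-size information comes from the MPI environment instead, which is why the script behaves differently when those variables are not set locally.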