hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: ddp training in diffusion #3598

Open zhangvia opened 1 year ago

zhangvia commented 1 year ago

🐛 Describe the bug

How can I use DDP training in the diffusion example? I looked at train_ddp.yaml, but it doesn't seem to differ from train_colossalai.yaml. How do I set the number of GPUs and nodes, or the port of each node? Is there any documentation about this?

Environment

No response

JThh commented 1 year ago

The two configurations are actually different. You may change the settings from this line onwards. To run the code, you can execute `python main.py --logdir /tmp/ --train --base configs/train_colossalai.yaml --ckpt 512-base-ema.ckpt` as per our guide.

NatalieC323 commented 1 year ago

Thanks for your question. Please first refer to the README.md to change the configuration. For instance, the number of devices in the YAML file is the number of GPUs. As for the training strategy, you may need to check which strategy is set in main.py, and then use the command above to run training.
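
For orientation, here is a minimal sketch (not the actual main.py) of how a devices value from the YAML typically ends up in a PyTorch Lightning Trainer; the concrete keys and strategy wiring in the example may differ.

```python
# Minimal sketch, assuming single-node multi-GPU training with PyTorch Lightning.
# The variable names below are illustrative, not the exact ones used in main.py.
import pytorch_lightning as pl

num_gpus = 4  # corresponds to the "devices" entry in the YAML config

trainer = pl.Trainer(
    accelerator="gpu",
    devices=num_gpus,   # GPUs on this single machine
    strategy="ddp",     # or the Colossal-AI strategy selected in main.py
    precision=16,
    max_epochs=10,
)
# trainer.fit(model, datamodule)  # model/data come from the rest of the config
```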

zhangvia commented 1 year ago

I know that, but the parameters in the YAML only let me train the model on multiple GPUs within a single machine. What if I want to train it across different machines? How do I set the number of nodes, the number of GPUs per node, and the IP address and port of each node? I'd appreciate any advice.

NatalieC323 commented 1 year ago

The Stable Diffusion example is built on PyTorch Lightning. For detailed usage, please refer to the Trainer documentation: https://lightning.ai/docs/pytorch/latest/api/lightning.pytorch.trainer.trainer.Trainer.html#lightning.pytorch.trainer.trainer.Trainer
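
For illustration only (addresses, port, and sizes below are placeholders), multi-node DDP in Lightning is normally configured through num_nodes together with the standard rendezvous environment variables, which every participating node must set before launching:

```python
# Hypothetical two-node setup with 8 GPUs per node.
# MASTER_ADDR / MASTER_PORT / NODE_RANK are the standard PyTorch / Lightning
# rendezvous variables and must be set on every participating machine.
import os
import pytorch_lightning as pl

os.environ.setdefault("MASTER_ADDR", "192.168.1.10")  # IP of the rank-0 node (placeholder)
os.environ.setdefault("MASTER_PORT", "29500")         # any free port (placeholder)
os.environ.setdefault("NODE_RANK", "0")               # 0 on the first node, 1 on the second

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,      # GPUs per node
    num_nodes=2,    # total number of machines
    strategy="ddp",
)
# trainer.fit(model, datamodule)
```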

zhangvia commented 1 year ago

Thanks for your reply. So is Colossal-AI a strategy for reducing training GPU memory on a single machine? Does it help with distributed training across multiple nodes?

binmakeswell commented 1 year ago

Hi @zhangvia, Colossal-AI is designed for distributed training on multiple nodes, but some of our features are also applicable to single GPUs or single nodes.
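
As a sketch only: in the Lightning-based example, switching to the Colossal-AI strategy usually means passing a ColossalAIStrategy instance to the Trainer. Depending on the Lightning version, the class lives in pytorch_lightning.strategies or in the separate lightning-colossalai package, and its constructor arguments are version-dependent.

```python
# Illustrative sketch: selecting the Colossal-AI strategy in PyTorch Lightning.
import pytorch_lightning as pl
from pytorch_lightning.strategies import ColossalAIStrategy  # newer versions: from lightning_colossalai import ColossalAIStrategy

trainer = pl.Trainer(
    accelerator="gpu",
    devices=8,
    num_nodes=2,
    strategy=ColossalAIStrategy(),  # replaces "ddp"; constructor kwargs are version-dependent
    precision=16,                   # the Colossal-AI strategy is typically used with fp16
)
```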

zhangvia commented 1 year ago

So when I use the Colossal-AI strategy in PyTorch Lightning, do I get all the features of Colossal-AI?