zhangvia opened this issue 1 year ago
The two configurations are actually different. You may change the settings from this line onwards. To run the code, execute:
python main.py --logdir /tmp/ --train --base configs/train_colossalai.yaml --ckpt 512-base-ema.ckpt
as described in our guide.
Thanks for your question. Please first refer to README.md to change the configurations. For instance, the number of devices in the YAML file is the number of GPUs. For the training strategy, verify the strategy set in main.py, then use the command above to launch training.
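As a concrete illustration, the relevant trainer fields might look like the sketch below. The exact key layout is an assumption based on the standard PyTorch Lightning trainer section used in these Stable Diffusion configs; check your own train_colossalai.yaml for the actual keys:

```yaml
# Hypothetical trainer section of train_colossalai.yaml
lightning:
  trainer:
    accelerator: gpu
    devices: 8       # GPUs per node
    num_nodes: 2     # number of machines participating in training
```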
I know that, but the parameters in the YAML only let me train the model on multiple GPUs on a single machine. What if I want to train it across different machines? How do I set the number of nodes, the number of devices per node, and the IP address and port of each node? I'd appreciate any advice you can give.
Stable Diffusion here is built on the PyTorch Lightning Trainer. For detailed usage, please refer to this link: https://lightning.ai/docs/pytorch/latest/api/lightning.pytorch.trainer.trainer.Trainer.html#lightning.pytorch.trainer.trainer.Trainer
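For multi-node runs, a common pattern with PyTorch Lightning is to export the rendezvous environment variables on each machine before launching the same training command. The sketch below assumes Lightning's default cluster environment (which reads MASTER_ADDR, MASTER_PORT, and NODE_RANK); the IP address and port are placeholders:

```shell
# On node 0 (the master node):
export MASTER_ADDR=192.168.1.10   # placeholder: reachable IP of node 0
export MASTER_PORT=29500          # placeholder: any free port
export NODE_RANK=0
python main.py --logdir /tmp/ --train \
    --base configs/train_colossalai.yaml --ckpt 512-base-ema.ckpt

# On node 1, run the same command with the same MASTER_ADDR/MASTER_PORT, but:
export NODE_RANK=1
```

This is only a sketch; it assumes num_nodes in the trainer config matches the number of machines you launch on.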
Thanks for your reply. So is ColossalAI a strategy for reducing GPU memory during training on a single machine? Does it help with distributed training across multiple nodes?
Hi @zhangvia Colossal-AI is designed for distributed training on multiple nodes, but some of our features are also applicable to single GPUs or single nodes.
So, when I use the colossal-ai strategy in PyTorch Lightning, do I get all of Colossal-AI's features?
🐛 Describe the bug
How can I use DDP training for the diffusion model? I looked at train_ddp.yaml, but there is nothing different from train_colossalai.yaml. How do I set the number of GPUs and nodes, or the ports of the nodes? Do you have any docs about this?
Environment
No response