andreemic opened this issue 1 year ago
Me too. After it prints LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3], CUDA just core dumps.
Wrap lines 20-35 in if __name__ == '__main__':
and run training with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python tutorial_train.py
lines 20-35 of what file?
The file you originally wrote about: tutorial_train.py
Still have the same problem
Had the same issue, and this fixed it, although I changed the trainer code to the following to explicitly use the second GPU:
trainer = pl.Trainer(accelerator="gpu", devices=[1], precision=32, callbacks=[logger])
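For context, a minimal sketch of how the devices argument behaves in recent PyTorch Lightning versions (assuming the same logger callback as in the tutorial): passing a list selects specific GPU indices, while passing an int selects a number of devices.

# devices=[1] -> use only the GPU with index 1 (the second GPU)
trainer = pl.Trainer(accelerator="gpu", devices=[1], precision=32, callbacks=[logger])
# devices=2 -> use any 2 of the visible GPUs
trainer = pl.Trainer(accelerator="gpu", devices=2, precision=32, callbacks=[logger])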
Wrap lines 20-35 in
if __name__ == '__main__':
and run training with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python tutorial_train.py
accelerator_connector.py:287: LightningDeprecationWarning: Passing `Trainer(accelerator='ddp')` has been deprecated in v1.5 and will be removed in v1.7. Use `Trainer(strategy='ddp')` instead.
I think we should use strategy='ddp'.
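For reference, a before/after sketch of the change the deprecation warning asks for, keeping the other arguments as in the tutorial:

# Deprecated since Lightning 1.5, removed in 1.7:
# trainer = pl.Trainer(accelerator='ddp', gpus=2, precision=32, callbacks=[logger])
# Recommended replacement:
trainer = pl.Trainer(strategy='ddp', gpus=2, precision=32, callbacks=[logger])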
Hi @JunnYu, this is how I do it:
trainer = pl.Trainer(strategy="ddp", accelerator="gpu", devices=[0,1], precision=32, callbacks=[logger], max_epochs=max_epochs)
I am running it on 2X3090s.
I think the final answer is a combination of 3 answers above, by @tg-bomze, @SuroshAhmadZobair and @JunnYu. You need to apply the following modifications to the original tutorial_train.py script:

1. Wrap lines 20-35 of tutorial_train.py with if __name__ == "__main__":.
2. Export the GPUs you want to use: export CUDA_VISIBLE_DEVICES=0,1. If you have more GPUs, or wish to use specific GPUs, feel free to use your own IDs, for example export CUDA_VISIBLE_DEVICES=0,3,6,7. If you only have 2 GPUs, then use the first command.
3. Specify the number of GPUs in the pl.Trainer object, i.e. trainer = pl.Trainer(gpus=2, precision=32, callbacks=[logger]).
4. Add strategy='ddp' to the pl.Trainer object, so you need to modify that line of code again to get trainer = pl.Trainer(strategy='ddp', gpus=2, precision=32, callbacks=[logger]).

Eventually, your entire training script should look like this:
from share import *
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from tutorial_dataset import MyDataset
from cldm.logger import ImageLogger
from cldm.model import create_model, load_state_dict
# Configs
resume_path = './models/control_sd15_ini.ckpt'
batch_size = 2
logger_freq = 300
learning_rate = 1e-5
sd_locked = True
only_mid_control = False
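# Keeping the setup below under the __main__ guard prevents it from re-running when
# worker processes import this module (needed when the multi-GPU launcher uses
# multiprocessing 'spawn').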
if __name__ == "__main__":
    # First use cpu to load models. Pytorch Lightning will automatically move it to GPUs.
    model = create_model('./models/cldm_v15.yaml').cpu()
    model.load_state_dict(load_state_dict(resume_path, location='cpu'))
    model.learning_rate = learning_rate
    model.sd_locked = sd_locked
    model.only_mid_control = only_mid_control

    # Misc
    dataset = MyDataset()
    dataloader = DataLoader(dataset, num_workers=0, batch_size=batch_size, shuffle=True)
    logger = ImageLogger(batch_frequency=logger_freq)
    trainer = pl.Trainer(strategy='ddp', gpus=2, precision=32, callbacks=[logger])

    # Train!
    trainer.fit(model, dataloader)
As you can see, the modified script does not differ much from the original script. Once you have everything set up, simply run the training command:
python tutorial_train.py
It worked for me on 2xQuadro RTX 6000.
P.S. I had to reduce the batch size from 4 to 2 to make sure I don't get an out-of-memory error. If you have enough memory, you can stick to batch size 4.

P.P.S. As I said earlier, I had to combine several previous responses to make things work. I thought it would be easier for others if these answers were merged into one. It is also possible that different people have different problems when running things, so I cannot guarantee that this solution is 100% correct. In any case, I hope it helps.
So I did what @MikaYeghi suggested, a total copy-paste, yet my code is still not running properly on 2x RTX 3090.
If I run it on 2 GPUs it outputs this and hangs in there:
No module 'xformers'. Proceeding without it.
ControlLDM: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loaded model config from [./models/cldm_v15.yaml]
Loaded state_dict from [./models/control_sd15_ini.ckpt]
initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
If I run it with only 1 GPU, like the default tutorial, it runs well and the output looks like this:
No module 'xformers'. Proceeding without it.
ControlLDM: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loaded model config from [./models/cldm_v15.yaml]
Loaded state_dict from [./models/control_sd15_ini.ckpt]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:118: UserWarning: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
rank_zero_warn("You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.")
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:280: LightningDeprecationWarning: Base `LightningModule.on_train_batch_start` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
rank_zero_deprecation(
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:287: LightningDeprecationWarning: Base `Callback.on_train_batch_end` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
rank_zero_deprecation(
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
---------------------------------------------------------
0 | model | DiffusionWrapper | 859 M
1 | first_stage_model | AutoencoderKL | 83.7 M
2 | cond_stage_model | FrozenCLIPEmbedder | 123 M
3 | control_model | ControlNet | 361 M
---------------------------------------------------------
1.2 B Trainable params
206 M Non-trainable params
1.4 B Total params
5,710.058 Total estimated model params size (MB)
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:110: UserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/25000 [00:00<?, ?it/s]/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:56: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 2. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
warning_cache.warn(
Data shape for DDIM sampling is (2, 4, 64, 64), eta 0.0
Running DDIM Sampling with 50 timesteps
DDIM Sampler: 100%|█████████████████████████████| 50/50 [00:35<00:00, 1.39it/s]
Epoch 0: 0%| | 7/25000 [00:49<48:54:16, 7.04s/it, loss=0.0126, v_num=26, trai
What I also noticed is that with the 2-GPU setup, my cards somehow show 100% GPU utilization but only draw around 170 W out of 350 or 420 (one is rated 350 W and the other 420 W). Also, VRAM usage only goes to 6-7 GB, instead of the 22 GB I see when running properly on 1 GPU. Any ideas what could be wrong? I tried many things yesterday.
update:
- just got an idea: could Conda be the problem? Should I stick with a pip venv? Or CUDA? I believe I have CUDA 12.3 with PyTorch compiled for 12.1.
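For reference, a minimal check of which CUDA build PyTorch is actually using (run in the training environment). PyTorch wheels bundle their own CUDA runtime, so a system CUDA 12.3 alongside a 12.1-built PyTorch is normally fine as long as the driver is new enough.

import torch
print(torch.__version__)          # PyTorch build
print(torch.version.cuda)         # CUDA version PyTorch was compiled against (e.g. 12.1)
print(torch.cuda.is_available())  # whether the driver/runtime combination works at all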
I think I also had a similar issue at some point, but I resolved it easily... Can you check if you have all the GPUs available and visible? Maybe try running the export CUDA_VISIBLE_DEVICES=0,1 command? Your problem rings a bell, but I can't recall how I resolved it.
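A minimal way to check what the process actually sees (run from the same shell and environment you launch training from):

import torch
print(torch.cuda.device_count())            # should report 2 on a 2-GPU machine
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i)) # names of the visible devices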
I think the GPUs are available properly. I did not try this export command though, as I managed to successfully run a similar tutorial from Diffusers, which uses accelerate; in accelerate I set 2 GPUs and with that it works fine on this machine. Yet I would still like to run this vanilla PyTorch tutorial, as I don't like Diffusers much. Will try tomorrow when training is done. Thanks for the help anyway :)
Hi @blacklig, were you able to run the vanilla PyTorch tutorial on a multi-GPU setup? I'm facing similar issues.
Hey! I'm trying to train on multiple GPUs and consistently getting the following RuntimeError. Here's the modified line in tutorial_train.py:

As soon as I change gpus to 1, training works fine. Anyone have ideas?

The error when training on >1 GPU

What I've tried