andreemic opened this issue 1 year ago
Me too. After it prints LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3], CUDA just core dumps.
Wrap lines 20-35 in if __name__ == '__main__':
and run training with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python tutorial_train.py
lines 20-35 of what file?
The file you originally wrote about: tutorial_train.py
Still have the same problem
Had the same issue, and this fixed it, although I changed the trainer code to the following to explicitly use the second GPU:
trainer = pl.Trainer(accelerator="gpu", devices=[1], precision=32, callbacks=[logger])
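For context, a minimal sketch of how the devices argument behaves in recent PyTorch Lightning versions (assuming the same logger callback as in the tutorial): passing a list selects specific GPU indices, while passing an int selects a number of devices.

# devices=[1] -> use only the GPU with index 1 (the second GPU)
trainer = pl.Trainer(accelerator="gpu", devices=[1], precision=32, callbacks=[logger])
# devices=2 -> use any 2 of the visible GPUs
trainer = pl.Trainer(accelerator="gpu", devices=2, precision=32, callbacks=[logger])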
Wrap lines 20-35 in
if __name__ == '__main__':
and run training with CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python tutorial_train.py
accelerator_connector.py:287: LightningDeprecationWarning: Passing `Trainer(accelerator='ddp')` has been deprecated in v1.5 and will be removed in v1.7. Use `Trainer(strategy='ddp')` instead.
I think we should use strategy='ddp'.
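For reference, a before/after sketch of the change the deprecation warning asks for, keeping the other arguments as in the tutorial:

# Deprecated since Lightning 1.5, removed in 1.7:
# trainer = pl.Trainer(accelerator='ddp', gpus=2, precision=32, callbacks=[logger])
# Recommended replacement:
trainer = pl.Trainer(strategy='ddp', gpus=2, precision=32, callbacks=[logger])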
Hi @JunnYu, this is how I do it:
trainer = pl.Trainer(strategy="ddp", accelerator="gpu", devices=[0,1], precision=32, callbacks=[logger], max_epochs=max_epochs)
I am running it on 2X3090s.
I think the final answer is a combination of 3 answers above, by @tg-bomze, @SuroshAhmadZobair and @JunnYu. You need to apply the following modifications to the original tutorial_train.py script:

1. Wrap lines 20-35 of tutorial_train.py with if __name__ == "__main__":.
2. Export the GPUs you want to use: export CUDA_VISIBLE_DEVICES=0,1. If you have more GPUs, or wish to use specific GPUs, feel free to use your own IDs, for example export CUDA_VISIBLE_DEVICES=0,3,6,7. If you only have 2 GPUs, then use the first command.
3. Specify the number of GPUs in the pl.Trainer object, i.e. trainer = pl.Trainer(gpus=2, precision=32, callbacks=[logger]).
4. Add strategy='ddp' to the pl.Trainer object, so you need to modify that line of code again to get trainer = pl.Trainer(strategy='ddp', gpus=2, precision=32, callbacks=[logger]).

Eventually, your entire training script should look like this:
from share import *
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from tutorial_dataset import MyDataset
from cldm.logger import ImageLogger
from cldm.model import create_model, load_state_dict
# Configs
resume_path = './models/control_sd15_ini.ckpt'
batch_size = 2
logger_freq = 300
learning_rate = 1e-5
sd_locked = True
only_mid_control = False
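# Keeping the setup below under the __main__ guard prevents it from re-running when
# worker processes import this module (needed when the multi-GPU launcher uses
# multiprocessing 'spawn').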
if __name__ == "__main__":
    # First use cpu to load models. Pytorch Lightning will automatically move it to GPUs.
    model = create_model('./models/cldm_v15.yaml').cpu()
    model.load_state_dict(load_state_dict(resume_path, location='cpu'))
    model.learning_rate = learning_rate
    model.sd_locked = sd_locked
    model.only_mid_control = only_mid_control

    # Misc
    dataset = MyDataset()
    dataloader = DataLoader(dataset, num_workers=0, batch_size=batch_size, shuffle=True)
    logger = ImageLogger(batch_frequency=logger_freq)
    trainer = pl.Trainer(strategy='ddp', gpus=2, precision=32, callbacks=[logger])

    # Train!
    trainer.fit(model, dataloader)
As you can see, the modified script does not differ much from the original script. Once you have everything set up, simply run the training command:
python tutorial_train.py
It worked for me on 2xQuadro RTX 6000.
P.S. I had to reduce the batch size from 4 to 2 to make sure I don't get an out-of-memory error. If you have enough memory, you can stick to batch size 4.

P.P.S. As I said earlier, I had to combine several previous responses to make things work. I thought it would be easier for others if these answers were merged into one. It is also possible that different people have different problems when running things, so I cannot guarantee that this solution is 100% correct. In any case, I hope it helps.
So I did what @MikaYeghi suggested, a total copy-paste, yet my code is still not running properly on 2x RTX 3090.
If I run it on 2 GPUs it outputs this and hangs in there:
No module 'xformers'. Proceeding without it.
ControlLDM: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loaded model config from [./models/cldm_v15.yaml]
Loaded state_dict from [./models/control_sd15_ini.ckpt]
initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
If I run it with only 1 GPU, like the default tutorial, it runs well and the output looks like this:
No module 'xformers'. Proceeding without it.
ControlLDM: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loaded model config from [./models/cldm_v15.yaml]
Loaded state_dict from [./models/control_sd15_ini.ckpt]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:118: UserWarning: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
rank_zero_warn("You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.")
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:280: LightningDeprecationWarning: Base `LightningModule.on_train_batch_start` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
rank_zero_deprecation(
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/configuration_validator.py:287: LightningDeprecationWarning: Base `Callback.on_train_batch_end` hook signature has changed in v1.5. The `dataloader_idx` argument will be removed in v1.7.
rank_zero_deprecation(
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
---------------------------------------------------------
0 | model | DiffusionWrapper | 859 M
1 | first_stage_model | AutoencoderKL | 83.7 M
2 | cond_stage_model | FrozenCLIPEmbedder | 123 M
3 | control_model | ControlNet | 361 M
---------------------------------------------------------
1.2 B Trainable params
206 M Non-trainable params
1.4 B Total params
5,710.058 Total estimated model params size (MB)
/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:110: UserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/25000 [00:00<?, ?it/s]/home/sd4/anaconda3/envs/control/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:56: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 2. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
warning_cache.warn(
Data shape for DDIM sampling is (2, 4, 64, 64), eta 0.0
Running DDIM Sampling with 50 timesteps
DDIM Sampler: 100%|█████████████████████████████| 50/50 [00:35<00:00, 1.39it/s]
Epoch 0: 0%| | 7/25000 [00:49<48:54:16, 7.04s/it, loss=0.0126, v_num=26, trai
What I also noticed is that with the 2-GPU setup, my cards somehow show 100% GPU utilization but only draw around 170 W out of 350 or 420 (one is rated 350 W and the other 420 W). Also, VRAM usage only goes to 6-7 GB, instead of the 22 GB I see when running properly on 1 GPU. Any ideas what could be wrong? I tried many things yesterday.
update:
- just got an idea: could Conda be the problem? Should I stick with a pip venv? Or CUDA? I believe I have CUDA 12.3 with PyTorch compiled for 12.1.
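For reference, a minimal check of which CUDA build PyTorch is actually using (run in the training environment). PyTorch wheels bundle their own CUDA runtime, so a system CUDA 12.3 alongside a 12.1-built PyTorch is normally fine as long as the driver is new enough.

import torch
print(torch.__version__)          # PyTorch build
print(torch.version.cuda)         # CUDA version PyTorch was compiled against (e.g. 12.1)
print(torch.cuda.is_available())  # whether the driver/runtime combination works at all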
I think I also had a similar issue at some point, but I resolved it easily... Can you check if you have all the GPUs available and visible? Maybe try running the export CUDA_VISIBLE_DEVICES=0,1 command? Your problem rings a bell, but I can't recall how I resolved it.
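A minimal way to check what the process actually sees (run from the same shell and environment you launch training from):

import torch
print(torch.cuda.device_count())            # should report 2 on a 2-GPU machine
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i)) # names of the visible devices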
I think the GPUs are available properly. I did not try this export command though, as I managed to successfully run a similar tutorial from Diffusers, which uses accelerate; in accelerate I set 2 GPUs and with that it works fine on this machine. Yet I would still like to run this vanilla PyTorch tutorial, as I don't like Diffusers much. Will try tomorrow when training is done. Thanks for the help anyway :)
Hi @blacklig, were you able to run the vanilla PyTorch tutorial on a multi-GPU setup? I'm facing similar issues.
Hey! I'm trying to train on multiple GPUs and consistently getting the following RuntimeError. Here's the modified line in tutorial_train.py:

As soon as I change gpus to 1, training works fine. Anyone have ideas?

The error when training on >1 GPU

What I've tried