🐛 Describe the bug
When trying to finetune the Teyvat example on 2 GPUs, training crashes right after the first epoch starts to run. The error is:
```
Epoch 0:   0%| | 0/8 [00:00<?]
/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py:437: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
  rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")
Summoning checkpoint.
Traceback (most recent call last):
  File "/mydata/models/ColossalAI/examples/images/diffusion/main.py", line 804, in <module>
    trainer.fit(model, data)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 608, in fit
    call._call_and_handle_interrupt(
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 650, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1112, in _run
    results = self._run_stage()
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1191, in _run_stage
    self._run_train()
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_train
    self.fit_loop.run()
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 213, in advance
    batch_output = self.batch_loop.run(kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(optimizers, kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 202, in advance
    result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 249, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 370, in _optimizer_step
    self.trainer._call_lightning_module_hook(
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1356, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1742, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
    step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 383, in optimizer_step
    return self.precision_plugin.optimizer_step(
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 73, in optimizer_step
    closure_result = closure()
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 149, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 135, in closure
    step_output = self._step_fn()
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 419, in _training_step
    training_step_output = self.trainer._call_strategy_hook("training_step", *kwargs.values())
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1494, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 351, in training_step
    return self.model(*args, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/colossalai/nn/parallel/data_parallel.py", line 282, in forward
    outputs = self.module(*args, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/overrides/base.py", line 98, in forward
    output = self._forward_module.training_step(*inputs, **kwargs)
  File "/mydata/models/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 474, in training_step
    loss, loss_dict = self.shared_step(batch)
  File "/mydata/models/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 925, in shared_step
    loss = self(x, c)
  File "/home/zhijue/anaconda3/envs/ldm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mydata/models/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 937, in forward
    return self.p_losses(x, c, t, *args, **kwargs)
  File "/mydata/models/ColossalAI/examples/images/diffusion/ldm/models/diffusion/ddpm.py", line 994, in p_losses
    logvar_t = self.logvar[t].to(self.device)
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
```
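For reference, the failing line indexes `self.logvar` (which appears to still be on CPU) with timestep indices `t` that live on the GPU. A minimal sketch of the pattern, with stand-in tensors rather than the actual `ddpm.py` objects, plus a possible workaround of moving the indices onto the indexed tensor's device first:

```python
import torch

# Stand-ins (assumed, not the real model state): logvar mimics self.logvar
# left on CPU; t mimics the sampled timestep indices, which end up on CUDA
# in the multi-GPU run.
logvar = torch.zeros(1000)
t = torch.randint(0, 1000, (4,))

# With t on CUDA and logvar on CPU, plain `logvar[t]` raises:
#   RuntimeError: indices should be either on cpu or on the same device
#   as the indexed tensor (cpu)
# Moving the indices to the indexed tensor's device avoids the mismatch:
logvar_t = logvar[t.to(logvar.device)]
print(tuple(logvar_t.shape))  # → (4,)
```

If this is the cause, changing line 994 to something like `self.logvar[t.to(self.logvar.device)].to(self.device)` might work around it, though the cleaner fix would be registering `logvar` so it is moved to the GPU with the module.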
The config I'm using:

```yaml
model:
  base_learning_rate: 1.0e-4
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    parameterization: "v"
    linear_start: 0.00085
    linear_end: 0.0120
    num_timesteps_cond: 1
    ckpt: /mydata/models/ColossalAI/examples/images/diffusion/checkpoints/512-base-ema.ckpt # use ckpt path
    log_every_t: 200
    timesteps: 1000
    first_stage_key: image
    cond_stage_key: txt
    image_size: 64
    channels: 4
    cond_stage_trainable: false
    conditioning_key: crossattn
    monitor: val/loss_simple_ema
    scale_factor: 0.18215
    use_ema: true

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 16
    num_workers: 4
    train:
      target: ldm.data.teyvat.hf_dataset
      params:
        path: Fazzie/Teyvat
        image_transforms:

lightning:
  trainer:
    accelerator: 'gpu'
    devices: 2
    log_gpu_memory: all
    max_epochs: 1
    precision: 16
    auto_select_gpus: True
    strategy:
      target: strategies.ColossalAIStrategy
      params:
        use_chunk: True
        enable_distributed_storage: True
        placement_policy: cuda
        force_outputs_fp32: true
        min_chunk_size: 32

  logger_config:
    wandb:
      target: loggers.WandbLogger
      params:
        name: nowname
        save_dir: "/tmp/diff_log/"
        offline: opt.debug
        id: nowname
```
Environment
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:49:14_PDT_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0
```
GPU: 2x NVIDIA RTX 2080 Ti
Finetuning command:

```
python main.py --logdir /tmp -t -b configs/Teyvat/train_colossalai_teyvat.yaml
```
I only added