Closed lcwLcw123 closed 1 year ago
It works if I pip install pytorch==1.11.0+cu113, but then another problem comes up!
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loggers/tensorboard.py:248: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
rank_zero_warn(
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/42156 [00:00<?, ?it/s]/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:233: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(
[12/18/22 13:03:17] INFO colossalai - colossalai - INFO:
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py:137
step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Epoch 0: 0%| | 1/42156 [00:02<24:24:45, 2.08s/it, loss=1.47, v_num=0, train/loss_simple_step=1.470, train/loss_vlb_step=1.470, [12/18/22 13:03:19] INFO colossalai - colossalai - INFO:
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py:137
step
INFO colossalai - colossalai - INFO: Found overflow. Skip step
Epoch 0: 0%| | 2/42156 [00:03<22:15:38, 1.90s/it, loss=2.81, v_num=0, train/loss_simple_step=4.140, train/loss_vlb_step=4.140, Summoning checkpoint.
[12/18/22 13:03:22] INFO colossalai - ProcessGroup - INFO:
/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/tensor/process_group.py:24
get
INFO colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]
Traceback (most recent call last):
File "/home/liuchaowei/ColossalAI/examples/images/diffusion/main_ISP.py", line 805, in <module>
trainer.fit(model, data)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
call._call_and_handle_interrupt(
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
return function(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
results = self._run_stage()
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
self._run_train()
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
self.fit_loop.run()
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 368, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 81, in optimizer_step
optimizer.step()
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py", line 142, in step
ret = self.optim.step(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/optim/optimizer.py", line 88, in wrapper
return func(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 143, in step
multi_tensor_applier(self.gpu_adam_op, self._dummy_overflow_buf, [g_l, p_l, m_l, v_l], group['lr'],
File "/home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossalai/utils/multi_tensor_apply/multi_tensor_apply.py", line 35, in __call__
return op(self.chunk_size,
RuntimeError: Cannot access data pointer of Tensor that doesn't have storage
Exception raised from data at /opt/conda/envs/3.9/lib/python3.9/site-packages/torch/include/c10/core/TensorImpl.h:1178 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fe2a01e27d2 in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x5f (0x7fe2a01def3f in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x21d1b (0x7fe2418fad1b in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #3: multi_tensor_adam_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float, float, float, float, int, int, int, float) + 0x2e9 (0x7fe2418fb569 in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1c211 (0x7fe2418f5211 in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #5: <unknown function> + 0x1819c (0x7fe2418f119c in /home/liuchaowei/anconda/envs/ldm1/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
<omitting python frames>
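As an aside, the "Found overflow. Skip step" lines earlier in the log are normal at the start of fp16 training: a dynamic loss scaler detects inf/nan gradients, skips the optimizer step, and backs the scale off. A minimal sketch of the idea (not ColossalAI's actual implementation; all names here are illustrative):

```python
# Illustrative dynamic loss scaler: on gradient overflow the optimizer step
# is skipped and the scale is halved; after enough clean steps in a row the
# scale is cautiously grown back.
class DynamicLossScaler:
    def __init__(self, init_scale=2.0 ** 16, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_overflow: bool) -> bool:
        """Return True if the optimizer step should run this iteration."""
        if found_overflow:
            self.scale /= 2        # back off the scale
            self._good_steps = 0
            return False           # skip this step
        self._good_steps += 1
        if self._good_steps % self.growth_interval == 0:
            self.scale *= 2        # grow back after a streak of clean steps
        return True
```

So a handful of skipped steps right after epoch 0 starts is expected behavior, not the bug itself.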
It seems your CUDA driver is not right.
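If the driver is the suspect, one quick sanity check is whether the CUDA version baked into the wheel tag (cu113 → 11.3) is at or below what the driver supports (the "CUDA Version" that nvidia-smi reports). A small illustrative helper, with all names hypothetical:

```python
# Hypothetical check: does a torch wheel tag like "1.11.0+cu113" fit under the
# maximum CUDA version the driver supports (e.g. (11, 6) from nvidia-smi)?
import re
from typing import Optional, Tuple


def wheel_cuda_version(torch_version: str) -> Optional[Tuple[int, int]]:
    """Extract the CUDA version from a wheel tag, e.g. '1.11.0+cu113' -> (11, 3)."""
    m = re.search(r"\+cu(\d+)$", torch_version)
    if not m:
        return None  # CPU-only wheel
    digits = m.group(1)
    return int(digits[:-1]), int(digits[-1])


def wheel_matches_driver(torch_version: str, driver_max: Tuple[int, int]) -> bool:
    """The driver must support at least the wheel's CUDA version."""
    wheel = wheel_cuda_version(torch_version)
    return wheel is not None and wheel <= driver_max
```

For example, a cu113 wheel on a driver that supports CUDA 11.6 is fine, but a cu117 wheel on a driver capped at 11.3 is not.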
We have updated a lot. This issue was closed due to inactivity. Thanks.
🐛 Describe the bug
I have no idea why I get the error "RuntimeError: CUDA error: no kernel image is available for execution on the device" when training the latent diffusion model on a super-resolution task. I would really appreciate it if you could help me out.
Lightning config:
trainer:
  accelerator: gpu
  devices: 1
  log_gpu_memory: all
  max_epochs: 3
  precision: 16
  auto_select_gpus: false
  strategy:
    target: strategies.ColossalAIStrategy
    params:
      use_chunk: true
      enable_distributed_storage: true
      placement_policy: cuda
      force_outputs_fp32: true
  log_every_n_steps: 3
  logger: true
  default_root_dir: /tmp/diff_log/
logger_config:
  wandb:
    target: loggers.WandbLogger
    params:
      name: nowname
      save_dir: /tmp/diff_log/
      offline: opt.debug
      id: nowname
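"No kernel image is available for execution on the device" usually means the compiled extension (here `colossal_C`) was not built for this GPU's compute capability, rather than anything in the config above. As a sketch of that check (names illustrative; `torch.cuda.get_arch_list()` reports tags like `sm_86`, and `torch.cuda.get_device_capability(0)` returns e.g. `(8, 6)`):

```python
# Illustrative check: is a device's compute capability covered by the arch
# tags a binary was compiled for? An exact "sm_XY" entry means a native
# kernel exists; a "compute_XY" (PTX) entry can be JIT-compiled for any
# device of that capability or newer.
def arch_covered(arch_list, capability):
    major, minor = capability
    if f"sm_{major}{minor}" in arch_list:
        return True  # exact binary kernel present
    ptx = [a for a in arch_list if a.startswith("compute_")]
    return any(int(a.split("_")[1]) <= major * 10 + minor for a in ptx)
```

If the capability is not covered, rebuilding the extension with the matching `TORCH_CUDA_ARCH_LIST` (or installing a wheel built for that GPU) is the usual fix.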
/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loggers/tensorboard.py:248: UserWarning: Could not log computational graph since the `model.example_input_array` attribute is not set or `input_array` was not given
rank_zero_warn(
/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:224: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument (try 12 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
rank_zero_warn(
Epoch 0: 0%| | 0/42156 [00:00<?, ?it/s]/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:233: UserWarning: You called `self.log('global_step', ...)` in your `training_step` but the value needs to be floating point. Converting it to torch.float32.
warning_cache.warn(
Summoning checkpoint.
[12/17/22 18:52:57] INFO colossalai - ProcessGroup - INFO: /home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossalai/tensor/process_group.py:24
get
INFO colossalai - ProcessGroup - INFO: NCCL initialize ProcessGroup on [0]
Traceback (most recent call last):
File "/home/liuchaowei/ColossalAI/examples/images/diffusion/main_ISP.py", line 805, in <module>
trainer.fit(model, data)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in fit
call._call_and_handle_interrupt(
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 36, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 90, in launch
return function(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 621, in _fit_impl
self._run(model, ckpt_path=self.ckpt_path)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1058, in _run
results = self._run_stage()
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1137, in _run_stage
self._run_train()
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1160, in _run_train
self.fit_loop.run()
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 214, in advance
batch_output = self.batch_loop.run(kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(optimizers, kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 200, in advance
result = self._run_optimization(kwargs, self._optimizers[self.optim_progress.optimizer_position])
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 247, in _run_optimization
self._optimizer_step(optimizer, opt_idx, kwargs.get("batch_idx", 0), closure)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 357, in _optimizer_step
self.trainer._call_lightning_module_hook(
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1302, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/module.py", line 1661, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 169, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/strategies/colossalai.py", line 368, in optimizer_step
return self.precision_plugin.optimizer_step(
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/colossalai.py", line 81, in optimizer_step
optimizer.step()
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossalai/zero/zero_optimizer.py", line 142, in step
ret = self.optim.step(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
return wrapped(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/torch/optim/optimizer.py", line 113, in wrapper
return func(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossalai/nn/optimizer/hybrid_adam.py", line 143, in step
multi_tensor_applier(self.gpu_adam_op, self._dummy_overflow_buf, [g_l, p_l, m_l, v_l], group['lr'],
File "/home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossalai/utils/multi_tensor_apply/multi_tensor_apply.py", line 35, in __call__
return op(self.chunk_size,
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from multi_tensor_apply at colossalai/kernel/cuda_native/csrc/multi_tensor_apply.cuh:111 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x3e (0x7f3e974e120e in /home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x21c67 (0x7f3e3abcfc67 in /home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #2: multi_tensor_adam_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float, float, float, float, int, int, int, float) + 0x2e9 (0x7f3e3abd0569 in /home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x1c211 (0x7f3e3abca211 in /home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1819c (0x7f3e3abc619c in /home/liuchaowei/anconda/envs/ldm/lib/python3.9/site-packages/colossal_C.cpython-39-x86_64-linux-gnu.so)