plsuarez closed 2 months ago
Sorry, I can't reproduce this issue. However, checking the input and output of the function get_input(self, batch, k) in ecdm_first_stage.py (lines 528-534) may help.
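One way to inspect what get_input receives is a small helper that fails early with a readable message instead of letting a None value reach forward. This is only a sketch: describe_batch_entry is a hypothetical name, not a function in this repository, and it assumes the batch is a dict whose values expose a .shape attribute (as torch tensors and NumPy arrays do).

```python
def describe_batch_entry(batch, k):
    """Summarize batch[k], raising early if the entry is missing or None.

    Hypothetical debugging helper; works with any value that has `.shape`.
    """
    if k not in batch:
        raise KeyError(f"batch has no key {k!r}; available keys: {list(batch)}")
    x = batch[k]
    if x is None:
        raise ValueError(
            f"batch[{k!r}] is None; check the dataset's __getitem__ "
            "and the collate function"
        )
    shape = getattr(x, "shape", None)
    shape_text = tuple(shape) if shape is not None else "n/a"
    return f"{k}: type={type(x).__name__}, shape={shape_text}"
```

Calling it at the top of get_input (for example, print(describe_batch_entry(batch, k))) would show which key arrives empty before the 'NoneType' object has no attribute 'shape' error is raised further down.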
Yes, I checked, but the code does not validate NoneType objects. I also had to change what your dataloader returns: the model expects 6 channels, but the code converts the RGB image to grayscale, so the model was receiving fewer channels than expected. Could you please update your code to a version that works properly?
I am trying to run this command: python main.py fit -c configs/base_config.yaml -c configs/ecdm_first_stage.yaml --trainer.devices 0,1
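The channel mismatch described above can be caught before the model is called. The sketch below is an assumption-laden illustration: check_channels is a hypothetical helper, the channel axis is assumed to be 1 (N, C, H, W layout), and 6 is the channel count the reporter says the model expects.

```python
def check_channels(x, expected_channels=6, channel_axis=1):
    """Raise a descriptive error if `x` has the wrong number of channels.

    `x` can be any object with a `.shape` (torch tensor, NumPy array).
    The default axis assumes an (N, C, H, W) layout, which may not match
    the layout used by this repository's dataloader.
    """
    shape = tuple(x.shape)
    actual = shape[channel_axis]
    if actual != expected_channels:
        raise ValueError(
            f"expected {expected_channels} channels on axis {channel_axis}, "
            f"got {actual} (shape {shape}); an RGB-to-grayscale conversion "
            "in the dataloader would reduce the channel count"
        )
    return actual
```

A check like this turns a shape error deep inside the network into a message that points at the dataloader.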
I have updated the code for the channel mismatch issue.
Thanks! But now I get this error:
File "main.py", line 139, in ModelCheckpoint(monitor='val/loss_simple_ema')
could not find the monitored key in the returned metrics: ['train/loss', 'train/loss_step', 'global_step', 'lr_abs', 'val/loss', 'val/loss_ema', 'train/loss_epoch', 'epoch', 'step']. HINT: Did you call log('val/loss_simple_ema', value) in the LightningModule?
raise MisconfigurationException(m)
lightning.fabric.utilities.exceptions.MisconfigurationException: ModelCheckpoint(monitor='val/loss_simple_ema')
could not find the monitored key in the returned metrics: ['train/loss', 'train/loss_step', 'global_step', 'lr_abs', 'val/loss', 'val/loss_ema', 'train/loss_epoch', 'epoch', 'step']. HINT: Did you call log('val/loss_simple_ema', value) in the LightningModule?
Epoch 0: 100%|██████████| 6013/6013 [7:05:19<00:00, 0.24it/s, v_num=0, train/loss_step=0.00798, global_step=6012.0, lr_abs=4.5e-6, train/loss_epoch=0.0456]
Could you please check what happened?
Thanks in advance...
I have pushed a commit to fix this bug.
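The MisconfigurationException reported above is raised when the key passed to ModelCheckpoint(monitor=...) is not among the logged metrics. As a rough illustration (not the fix that was committed), a plain-Python check can compare the monitor key with the logged keys and suggest near matches; suggest_monitor_keys is a hypothetical helper and uses only the standard library.

```python
import difflib


def suggest_monitor_keys(monitor, logged_metrics):
    """Return (found, suggestions) for a checkpoint monitor key.

    `found` is True when `monitor` is one of the logged metric names;
    otherwise `suggestions` lists similarly named logged keys.
    """
    names = list(logged_metrics)
    if monitor in names:
        return True, []
    return False, difflib.get_close_matches(monitor, names, n=3, cutoff=0.6)


# Metric names copied from the error message in this thread.
logged = ['train/loss', 'train/loss_step', 'global_step', 'lr_abs',
          'val/loss', 'val/loss_ema', 'train/loss_epoch', 'epoch', 'step']
found, suggestions = suggest_monitor_keys('val/loss_simple_ema', logged)
```

With the metrics from this thread, the monitored key is missing while a similarly named key, val/loss_ema, is logged. Either logging val/loss_simple_ema in the LightningModule or monitoring a key that is actually logged would avoid the exception; the thread does not say which approach the commit took.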
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 140, in run
    self.advance(data_fetcher)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 250, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], batch_idx, kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 190, in run
    self._optimizer_step(batch_idx, closure)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 268, in _optimizer_step
    call._call_lightning_module_hook(
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/core/module.py", line 1303, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/core/optimizer.py", line 152, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/strategies/ddp.py", line 270, in optimizer_step
    optimizer_output = super().optimizer_step(optimizer, closure, model, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/strategies/strategy.py", line 239, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/plugins/precision/precision.py", line 122, in optimizer_step
    return optimizer.step(closure=closure, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/optim/adamw.py", line 148, in step
    loss = closure()
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/plugins/precision/precision.py", line 108, in _wrap_closure
    closure_result = closure()
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 144, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 129, in closure
    step_output = self._step_fn()
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/loops/optimization/automatic.py", line 318, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/strategies/strategy.py", line 390, in training_step
    return self._forward_redirection(self.model, self.lightning_module, "training_step", *args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/strategies/strategy.py", line 642, in __call__
    wrapper_output = wrapper_module(*args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/anaconda3/lib/python3.8/site-packages/lightning/pytorch/strategies/strategy.py", line 635, in wrapped_forward
    out = method(*_args, **_kwargs)
  File "/mnt/ECDM/ecdm/models/diffusion/ecdm_first_stage.py", line 581, in training_step
    loss, loss_dict = self.shared_step(batch)
  File "/mnt/ECDM/ecdm/models/diffusion/ecdm_first_stage.py", line 577, in shared_step
    loss, loss_dict = self(x, cond=cond)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/ECDM/ecdm/models/diffusion/ecdm_first_stage.py", line 524, in forward
    0, self.num_timesteps, (x.shape[0],), device=self.device
AttributeError: 'NoneType' object has no attribute 'shape'
Epoch 0: 0%| | 0/6013 [00:01<?, ?it/s]