tubo213 / kaggle-child-mind-institute-detect-sleep-states

MIT License

[BUG] CenterNet error #46

Closed by atamazian 10 months ago

atamazian commented 10 months ago

I'm getting the following error when I try to use CenterNet:

Epoch 0:   0% 0/119 [00:00<?, ?it/s]
Error executing job with overrides: []
Traceback (most recent call last):
  File "/content/kaggle-child-mind-institute-detect-sleep-states/run/train.py", line 73, in main
    trainer.fit(model, datamodule=datamodule)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1023, in _run_stage
    self.fit_loop.run()
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 355, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 133, in run
    self.advance(data_fetcher)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 219, in advance
    batch_output = self.automatic_optimization.run(trainer.optimizers[0], kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 188, in run
    self._optimizer_step(kwargs.get("batch_idx", 0), closure)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 266, in _optimizer_step
    call._call_lightning_module_hook(
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 146, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/core/module.py", line 1276, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/core/optimizer.py", line 161, in step
    step_output = self._strategy.optimizer_step(self._optimizer, closure, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 231, in optimizer_step
    return self.precision_plugin.optimizer_step(optimizer, model=model, closure=closure, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/plugins/precision/amp.py", line 76, in optimizer_step
    closure_result = closure()
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 142, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 128, in closure
    step_output = self._step_fn()
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/loops/optimization/automatic.py", line 315, in _training_step
    training_step_output = call._call_strategy_hook(trainer, "training_step", *kwargs.values())
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 294, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/pytorch_lightning/strategies/strategy.py", line 380, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/src/modelmodule.py", line 53, in training_step
    output = self.model(batch["feature"], batch["label"], do_mixup, do_cutmix)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/src/models/base.py", line 41, in forward
    loss = self.loss_fn(logits, labels)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/content/kaggle-child-mind-institute-detect-sleep-states/src/models/centernet.py", line 51, in forward
    nonzero_idx_onset = labels[:, 4].nonzero().view(-1)
IndexError: index 4 is out of bounds for dimension 1 with size 3

tubo213 commented 10 months ago

Please specify dataset=centernet

rye run python run/train.py model=CenterNet dataset=centernet
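
The traceback shows the loss indexing label channel 4 while the labels it received have only 3 channels (dimension 1 has size 3), which is why switching the dataset config resolves it. As a minimal sketch, a shape check like the one below could surface the mismatch with a clearer message. `check_label_channels` is a hypothetical helper, not part of this repository, and the example shapes are arbitrary:

```python
def check_label_channels(label_shape, max_channel_index):
    """Raise a descriptive error if dimension 1 of the labels is too small.

    label_shape: shape of the label tensor as a tuple, channels on dimension 1.
    max_channel_index: highest channel index the loss will read (assumed known).
    """
    n_channels = label_shape[1]
    if max_channel_index >= n_channels:
        raise ValueError(
            f"loss reads label channel {max_channel_index}, but labels have only "
            f"{n_channels} channels; was the run started with dataset=centernet?"
        )
    return n_channels


# Arbitrary example shape with enough channels: the check passes.
print(check_label_channels((1024, 6), 4))

# Shape matching the traceback (size 3 on dimension 1): the check fails early.
try:
    check_label_channels((1024, 3), 4)
except ValueError as err:
    print(err)
```

The channel count of 6 above is only an illustration; the actual number of label channels produced by the centernet dataset is not stated in this thread.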

atamazian commented 10 months ago

It works now, but losses are quite high:

Epoch 1: 100% 119/119 [03:01<00:00,  1.53s/it, v_num=4, val_loss=40.40, val_score=0.354, train_loss=1.06e+3]

Also I got:

Epoch 0:   0% 0/119 [00:00<?, ?it/s]
/content/kaggle-child-mind-institute-detect-sleep-states/.venv/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

atamazian commented 10 months ago

My model structure:

┏━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━┓
┃   ┃ Name                    ┃ Type           ┃ Params ┃
┡━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━┩
│ 0 │ model                   │ CenterNet      │ 23.7 M │
│ 1 │ model.feature_extractor │ CNNSpectrogram │ 13.0 K │
│ 2 │ model.encoder           │ Unet           │  6.3 M │
│ 3 │ model.decoder           │ UNet1DDecoder  │ 17.5 M │
│ 4 │ model.loss_fn           │ CenterNetLoss  │      0 │
└───┴─────────────────────────┴────────────────┴────────┘

tubo213 commented 10 months ago

@atamazian

That loss value is expected. The CenterNet loss is divided by the number of objects, so it comes out larger.
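
To illustrate the scale difference, here is a toy sketch. It assumes the per-element losses are summed and then divided by the object count, and compares that with a plain mean over all elements. The function names, the values, and the guard against zero objects are made up for illustration; this is not the repository's `CenterNetLoss`:

```python
def mean_over_elements(per_element_losses):
    # Conventional reduction: average over every element in the window.
    return sum(per_element_losses) / len(per_element_losses)


def sum_over_objects(per_element_losses, num_objects):
    # Assumed CenterNet-style reduction: total loss divided by the object count.
    # max(..., 1) is an assumed guard for windows that contain no objects.
    return sum(per_element_losses) / max(num_objects, 1)


losses = [0.5] * 1000  # 1000 time steps, each contributing a loss of 0.5
num_objects = 2        # only two events in the window

print(mean_over_elements(losses))             # 0.5
print(sum_over_objects(losses, num_objects))  # 250.0
```

With few objects and many time steps the second number is naturally orders of magnitude larger, so it is not directly comparable with losses from the other models.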

tubo213 commented 10 months ago

@atamazian Please post a summary in the Discussion, as exchanges on GitHub may fall under the category of private sharing.

atamazian commented 10 months ago

Done