jbohnslav / deepethogram


Sequence training error: ZeroDivisionError: float division by zero #92

Closed · elizabethchiyka closed this issue 2 years ago

elizabethchiyka commented 2 years ago

Training ran with no issues before this point. Does anyone have suggestions on what the cause could be, or things to try?

[2022-02-03 14:35:01,039] WARNING [deepethogram.projects.get_weightfile_from_cfg:1075] no sequence weights found...
[2022-02-03 14:35:01,039] INFO [__main__.sequence_train:63] Total trainable params: 267,573
[2022-02-03 14:35:01,039] INFO [deepethogram.feature_extractor.train.get_metrics:631] key metric: f1_class_mean
[2022-02-03 14:35:01,055] INFO [deepethogram.feature_extractor.losses.__init__:96] Focal loss: gamma 1.00 smoothing: 0.05
[2022-02-03 14:35:01,055] INFO [deepethogram.losses.get_regularization_loss:188] Regularization: L2. alpha: 0.01
[2022-02-03 14:35:01,055] INFO [deepethogram.base.__init__:89] scheduler mode: max
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[2022-02-03 14:35:01,195] INFO [deepethogram.base.configure_optimizers:221] learning rate: 0.0001

  | Name       | Type               | Params
--------------------------------------------------
0 | model      | TGMJ               | 267 K
1 | activation | Sigmoid            | 0
2 | criterion  | ClassificationLoss | 0
--------------------------------------------------
267 K     Trainable params
0         Non-trainable params
267 K     Total params
Epoch 0:   0%|                                                                                  | 0/24 [00:00<?, ?it/s]
[2022-01-31 18:45:15,152] INFO [deepethogram.gui.main.log_idle:149] User has been idle for 60.0 seconds...
Epoch 0: 100%|█████████████████████████████████████████████████| 24/24 [00:42<00:00,  1.76s/it, loss=1.19e+04, v_num=0]
Validating: 100%|████████████████████████████████████████████████████████████████████████| 8/8 [00:27<00:00,  1.13s/it]
Epoch 1: 100%|█████████████████████████████████████████████████| 32/32 [01:16<00:00,  2.39s/it, loss=7.63e+03, v_num=0]
Validating:  88%|███████████████████████████████████████████████████████████████         | 7/8 [00:24<00:01,  1.04s/it]
Epoch 2: 100%|██████████████████████████████████████████████████| 32/32 [01:16<00:00,  2.39s/it, loss=6.8e+03, v_num=0]
Saving latest checkpoint...
Epoch 2: 100%|██████████████████████████████████████████████████| 32/32 [01:17<00:00,  2.41s/it, loss=6.8e+03, v_num=0]
Traceback (most recent call last):
  File "C:\TOOLS\Anaconda3\envs\deg\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\TOOLS\Anaconda3\envs\deg\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\deepethogram\sequence\train.py", line 265, in <module>
    sequence_train(cfg)
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\deepethogram\sequence\train.py", line 75, in sequence_train
    trainer.fit(lightning_module)
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 57, in train
    return self.train_or_test()
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 625, in run_training_epoch
    self.trainer.run_evaluation(on_epoch=True)
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 647, in run_evaluation
    self.evaluation_loop.on_evaluation_batch_end(output, batch, batch_idx, dataloader_idx)
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\evaluation_loop.py", line 307, in on_evaluation_batch_end
    self.trainer.call_hook('on_validation_batch_end', output, batch, batch_idx, dataloader_idx)
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 926, in call_hook
    trainer_hook(*args, **kwargs)
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\callback_hook.py", line 157, in on_validation_batch_end
    callback.on_validation_batch_end(self, self.get_model(), outputs, batch, batch_idx, dataloader_idx)
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\deepethogram\callbacks.py", line 101, in on_validation_batch_end
    self.end_batch('val', batch, pl_module)
  File "C:\TOOLS\Anaconda3\envs\deg\lib\site-packages\deepethogram\callbacks.py", line 87, in end_batch
    fps = n_images / elapsed
ZeroDivisionError: float division by zero
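For context, the failing line (`callbacks.py` line 87, in `end_batch`) computes frames per second from a wall-clock delta. One plausible way for `elapsed` to be exactly zero on Windows is clock granularity: `time.time()` has historically ticked at roughly 15.6 ms there, so two readings taken within one tick compare equal. A minimal sketch of that failure mode with simulated timestamps (the variable names mirror the traceback; the timing details are an assumption, not the library's exact code):

```python
# Simulate two time.time() readings that land in the same clock tick,
# as can happen on Windows when a validation batch finishes very quickly.
start = 1643917500.0
end = start               # identical reading -> zero wall-clock delta
elapsed = end - start     # 0.0

n_images = 32             # illustrative batch size
fps = n_images / elapsed  # ZeroDivisionError: float division by zero
```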
bwee commented 2 years ago

Was this during the sequence model training step? Are you using softmax as the final_activation?

elizabethchiyka commented 2 years ago

Yes, it was during sequence model training, and no, I was not using softmax final_activation. I edited the original post to add the very top of the output, including the sequence-weights warning; I'm not sure whether that is the root of the issue.

jbohnslav commented 2 years ago

This is odd, because elapsed should never be 0. Maybe the batch didn't actually run properly? I added an epsilon in the bug_feb2022 branch; it should merge in the next few days.
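A minimal sketch of the epsilon guard described above (the helper name and epsilon value are illustrative; see the bug_feb2022 branch for the actual change):

```python
EPS = 1e-7  # illustrative epsilon; prevents a zero wall-clock delta from crashing

def fps_from_elapsed(n_images: int, elapsed: float, eps: float = EPS) -> float:
    """Compute frames per second without risking float division by zero."""
    return n_images / (elapsed + eps)

# Even a zero delta now yields a large but finite value instead of raising.
print(fps_from_elapsed(32, 0.0))
```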

jbohnslav commented 2 years ago

Should be fixed with e2df196; re-open if it doesn't fix it for you. To pick up the fix: pip install --upgrade deepethogram