jbohnslav / deepethogram


Sequence training error: "RuntimeError: Expected floating point type for target with class probabilities, got Long" #88

Open bwee opened 2 years ago

bwee commented 2 years ago

Hi,

After training the feature extractor and running inference with it, I went to train the sequence model and got this error:

```
[2022-01-09 14:11:57,297] INFO [deepethogram.data.utils.make_loss_weight:114] Class counts: [180001 10138 211 19003 273 2372 3463 3592 47371]
[2022-01-09 14:11:57,297] INFO [deepethogram.data.utils.make_loss_weight:115] Pos weight: [0.48012511 25.27973959 1261.67298578 13.02010209 974.91208791 111.32040472 75.9344499 73.1714922 4.62420046]
[2022-01-09 14:11:57,298] INFO [deepethogram.data.utils.make_loss_weight:116] Pos weight, weighted: [0.6929106 5.027896 35.520035 3.6083379 31.223581 10.550849 8.714038 8.554033 2.1503954]
[2022-01-09 14:11:57,298] INFO [deepethogram.data.utils.make_loss_weight:117] Softmax weight: [0.00058057 0.01030814 0.49527939 0.00549934 0.38279835 0.04405731 0.03017729 0.02909353 0.00220607]
[2022-01-09 14:11:57,298] INFO [deepethogram.data.utils.make_loss_weight:118] Softmax weight transformed: [0.02409511 0.10152902 0.70376086 0.07415753 0.618707 0.20989834 0.17371611 0.17056824 0.04696887]
TGMJ(
  (input_dropout): Dropout(p=0.5, inplace=False)
  (output_dropout): Dropout(p=0.5, inplace=False)
  (tgm_layers): Sequential(
    (0): TGMLayer()
    (1): TGMLayer()
    (2): TGMLayer()
  )
  (h): Conv1d(1024, 128, kernel_size=(1,), stride=(1,))
  (h2): Conv1d(1024, 128, kernel_size=(1,), stride=(1,))
  (classify1): Sequential(
    (0): Conv1d(128, 9, kernel_size=(1,), stride=(1,), bias=False)
    (1): BatchNorm1d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
  (classify2): Sequential(
    (0): Conv1d(128, 9, kernel_size=(1,), stride=(1,), bias=False)
    (1): BatchNorm1d(9, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
)
[2022-01-09 14:11:57,427] WARNING [deepethogram.projects.get_weightfile_from_cfg:1075] no sequence weights found...
[2022-01-09 14:11:57,428] INFO [__main__.sequence_train:63] Total trainable params: 266,019
[2022-01-09 14:11:57,429] INFO [deepethogram.feature_extractor.train.get_metrics:631] key metric: f1_class_mean
[2022-01-09 14:11:57,432] INFO [deepethogram.losses.get_regularization_loss:188] Regularization: L2. alpha: 0.01
C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\utilities\distributed.py:50: DeprecationWarning: The setter for self.hparams in LightningModule is deprecated since v1.1.0 and will be removed in v1.3.0. Replace the assignment `self.hparams = hparams` with `self.save_hyperparameters()`.
  warnings.warn(*args, **kwargs)
[2022-01-09 14:11:57,440] INFO [deepethogram.base.__init__:89] scheduler mode: max
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[2022-01-09 14:11:57,456] INFO [deepethogram.base.get_trainer_from_cfg:318] max trials: 3
[2022-01-09 14:11:57,567] INFO [deepethogram.base.configure_optimizers:221] learning rate: 0.0001
Traceback (most recent call last):
  File "C:\Users\######\Anaconda3\envs\deg\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\######\Anaconda3\envs\deg\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\deepethogram\sequence\train.py", line 265, in <module>
    sequence_train(cfg)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\deepethogram\sequence\train.py", line 74, in sequence_train
    trainer = get_trainer_from_cfg(cfg, lightning_module, stopper)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\deepethogram\base.py", line 323, in get_trainer_from_cfg
    max_trials=max_trials)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\tuner\tuning.py", line 114, in scale_batch_size
    **fit_kwargs,
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\tuner\batch_size_scaling.py", line 109, in scale_batch_size
    new_size = _run_power_scaling(trainer, model, new_size, batch_arg_name, max_trials, **fit_kwargs)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\tuner\batch_size_scaling.py", line 183, in _run_power_scaling
    trainer.fit(model, **fit_kwargs)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 510, in fit
    results = self.accelerator_backend.train()
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 57, in train
    return self.train_or_test()
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 74, in train_or_test
    results = self.trainer.train()
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 561, in train
    self.train_loop.run_training_epoch()
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 550, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 718, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 493, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\core\lightning.py", line 1298, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\core\optimizer.py", line 286, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\core\optimizer.py", line 144, in __optimizer_step
    optimizer.step(closure=closure, *args, **kwargs)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\torch\optim\optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\torch\autograd\grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\torch\optim\adam.py", line 92, in step
    loss = closure()
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 713, in train_step_and_backward_closure
    self.trainer.hiddens
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 806, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 319, in training_step
    training_step_output = self.trainer.accelerator_backend.training_step(args)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 70, in training_step
    return self._step(self.trainer.model.training_step, args)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\pytorch_lightning\accelerators\gpu_accelerator.py", line 65, in _step
    output = model_step(*args)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\deepethogram\sequence\train.py", line 138, in training_step
    return self.common_step(batch, batch_idx, 'train')
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\deepethogram\sequence\train.py", line 118, in common_step
    loss, loss_dict = self.criterion(outputs, batch['labels'], self.model)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\deepethogram\feature_extractor\losses.py", line 207, in forward
    data_loss = self.data_criterion(outputs, label)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\torch\nn\modules\loss.py", line 1152, in forward
    label_smoothing=self.label_smoothing)
  File "C:\Users\######\Anaconda3\envs\deg\lib\site-packages\torch\nn\functional.py", line 2846, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected floating point type for target with class probabilities, got Long
[2022-01-09 14:12:04,867] INFO [deepethogram.gui.main.sequence_train:496] Training finished. If you see error messages above, training did not complete successfully.
[2022-01-09 14:12:04,867] INFO [deepethogram.gui.main.sequence_train:501] ~~~~~~~~~~~~~~~~~~~~
[2022-01-09 14:12:05,037] INFO [deepethogram.gui.main.project_loaded_buttons:173] Number finalized labels: 5
```

Not sure if there's an easy fix for this. Please let me know when you get a chance, and thanks!

bwee commented 2 years ago

Reinstalling everything fixed this issue but introduced a new one. Closing this and creating a new bug report.

bwee commented 2 years ago

Issue is back. Please let me know if you have any advice, thanks!

bwee commented 2 years ago

Issue persists. I have tried reinstalling essentially everything from scratch several times, and I've even used git to clone a previous DEG branch to see if that would fix the problem.

mmh513 commented 2 years ago

Hello, did you ever find a solution to this issue?

bwee commented 2 years ago

I was able to solve the issue by cloning an old version of DEG in its entirety from another computer. With that version, DEG works great with sigmoid final_activation for the feature_extractor and sequence models. With softmax (which is what we need, since our behaviors are mutually exclusive), the feature_extractor works fine but the sequence model throws a different error. Overall, though, this old version works. I think there's a conflict between the current version of Torch and the DEG code.
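For anyone comparing the two settings: sigmoid and softmax final_activation imply different losses with different target expectations, which is likely where the dtype error creeps in. A rough illustration in plain PyTorch (not DEG's actual loss code; shapes are simplified):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 9)                 # model outputs for 9 behaviors
labels = torch.randint(0, 9, (4,))
onehot = F.one_hot(labels, num_classes=9)  # dtype Long

# sigmoid path (multi-label): BCE-with-logits wants one float target per class
bce = F.binary_cross_entropy_with_logits(logits, onehot.float())

# softmax path (mutually exclusive): cross-entropy takes Long class indices,
# and accepts same-shape targets only if they are float
ce = F.cross_entropy(logits, labels)
```

Note that `cross_entropy` only gained the same-shape "class probabilities" target mode in PyTorch 1.10; older versions rejected such targets with a different error, which is consistent with the guess that the Torch version matters here.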

bwee commented 2 years ago

If this solution works for you, I can look into generating a yaml file with the working installation later today. Remind me if I don't respond.
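A minimal sketch of how that yaml could be generated with conda (the env name `deg` is taken from the paths in the traceback above; the output file name is just an example):

```
# export the known-good environment to a yaml file
conda env export --name deg > deg_environment.yaml

# recreate it on another machine
conda env create --file deg_environment.yaml
```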

mmh513 commented 2 years ago

Thanks for your response; this was helpful. Up until now I have never used any activation other than sigmoid; this is the first model I've used softmax for, so the error must relate to the softmax function. I will try re-training this model with sigmoid and see if it returns the same error.

bwee commented 2 years ago

It would be great to know whether sigmoid avoids the issue. Let me know.

jbohnslav commented 2 years ago

Can you try `pip install --upgrade deepethogram` and see if it's fixed?