nghorbani / human_body_prior

VPoser: Variational Human Pose Prior
https://smpl-x.is.tue.mpg.de/

RuntimeError: Early stopping conditioned on metric `val_loss` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: `` #52

Open Drow999 opened 2 years ago

Drow999 commented 2 years ago

Hello, I'm trying to train VPoser with my own training and validation datasets, but it always reports that val_loss is not available. I guessed it might be caused by the small validation dataset, but after I reduced the batch size, the error still occurs. I found that some versions of pytorch-lightning might have this issue. Could you please tell me which version you use, and give me some advice if you have run into this issue as well?

Epoch 0: 88%|████████▊ | 15/17 [00:00<00:00, 37.29it/s, loss=89.1, v_num=29]
Validating: 0it [00:00, ?it/s]
Validating: 0%| | 0/2 [00:00<?, ?it/s]
{'weighted_loss': {'loss_kl': tensor(0.0516, device='cuda:0'), 'loss_mesh_rec': tensor(81.0408, device='cuda:0'), 'matrot': tensor(3.5944, device='cuda:0'), 'loss_total': tensor(84.6868, device='cuda:0')}, 'unweighted_loss': {'v2v': tensor(55.1946, device='cuda:0'), 'loss_total': tensor([55.1946], device='cuda:0')}}
{'weighted_loss': {'loss_kl': tensor(0.0580, device='cuda:0'), 'loss_mesh_rec': tensor(86.9938, device='cuda:0'), 'matrot': tensor(3.5297, device='cuda:0'), 'loss_total': tensor(90.5815, device='cuda:0')}, 'unweighted_loss': {'v2v': tensor(59.5597, device='cuda:0'), 'loss_total': tensor([59.5597], device='cuda:0')}}
[1] -- Epoch 0: val_loss:57.38
[1] -- lr is [0.001]
Traceback (most recent call last):
  File "/home/drow/human_body_prior/src/train.py", line 54, in main()
  File "/home/drow/human_body_prior/src/train.py", line 50, in main
    train_vposer_once(job)
  File "/home/drow/human_body_prior/src/human_body_prior/train/vposer_trainer.py", line 351, in train_vposer_once
    trainer.fit(model)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 917, in _run
    self._dispatch()
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 995, in run_stage
    return self._run_train()
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in _run_train
    self.fit_loop.run()
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 118, in run
    output = self.on_run_end()
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 235, in on_run_end
    self._on_train_epoch_end_hook(processed_outputs)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 276, in _on_train_epoch_end_hook
    trainer_hook(processed_epoch_output)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 109, in on_train_epoch_end
    callback.on_train_epoch_end(self, self.lightning_module)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 170, in on_train_epoch_end
    self._run_early_stopping_check(trainer)
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 185, in _run_early_stopping_check
    logs
  File "/home/drow/anaconda3/envs/vae/lib/python3.7/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 134, in _validate_condition_metric
    raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric val_loss which is not available. Pass in or modify your EarlyStopping callback to use any of the following: ``
Epoch 0: 100%|██████████| 17/17 [00:00<00:00, 35.66it/s, loss=89.1, v_num=29]
Epoch 0: 100%|██████████| 17/17 [00:00<00:00, 32.31it/s, loss=89.1, v_num=29]

oomq commented 2 years ago

I'm running into the same problem.
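
One possible reading of the traceback in this thread, offered as an assumption rather than a confirmed diagnosis: the error message ends with an empty pair of backticks, which suggests the callback saw no registered metrics at all, even though `val_loss:57.38` was printed to the console. The sketch below is a plain-Python imitation of that kind of check, not Lightning's actual code; `check_monitor`, `printed_only`, and `registered` are hypothetical names used only for illustration.

```python
def check_monitor(monitor, callback_metrics):
    """Hypothetical stand-in for an early-stopping metric lookup.

    Raises if the monitored name is missing from the registered metrics,
    listing whatever names are available (possibly none).
    """
    if monitor not in callback_metrics:
        available = "`, `".join(callback_metrics)
        raise RuntimeError(
            f"Early stopping conditioned on metric `{monitor}` which is not available. "
            f"Pass in or modify your `EarlyStopping` callback to use any of the following: `{available}`"
        )
    return callback_metrics[monitor]


# Case 1 (assumed to match the report): the loss was only printed,
# so the dictionary of registered metrics is empty.
printed_only = {}
try:
    check_monitor("val_loss", printed_only)
    message = ""
except RuntimeError as err:
    message = str(err)

# Case 2: the loss was registered under the monitored name, so the lookup succeeds.
registered = {"val_loss": 57.38}
value = check_monitor("val_loss", registered)
```

If this reading applies, the thing to check would be whether the validation code registers the value under the monitored name (in Lightning, typically through `self.log("val_loss", ...)`) rather than only printing it. Nothing in the log confirms this, so treat it as a starting point for debugging.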