txie-93 / cdvae

An SE(3)-invariant autoencoder for generating the periodic structure of materials [ICLR 2022]
MIT License

pytorch_lightning.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss #43

Open HarshaSatyavardhan opened 1 year ago

HarshaSatyavardhan commented 1 year ago
  | Name           | Type                | Params
-------------------------------------------------------
0 | encoder        | DimeNetPlusPlusWrap | 2.2 M 
1 | decoder        | GemNetTDecoder      | 2.3 M 
2 | fc_mu          | Linear              | 65.8 K
3 | fc_var         | Linear              | 65.8 K
4 | fc_num_atoms   | Sequential          | 71.2 K
5 | fc_lattice     | Sequential          | 67.3 K
6 | fc_composition | Sequential          | 91.5 K
-------------------------------------------------------
4.9 M     Trainable params
123       Non-trainable params
4.9 M     Total params
19.682    Total estimated model params size (MB)
/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:631: UserWarning: Checkpoint directory /scratch/harsha.vasamsetti/hydra/singlerun/2023-05-26/perov exists and is not empty.
  rank_zero_warn(f"Checkpoint directory {dirpath} exists and is not empty.")
Validation sanity check: 0it [00:00, ?it/s]/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/torch_geometric/deprecation.py:13: UserWarning: 'data.DataLoader' is deprecated, use 'loader.DataLoader' instead
  warnings.warn(out)
/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:116: UserWarning: The dataloader, val_dataloader 0, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 40 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Validation sanity check:   0%|                                                                | 0/2 [00:00<?, ?it/s]/scratch/harsha.vasamsetti/cdvae/cdvae/common/data_utils.py:622: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  X = torch.tensor(X, dtype=torch.float)
/scratch/harsha.vasamsetti/cdvae/cdvae/common/data_utils.py:618: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  X = torch.tensor(X, dtype=torch.float)
/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/utilities/data.py:59: UserWarning: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 10. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:116: UserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 40 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/data_loading.py:412: UserWarning: The number of training samples (23) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Epoch 0: 100%|█| 23/23 [00:09<00:00,  2.53it/s, loss=91.2, v_num=t2wv, train_loss_step=80.70, train_natom_loss_step=Error executing job with overrides: ['data=perov', 'expname=perov']
Traceback (most recent call last):
  File "cdvae/run.py", line 167, in main
    run(cfg)
  File "cdvae/run.py", line 155, in run
    trainer.fit(model=model, datamodule=datamodule)
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 738, in fit
    self._call_and_handle_interrupt(
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 683, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 773, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
    self._dispatch()
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in run_stage
    return self._run_train()
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1315, in _run_train
    self.fit_loop.run()
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 151, in run
    output = self.on_run_end()
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 303, in on_run_end
    self.update_lr_schedulers("epoch", update_plateau_schedulers=True)
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 441, in update_lr_schedulers
    self._update_learning_rates(
  File "/home2/harsha.vasamsetti/miniconda3/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 486, in _update_learning_rates
    raise MisconfigurationException(
pytorch_lightning.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric val_loss which is not available. Available metrics are: ['train_loss', 'train_loss_step', 'train_natom_loss', 'train_natom_loss_step', 'train_lattice_loss', 'train_lattice_loss_step', 'train_coord_loss', 'train_coord_loss_step', 'train_type_loss', 'train_type_loss_step', 'train_kld_loss', 'train_kld_loss_step', 'train_composition_loss', 'train_composition_loss_step', 'train_loss_epoch', 'train_natom_loss_epoch', 'train_lattice_loss_epoch', 'train_coord_loss_epoch', 'train_type_loss_epoch', 'train_kld_loss_epoch', 'train_composition_loss_epoch']. Condition can be set using `monitor` key in lr scheduler dict

I am receiving this error when training the model.

zhuccly commented 10 months ago

Hi, I have the same issue. Did you solve it?

confymacs commented 7 months ago

Hi, I changed the `strict` parameter in the scheduler config to `False` (it defaults to `True`), which solved the problem. Here is how I modified the `configure_optimizers` function:

def configure_optimizers(self):
    opt = hydra.utils.instantiate(
        self.hparams.optim.optimizer, params=self.parameters(), _convert_="partial"
    )
    if not self.hparams.optim.use_lr_scheduler:
        return [opt]
    scheduler = hydra.utils.instantiate(
        self.hparams.optim.lr_scheduler, optimizer=opt
    )
    return {
        "optimizer": opt,
        "lr_scheduler": {
            "scheduler": scheduler,
            "monitor": "val_loss",
            # strict=False: skip the scheduler step instead of raising
            # when the monitored metric is missing for an epoch
            "strict": False,
        },
    }
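For reference, the effect of `strict` can be sketched in plain Python. This mirrors the behavior of Lightning's metric check that raises the `MisconfigurationException` above; it is not Lightning's actual code, and the function name is hypothetical:

```python
def update_plateau_scheduler(available_metrics, monitor, strict=True):
    """Sketch of how Lightning decides whether to step a plateau scheduler.

    available_metrics: dict of metric name -> value logged so far this epoch.
    monitor: the metric name the scheduler is conditioned on.
    strict: if True, a missing monitored metric is a hard error; if False,
            the scheduler update for this epoch is silently skipped.
    """
    if monitor not in available_metrics:
        if strict:
            raise RuntimeError(
                f"ReduceLROnPlateau conditioned on metric {monitor} "
                "which is not available."
            )
        return "skipped"  # strict=False: no scheduler step this epoch
    return f"stepped on {available_metrics[monitor]}"
```

Note that `strict=False` only suppresses the error: the scheduler silently skips every epoch in which `val_loss` is missing. The cleaner fix is to make sure `val_loss` is actually produced, i.e. that the validation loop runs and logs it via `self.log("val_loss", ...)` in `validation_step`.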