justinpinkney / stable-diffusion


TypeError: Gradient accumulation supports only int and dict types #24

Closed jianpingliu closed 2 years ago

jianpingliu commented 2 years ago

I followed the fine-tuning instructions and got this error:

Merged modelckpt-cfg: {'target': 'pytorch_lightning.callbacks.ModelCheckpoint', 'params': {'dirpath': 'logs/2022-10-05T06-00-26_pokemon/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': None, 'save_top_k': -1, 'every_n_train_steps': 2000}}

/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:433: UserWarning: ModelCheckpoint(save_last=True, save_top_k=None, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
  "ModelCheckpoint(save_last=True, save_top_k=None, monitor=None) is a redundant configuration."
ModelCheckpoint(save_last=True, save_top_k=-1, monitor=None) will duplicate the last checkpoint saved.

Traceback (most recent call last):
  File "main.py", line 812, in <module>
    trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/properties.py", line 421, in from_argparse_args
    return from_argparse_args(cls, args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/argparse.py", line 52, in from_argparse_args
    return cls(**trainer_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 40, in insert_env_defaults
    return fn(self, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 446, in __init__
    terminate_on_nan,
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/training_trick_connector.py", line 50, in on_trainer_init
    self.configure_accumulated_gradients(accumulate_grad_batches)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/training_trick_connector.py", line 66, in configure_accumulated_gradients
    raise TypeError("Gradient accumulation supports only int and dict types")
TypeError: Gradient accumulation supports only int and dict types

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 909, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined
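The second traceback looks like it's just fallout from the first: main.py references trainer inside an exception handler, but the TypeError was raised before trainer was ever assigned. A minimal self-contained sketch of that pattern (the structure here is assumed from the traceback, not copied from main.py):

# sketch.py -- reproduces the "masked" NameError shown above
def build_trainer(accumulate_grad_batches):
    # stand-in for Trainer.from_argparse_args: pytorch-lightning only
    # accepts an int or a dict for gradient accumulation
    if not isinstance(accumulate_grad_batches, (int, dict)):
        raise TypeError("Gradient accumulation supports only int and dict types")
    return accumulate_grad_batches

try:
    trainer = build_trainer(None)  # raises before `trainer` is bound
except Exception:
    if trainer.global_rank == 0:   # NameError: name 'trainer' is not defined
        pass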

justinpinkney commented 2 years ago

It looks like the call to main.py is wrong, specifically this bit:

lightning.trainer.accumulate_grad_batches=1

This should be set to an integer > 0.
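For reference, roughly what the invocation might look like with that override given an explicit value (the other flags and the config path here are assumptions about the usual fine-tuning setup, not copied from this thread):

python main.py \
    -t \
    --base configs/stable-diffusion/pokemon.yaml \
    --gpus 0, \
    lightning.trainer.accumulate_grad_batches=1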

jianpingliu commented 2 years ago

Thanks! It really helped.

raphaelmerx commented 1 year ago

Rather, I think the issue is that we're passing accumulate_grad_batches=None to Trainer.from_argparse_args. I fixed the above error by adding:

diff --git a/main.py b/main.py
index b21a775..c2a6e2f 100644
--- a/main.py
+++ b/main.py
@@ -835,6 +835,7 @@ if __name__ == "__main__":
             from pytorch_lightning.trainer.connectors.checkpoint_connector import CheckpointConnector
             setattr(CheckpointConnector, "hpc_resume_path", None)

+        trainer_kwargs['accumulate_grad_batches'] = 1
         trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)
         trainer.logdir = logdir  ###
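A slightly more defensive variant of the same idea (just a sketch, assuming trainer_opt is the argparse Namespace built from the lightning.trainer config, as in the surrounding code) would keep any value that did come through and fall back to PyTorch Lightning's default of 1 only when it arrives as None:

# hypothetical alternative to hardcoding 1: respect an explicit setting
accumulate = getattr(trainer_opt, 'accumulate_grad_batches', None)
trainer_kwargs['accumulate_grad_batches'] = accumulate if accumulate is not None else 1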