Closed jianpingliu closed 2 years ago
It looks like the call to main.py
may be wrong, specifically this bit:
lightning.trainer.accumulate_grad_batches=1
The value should be an integer > 0.
Thanks! It really helped.
Rather, I think the issue is that we're passing accumulate_grad_batches=None
to Trainer.from_argparse_args.
I fixed the above error by adding:
diff --git a/main.py b/main.py
index b21a775..c2a6e2f 100644
--- a/main.py
+++ b/main.py
@@ -835,6 +835,7 @@ if __name__ == "__main__":
from pytorch_lightning.trainer.connectors.checkpoint_connector import CheckpointConnector
setattr(CheckpointConnector, "hpc_resume_path", None)
+ trainer_kwargs['accumulate_grad_batches'] = 1
trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)
trainer.logdir = logdir ###
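A more general guard than hard-coding the value (a sketch, not the repo's actual code; the kwargs shown are hypothetical) is to drop None-valued entries from trainer_kwargs before handing them to Lightning, so the Trainer's own defaults apply:

```python
def drop_none_kwargs(kwargs):
    """Return a copy of kwargs without None-valued entries, so that
    Trainer's own defaults (e.g. accumulate_grad_batches=1) apply."""
    return {k: v for k, v in kwargs.items() if v is not None}

# Hypothetical example: 'accumulate_grad_batches' arrives as None
# from an unset CLI override and would otherwise reach Trainer as-is.
trainer_kwargs = {"max_epochs": 10, "accumulate_grad_batches": None}
cleaned = drop_none_kwargs(trainer_kwargs)
# 'accumulate_grad_batches' is gone; Trainer falls back to its default.
```

This avoids pinning a specific value and fixes every kwarg that might arrive as None, at the cost of assuming None always means "unset".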
I followed the fine-tuning instructions and got this error:
Merged modelckpt-cfg: {'target': 'pytorch_lightning.callbacks.ModelCheckpoint', 'params': {'dirpath': 'logs/2022-10-05T06-00-26_pokemon/checkpoints', 'filename': '{epoch:06}', 'verbose': True, 'save_last': True, 'monitor': None, 'save_top_k': -1, 'every_n_train_steps': 2000}}
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/callbacks/model_checkpoint.py:433: UserWarning: ModelCheckpoint(save_last=True, save_top_k=None, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
  "ModelCheckpoint(save_last=True, save_top_k=None, monitor=None) is a redundant configuration."
ModelCheckpoint(save_last=True, save_top_k=-1, monitor=None) will duplicate the last checkpoint saved.
Traceback (most recent call last):
  File "main.py", line 812, in <module>
    trainer = Trainer.from_argparse_args(trainer_opt, **trainer_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/properties.py", line 421, in from_argparse_args
    return from_argparse_args(cls, args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/argparse.py", line 52, in from_argparse_args
    return cls(**trainer_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 40, in insert_env_defaults
    return fn(self, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 446, in __init__
    terminate_on_nan,
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/training_trick_connector.py", line 50, in on_trainer_init
    self.configure_accumulated_gradients(accumulate_grad_batches)
  File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/training_trick_connector.py", line 66, in configure_accumulated_gradients
    raise TypeError("Gradient accumulation supports only int and dict types")
TypeError: Gradient accumulation supports only int and dict types
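The TypeError at the bottom of the traceback comes from Lightning's type check on accumulate_grad_batches. A minimal sketch of that check (modeled on the frame in training_trick_connector.py above, not the exact source) shows why a None value is rejected:

```python
def configure_accumulated_gradients(accumulate_grad_batches):
    # Only an int (a fixed accumulation factor) or a dict
    # (mapping epoch -> factor) is accepted. A None value, as
    # produced by an unset CLI override, falls through to the
    # TypeError seen in the traceback.
    if isinstance(accumulate_grad_batches, (int, dict)):
        return accumulate_grad_batches
    raise TypeError("Gradient accumulation supports only int and dict types")
```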
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "main.py", line 909, in <module>
    if trainer.global_rank == 0:
NameError: name 'trainer' is not defined
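This second NameError is a follow-on failure, not a separate bug: trainer is only assigned inside main.py's try block, so when Trainer construction raises, the cleanup code at line 909 references a name that was never bound. A hedged sketch of the pattern (the names and the pre-binding fix are illustrative, not main.py's actual code):

```python
trainer = None  # pre-bind so the cleanup below cannot hit a NameError
construction_error = None
try:
    # Stands in for the failing Trainer.from_argparse_args(...) call.
    raise TypeError("Gradient accumulation supports only int and dict types")
except TypeError as err:
    construction_error = err
finally:
    # Mirrors main.py's rank-zero cleanup. Without the pre-binding
    # above, touching 'trainer' here would raise
    # NameError: name 'trainer' is not defined.
    if trainer is not None and trainer.global_rank == 0:
        print("rank-zero cleanup")
```

Pre-binding (or wrapping the cleanup in an existence check) only hides the secondary error; the real fix is still making accumulate_grad_batches an int as shown in the diff.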