huggingface / jat

General multi-task deep RL Agent
Apache License 2.0

Training: Error while saving checkpoint during Training (via save steps) #172

Open drdsgvo opened 4 months ago

drdsgvo commented 4 months ago

Environment: transformers 4.41.0, Ubuntu 22.0

Calling the training script scripts/train_jat_tokenized.py as given (with --per_device_train_batch_size 1 and a single GPU), the following error occurs when the trainer tries to save the first checkpoint:

From trainer.train(..) in the above script, end of traceback:

```
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2291, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, grad_norm, model, trial, epoch, ignore_keys_for_eval)
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2732, in _maybe_log_save_evaluate
    self._save_checkpoint(model, trial, metrics=metrics)
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 2811, in _save_checkpoint
    self.save_model(output_dir, _internal_call=True)
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3355, in save_model
    self._save(output_dir)
File "/home/km/.local/lib/python3.10/site-packages/transformers/trainer.py", line 3432, in _save
    self.model.save_pretrained(
File "/home/km/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2574, in save_pretrained
    raise RuntimeError(
RuntimeError: The weights trying to be saved contained shared tensors [{'transformer.wte.weight', 'single_discrete_encoder.weight', 'multi_discrete_encoder.0.weight'}] that are mismatching the transformers base configuration. Try saving using `safe_serialization=False` or remove this tensor sharing.
```

The error occurs both with `accelerate launch` and without it (plain `python`).
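Following the hint in the error message, one possible workaround is to disable safetensors serialization for checkpoints, so the Trainer falls back to torch's pickle-based format, which tolerates shared (tied) tensors. A minimal sketch, assuming an otherwise unchanged training setup; the `output_dir` and `save_steps` values here are illustrative placeholders, not from the report:

```python
from transformers import TrainingArguments

# save_safetensors=False makes Trainer checkpoints use torch.save
# (pytorch_model.bin) instead of safetensors, which refuses to
# serialize tensors that share storage, such as tied embeddings.
args = TrainingArguments(
    output_dir="checkpoints",        # placeholder
    per_device_train_batch_size=1,   # as in the report
    save_steps=500,                  # placeholder checkpoint frequency
    save_safetensors=False,          # sidesteps the shared-tensor error
)
```

Note this only sidesteps the save failure; it does not resolve the underlying tensor sharing between `transformer.wte.weight` and the encoder weights that the error reports.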