sstoia opened this issue 7 months ago
cc @muellerzr @pacman100
Sorry for the delay, will be looking into it over this week!
I'm running into the same issue. Any updates on this please?
Gentle ping @muellerzr @pacman100
Another ping @muellerzr @pacman100
Running into the same issue, using the latest version of transformers (4.40.1) and Python 3.11.
Having the same issue with my Trainer subclass when doing HPO with DDP and optuna.
Gentle ping @muellerzr
Same issue here. Also trying to run hyperparameter search with DDP (accelerate launch) using Trainer and Optuna as the backend.
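A minimal sketch of the setup (the checkpoint, dataset, and search space below are illustrative placeholders, not my exact script), launched with `accelerate launch run_hpo.py` on several GPUs:

```python
# Minimal sketch of the failing setup -- checkpoint, dataset, and search
# space are illustrative placeholders.
# Launched with: accelerate launch run_hpo.py
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)

dataset = load_dataset("glue", "mrpc").map(
    lambda batch: tokenizer(batch["sentence1"], batch["sentence2"], truncation=True),
    batched=True,
)

def model_init(trial):
    # Trainer re-instantiates the model at the start of every Optuna trial
    return AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def hp_space(trial):
    # illustrative search space
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }

trainer = Trainer(
    model_init=model_init,
    args=TrainingArguments(output_dir="hpo_out", evaluation_strategy="epoch",
                           num_train_epochs=1),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)

# Non-main ranks crash here with "_pickle.UnpicklingError: pickle data was truncated"
best_trial = trainer.hyperparameter_search(
    hp_space=hp_space,
    backend="optuna",
    n_trials=10,
    direction="minimize",
)
```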
The following error is returned:
[rank2]: Traceback (most recent call last):
[rank2]: File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 1054, in <module>
[rank2]: main()
[rank2]: File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 956, in main
[rank2]: best_trial = trainer.hyperparameter_search(
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/trainer.py", line 3206, in hyperparameter_search
[rank2]: best_run = backend_obj.run(self, n_trials, direction, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/hyperparameter_search.py", line 72, in run
[rank2]: return run_hp_search_optuna(trainer, n_trials, direction, **kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/integrations/integration_utils.py", line 237, in run_hp_search_optuna
[rank2]: args = pickle.loads(bytes(args_main_rank))
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: _pickle.UnpicklingError: pickle data was truncated
[rank1]: (the identical traceback is printed again for rank 1)
System Info

Who can help?

No response

Information

Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
The problem appears when using the run_hp_search_optuna method from transformers/integrations.py. This method is called when performing a hyperparameter search through the Trainer.hyperparameter_search method; a sketch of such a call is shown below.
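A hypothetical sketch of the call (the hp_space, n_trials, and direction shown are illustrative, assuming a trainer constructed with a model_init callback):

```python
def hp_space(trial):
    # illustrative Optuna search space
    return {"learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)}

best_trial = trainer.hyperparameter_search(
    hp_space=hp_space,
    backend="optuna",
    n_trials=20,
    direction="minimize",
)
```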
The error obtained is the following:

Traceback (most recent call last):
  File "/mnt/beegfs/sstoia/proyectos/LLM_finetuning_stratified_multiclass_optuna.py", line 266, in <module>
    best_trial = trainer.hyperparameter_search(
  File "/mnt/beegfs/sstoia/.conda/envs/env/lib/python3.9/site-packages/transformers/trainer.py", line 2592, in hyperparameter_search
    best_run = backend_dict[backend](self, n_trials, direction, **kwargs)
  File "/mnt/beegfs/sstoia/.conda/envs/env/lib/python3.9/site-packages/transformers/integrations.py", line 218, in run_hp_search_optuna
    args = pickle.loads(bytes(args_main_rank))
_pickle.UnpicklingError: pickle data was truncated
Expected behavior
It should work, since the same function runs fine without multi-GPU. I guess the problem comes from a parallelization error, as both GPUs may be writing to the same file.
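One hedged reading of the traceback (not a confirmed diagnosis): args_main_rank arrives on the non-main ranks as a byte buffer shorter than what rank 0 serialized, and pickle.loads raises exactly this error whenever it is handed fewer bytes than the original dump. A quick local demonstration, no GPUs needed:

```python
import pickle

# pickle raises "pickle data was truncated" when the buffer it is given is
# shorter than the original dump -- e.g. when a receive buffer sized from a
# non-main rank's own pickle is smaller than rank 0's payload.
payload = pickle.dumps("x" * 100)   # ~110-byte pickle of a 100-char string
try:
    pickle.loads(payload[:50])      # cut mid-way through the string data
except pickle.UnpicklingError as err:
    print(err)                      # -> pickle data was truncated
```

If that reading is right, a possible fix inside the integration would be to let PyTorch size the receive buffer itself via a one-element broadcast_object_list (a sketch assuming an initialized process group; this is not the library's actual fix):

```python
import torch.distributed as dist

def sync_from_main(obj):
    # broadcast_object_list communicates each element's serialized size
    # before the payload, so receiving ranks always allocate a buffer
    # large enough for rank 0's object.
    container = [obj if dist.get_rank() == 0 else None]
    dist.broadcast_object_list(container, src=0)
    return container[0]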