huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

ERROR in run_hp_search_optuna when trying to use multi-GPU #27487

Open sstoia opened 7 months ago

sstoia commented 7 months ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

The problem appears when using the run_hp_search_optuna method from transformers/integrations.py. This method is called when performing a hyperparameter search with the Trainer.hyperparameter_search method:

best_trial = trainer.hyperparameter_search(
    direction='maximize',
    backend='optuna',
    hp_space=optuna_hp_space,
    n_trials=10,
)
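
Here, optuna_hp_space is the Optuna search-space callable passed to the backend. The actual function is not included in the report; the sketch below only illustrates what such a callable typically looks like, with assumed (hypothetical) hyperparameter names and ranges:

def optuna_hp_space(trial):
    # Hypothetical search space (the one used in the report is not shown).
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 5),
    }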

The error obtained is the following:

Traceback (most recent call last):
  File "/mnt/beegfs/sstoia/proyectos/LLM_finetuning_stratified_multiclass_optuna.py", line 266, in <module>
    best_trial = trainer.hyperparameter_search(
  File "/mnt/beegfs/sstoia/.conda/envs/env/lib/python3.9/site-packages/transformers/trainer.py", line 2592, in hyperparameter_search
    best_run = backend_dict[backend](self, n_trials, direction, **kwargs)
  File "/mnt/beegfs/sstoia/.conda/envs/env/lib/python3.9/site-packages/transformers/integrations.py", line 218, in run_hp_search_optuna
    args = pickle.loads(bytes(args_main_rank))
_pickle.UnpicklingError: pickle data was truncated

Expected behavior

It should work, as the same call works fine without multi-GPU. I suspect the problem comes from a parallelization error, as both GPUs may be writing to the same file.
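
For context on the error message itself: "pickle data was truncated" is what pickle.loads raises when the bytes it receives are shorter than the serialized payload, which fits the theory that the arguments sent from the main rank arrive incomplete on the other ranks. The snippet below is only a minimal illustration of that symptom, not the actual run_hp_search_optuna code:

import pickle

# Illustration only: serialize some trial arguments on the "main rank"...
payload = pickle.dumps({"learning_rate": 3e-5, "num_train_epochs": 4, "run_name": "trial-0"})

# ...and pretend the other rank only received part of the buffer,
# e.g. because the buffer it read from was sized too small.
truncated = payload[: len(payload) // 2]

args = pickle.loads(truncated)  # _pickle.UnpicklingError: pickle data was truncated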

amyeroberts commented 7 months ago

cc @muellerzr @pacman100

muellerzr commented 7 months ago

Sorry for the delay, will be looking into it over this week!

linhdvu14 commented 5 months ago

I'm running into the same issue. Any updates on this please?

amyeroberts commented 4 months ago

Gentle ping @muellerzr @pacman100

amyeroberts commented 2 months ago

Another ping @muellerzr @pacman100

NishchalPrasad commented 2 months ago

Running into the same issue, using the latest version of transformers (4.40.1) and Python 3.11.

tomaarsen commented 1 month ago

Having the same issue with my Trainer subclass when doing HPO with DDP and optuna.

amyeroberts commented 3 weeks ago

Gentle ping @muellerzr

svduplessis commented 3 days ago

Same issue here. Also trying to run a hyperparameter search with DDP (accelerate launch), using Trainer with Optuna as the backend.

The following error is returned:

[rank2]: Traceback (most recent call last):
[rank2]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 1054, in <module>
[rank2]:     main()
[rank2]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 956, in main
[rank2]:     best_trial = trainer.hyperparameter_search(
[rank2]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/trainer.py", line 3206, in hyperparameter_search
[rank2]:     best_run = backend_obj.run(self, n_trials, direction, **kwargs)
[rank2]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/hyperparameter_search.py", line 72, in run
[rank2]:     return run_hp_search_optuna(trainer, n_trials, direction, **kwargs)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/integrations/integration_utils.py", line 237, in run_hp_search_optuna
[rank2]:     args = pickle.loads(bytes(args_main_rank))
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: _pickle.UnpicklingError: pickle data was truncated
[rank1]: Traceback (most recent call last):
[rank1]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 1054, in <module>
[rank1]:     main()
[rank1]:   File "/scratch-small-local/303700.hpc1.hpc/run_hpo_optuna.py", line 956, in main
[rank1]:     best_trial = trainer.hyperparameter_search(
[rank1]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/trainer.py", line 3206, in hyperparameter_search
[rank1]:     best_run = backend_obj.run(self, n_trials, direction, **kwargs)
[rank1]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/hyperparameter_search.py", line 72, in run
[rank1]:     return run_hp_search_optuna(trainer, n_trials, direction, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/2x22/mlm/miniconda/envs/thesis/lib/python3.11/site-packages/transformers/integrations/integration_utils.py", line 237, in run_hp_search_optuna
[rank1]:     args = pickle.loads(bytes(args_main_rank))
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: _pickle.UnpicklingError: pickle data was truncated