huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers

cannot import name 'ShardedDDPOption' from 'transformers.trainer' #33242

Closed. nishitanand closed this issue 2 weeks ago.

nishitanand commented 2 months ago

System Info

I am getting the following error, which should not be happening: cannot import name 'ShardedDDPOption' from 'transformers.trainer'

I have the following versions installed: tokenizers 0.19.1, transformers 4.43.4, huggingface-hub 0.24.6

I have upgraded Vicuna-7B-v1.5 to Llama 3.1 8B in this GitHub repo - https://github.com/baaivision/EVE

This works with vicuna-7b-v1.5 but not with Llama 3.1 8B, even though the change between the two is small. I earlier got a RoPE error, but solved it by upgrading transformers as suggested in this discussion - https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/15

Who can help?

@amyeroberts @muellerzr @SunMarc

Information

Tasks

Reproduction

I run bash eve7b_prealign.sh 0 localhost

As described above, this works with vicuna-7b-v1.5 but fails with Llama 3.1 8B after the model upgrade, even though the change between the two is small.

Expected behavior

The model should start training

LysandreJik commented 2 months ago

cc @muellerzr and @SunMarc

SunMarc commented 2 months ago

Hey @nishitanand, thanks for reporting! Could you share your traceback? This shouldn't happen, as with your current version of transformers (4.43.4), ShardedDDPOption no longer exists. Maybe try to uninstall transformers and then install it again.
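
A quick check like the following (a minimal sketch; the printed version and path are just examples) can confirm which transformers installation is actually being imported after the reinstall:

```python
# Minimal environment check: make sure the transformers that Python imports is
# the freshly installed one, not a stale copy from another environment.
import transformers

print(transformers.__version__)  # should report 4.43.4 here
print(transformers.__file__)     # should point at the expected site-packages
```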

nishitanand commented 2 months ago

Hi, I uninstalled and reinstalled transformers. I have also tried transformers 4.44.2; same error. I think the problem is that the code uses/requires sharded DDP, and I think sharded DDP was removed after transformers v4.34.0, i.e. from v4.35.0 onwards. Earlier I used Vicuna-v1.5 and the older version of transformers worked fine, but I have upgraded Vicuna-v1.5 to Llama 3.1, and Llama 3.1 requires a newer version of transformers, which sadly no longer has sharded DDP.

Here is the traceback:

```
Traceback (most recent call last):
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/train_mem.py", line 14, in <module>
    from eve.train.train import train
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/eve/train/train.py", line 43, in <module>
    from eve.train.eve_trainer import EVETrainer
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/eve/train/eve_trainer.py", line 8, in <module>
    from transformers.trainer import (ALL_LAYERNORM_LAYERS, ShardedDDPOption,
ImportError: cannot import name 'ShardedDDPOption' from 'transformers.trainer' (/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/transformers/trainer.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 671982) of binary: /fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/bin/python
Traceback (most recent call last):
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_mem.py FAILED
Failures:
```
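
Only the ShardedDDPOption name in that import line appears to be the problem; the other name on the same line still resolves (a minimal check, assuming transformers 4.43.x / 4.44.x):

```python
# Assuming transformers 4.43.x / 4.44.x: ALL_LAYERNORM_LAYERS is still exported
# by transformers.trainer, but ShardedDDPOption is gone.
from transformers.trainer import ALL_LAYERNORM_LAYERS  # imports fine

try:
    from transformers.trainer import ShardedDDPOption  # removed upstream
except ImportError:
    print("ShardedDDPOption is no longer exported by transformers.trainer")
```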

SunMarc commented 2 months ago

That's indeed the case. It looks like the code in their repo needs to be updated to work with the current Trainer. Sorry for the breaking change. Do you know what replaced ShardedDDP @muellerzr, so that @nishitanand can fix the trainer subclass?
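
For reference, fairscale-based sharded DDP was removed from recent versions of Trainer in favor of FSDP/DeepSpeed handled through accelerate (configured via the fsdp / fsdp_config / deepspeed fields of TrainingArguments), so a subclass that only imported ShardedDDPOption for its optimizer setup can usually just drop it. Below is a rough sketch of what an updated create_optimizer override could look like, assuming the EVE trainer follows the common LLaVA-style weight-decay grouping; the class name and grouping here are assumptions for illustration, not the actual EVE code:

```python
# Hypothetical sketch: a Trainer subclass whose create_optimizer no longer
# depends on ShardedDDPOption. Sharded training is left to FSDP/DeepSpeed via
# accelerate, configured through TrainingArguments rather than in this method.
from transformers import Trainer
from transformers.trainer import ALL_LAYERNORM_LAYERS, get_parameter_names


class EVETrainer(Trainer):  # assumed name, matching the import in eve_trainer.py
    def create_optimizer(self):
        if self.optimizer is None:
            # Apply weight decay to everything except LayerNorm weights and biases.
            decay_parameters = get_parameter_names(self.model, ALL_LAYERNORM_LAYERS)
            decay_parameters = [n for n in decay_parameters if "bias" not in n]
            optimizer_grouped_parameters = [
                {
                    "params": [p for n, p in self.model.named_parameters()
                               if n in decay_parameters and p.requires_grad],
                    "weight_decay": self.args.weight_decay,
                },
                {
                    "params": [p for n, p in self.model.named_parameters()
                               if n not in decay_parameters and p.requires_grad],
                    "weight_decay": 0.0,
                },
            ]
            optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(self.args)
            # No ShardedDDPOption branch any more: just build the optimizer directly.
            self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
        return self.optimizer
```

The exact edit depends on how eve_trainer.py uses self.sharded_ddp elsewhere, so the file should be checked for any remaining references.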

nishitanand commented 2 months ago

Hi @muellerzr, I'd really appreciate it if you could shed some light on this issue. I'm working on a priority project.

nishitanand commented 2 months ago

Hi @SunMarc, any pointers on how to solve the issue?

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.