Closed · nishitanand closed this 2 weeks ago
cc @muellerzr and @SunMarc
Hey @nishitanand, thanks for reporting! Could you share your traceback? This shouldn't happen, as with your current version of transformers (4.43.4), ShardedDDPOption
no longer exists. Maybe try uninstalling transformers and then installing it again.
Hi, I uninstalled and reinstalled transformers. I have tried with transformers version 4.44.2 as well; same error. I think the problem is that the code requires sharded DDP, and sharded DDP was removed after transformers v4.34.0, i.e. from v4.35.0 onwards. Earlier I used Vicuna-v1.5 and the older version of transformers worked fine, but I have upgraded from Vicuna-v1.5 to Llama 3.1, and Llama 3.1 requires a newer version of transformers, which sadly no longer has sharded DDP.
Here is the traceback:
Traceback (most recent call last):
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/train_mem.py", line 14, in <module>
    from eve.train.train import train
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/eve/train/train.py", line 43, in <module>
    from eve.train.eve_trainer import EVETrainer
  File "/fs/nexus-scratch/nishit/gamma/encoderfree/EVE/eve/train/eve_trainer.py", line 8, in <module>
    from transformers.trainer import (ALL_LAYERNORM_LAYERS, ShardedDDPOption,
ImportError: cannot import name 'ShardedDDPOption' from 'transformers.trainer' (/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/transformers/trainer.py)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 671982) of binary: /fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/bin/python
Traceback (most recent call last):
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fs/nexus-scratch/nishit/miniconda3/envs/eve_llama3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_mem.py FAILED
Failures:
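A quick way to unblock the import is to guard it. This is only a stopgap sketch, assuming the ShardedDDPOption branches in eve_trainer.py are never taken when sharded DDP isn't enabled:

# Top of eve/train/eve_trainer.py -- guarded import (stopgap sketch)
from transformers.trainer import ALL_LAYERNORM_LAYERS

try:
    # Removed from transformers in v4.35.0 along with the fairscale integration
    from transformers.trainer import ShardedDDPOption
except ImportError:
    ShardedDDPOption = None  # sharded DDP is gone; FSDP/DeepSpeed replace it

Every comparison against ShardedDDPOption in the subclass then has to tolerate None, so this only helps when sharded DDP is not actually requested in the training arguments.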
That's indeed the case. It looks like the code in their repo needs to be updated to work with the current Trainer. Sorry for the breaking change. Do you know what replaced ShardedDDP,
@muellerzr, so that @nishitanand can fix the trainer subclass?
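For context: the fairscale-based sharded DDP integration was removed in transformers v4.35.0, and FSDP or DeepSpeed (both driven through accelerate) are the supported sharding paths since then. In LLaVA-style trainers such as EVE's, the create_optimizer override mostly exists to build custom weight-decay parameter groups, so the fix is to drop the fairscale branch entirely. A minimal sketch of a post-4.35 override follows; the grouping below mirrors the stock Trainer's logic, not EVE's exact code:

from transformers.trainer import Trainer
from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
from transformers.trainer_pt_utils import get_parameter_names

class EVETrainer(Trainer):
    def create_optimizer(self):
        # Apply weight decay to everything except LayerNorm weights and biases;
        # all ShardedDDPOption/fairscale handling is simply removed.
        if self.optimizer is None:
            decay_parameters = get_parameter_names(self.model, ALL_LAYERNORM_LAYERS)
            decay_parameters = [n for n in decay_parameters if "bias" not in n]
            optimizer_grouped_parameters = [
                {
                    "params": [p for n, p in self.model.named_parameters()
                               if n in decay_parameters and p.requires_grad],
                    "weight_decay": self.args.weight_decay,
                },
                {
                    "params": [p for n, p in self.model.named_parameters()
                               if n not in decay_parameters and p.requires_grad],
                    "weight_decay": 0.0,
                },
            ]
            optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(self.args)
            self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
        return self.optimizer

Sharding is then requested through TrainingArguments (e.g. fsdp="full_shard") rather than through a sharded_ddp flag.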
Hi @muellerzr, I'd really appreciate it if you could shed some light on the issue. I'm working on a priority project.
Hi @SunMarc, any pointers on how to solve the issue?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
I am getting the following error, which should not occur: cannot import name 'ShardedDDPOption' from 'transformers.trainer'
I have the following versions installed: tokenizers 0.19.1, transformers 4.43.4, huggingface-hub 0.24.6
I have upgraded vicuna-7b-v1.5 to Llama 3.1 8B in this GitHub repo: https://github.com/baaivision/EVE
This works with vicuna-7b-v1.5 but not with Llama 3.1 8B; it should work, as there isn't much change between the two setups. I earlier got a RoPE error, but solved it by upgrading transformers as suggested in this discussion: https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/discussions/15
Who can help?
https://github.com/amyeroberts https://github.com/muellerzr https://github.com/SunMarc
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I run bash eve7b_prealign.sh 0 localhost
Expected behavior
The model should start training.