huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Error during evaluation using deepspeed zero stage 2 #24294

Closed shahules786 closed 1 year ago

shahules786 commented 1 year ago

System Info

transformers v4.30.0, Python 3.8

Training with DeepSpeed ZeRO stage 2 hits an error in the evaluation/prediction loop. Both prediction and evaluation initialize [deepspeed with inference=True](https://github.com/huggingface/transformers/blob/6793f0cfe0006d7cedfb9b6081f55d9d38eae18a/src/transformers/trainer.py#L3045), so inference can no longer run for anything other than stage 3 (ZeRO inference is not supported for stages 1/2).
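For context, the check that fires (in `deepspeed_init` in `src/transformers/deepspeed.py`, per the traceback below) is roughly equivalent to this sketch; only the error message is verbatim, the names around it are my approximation:

```python
# A rough sketch of the guard in deepspeed_init (transformers v4.30.0).
# Only the error message is verbatim; names here are approximations.
def check_zero_inference(zero_stage: int, inference: bool) -> None:
    if inference and zero_stage != 3:
        # Stages 1/2 shard optimizer states/gradients, which don't exist at
        # inference time; only stage 3 (parameter sharding) helps inference.
        raise ValueError(
            "ZeRO inference only makes sense with ZeRO Stage 3 - please adjust your config"
        )
```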

So my question is: how do I run evaluation with DeepSpeed ZeRO 2? My code is here.

Error stack:

```
Traceback (most recent call last):
  File "funtuner/trainer.py", line 98, in train
    trainer.train()
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2011, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2312, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3043, in evaluate
    output = eval_loop(
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3769, in prediction_loop
    _, _ = deepspeed_init(self, num_training_steps=0, inference=True)
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/deepspeed.py", line 351, in deepspeed_init
    raise ValueError("ZeRO inference only makes sense with ZeRO Stage 3 - please adjust your config")
ValueError: ZeRO inference only makes sense with ZeRO Stage 3 - please adjust your config
```
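For reference, a minimal sketch of a setup that reproduces this (hypothetical placeholders: `model`, `train_ds`, and `eval_ds` stand in for my actual model and datasets; assumes transformers v4.30.0 and a DeepSpeed-compatible launcher):

```python
from transformers import Trainer, TrainingArguments

# ZeRO stage 2 config passed as a dict (a path to a JSON file also works).
ds_config = {
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",
    evaluation_strategy="steps",  # anything that triggers the eval loop
    eval_steps=10,
    deepspeed=ds_config,
)

# model, train_ds, eval_ds are placeholders for your own objects.
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
# evaluate() re-initializes DeepSpeed with inference=True, which raises
# "ZeRO inference only makes sense with ZeRO Stage 3" for stages 1/2.
trainer.train()
```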

Who can help?

@pacman100

Reproduction

My code is here. Run `python3 funtuner/trainer.py`.

Expected behavior

The evaluation loop should run without any error using DeepSpeed ZeRO stages 1 and 2.

pacman100 commented 1 year ago

Hello, could you try the latest release and let us know if that resolves the issue?

pacman100 commented 1 year ago

Getting `ModuleNotFoundError: No module named 'funtuner'` when trying to run `python3 funtuner/trainer.py`.

shahules786 commented 1 year ago

Hi @pacman100, can you add the repo to `PYTHONPATH` and try again? `export PYTHONPATH="${PYTHONPATH}:/your-path/Funtuner"` Also, check out the dev-train branch. The issue remains the same with the latest version; I tried that.

pacman100 commented 1 year ago

Also, on how many GPUs are you running this?

shahules786 commented 1 year ago

One V100 (16GB).

pacman100 commented 1 year ago

With one GPU there won't be any sharding of the optimizer states and gradients, so it will be the same as DDP. So I'm a bit confused there.

pacman100 commented 1 year ago

Also, I'm getting various issues when running with 2 GPUs:

main branch

```
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

dev-train branch

```
Traceback (most recent call last):
  File "/home/sourab/Funtuner/funtuner/trainer.py", line 28, in train
    os.mkdir(cfg.log_dir)
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/c.scmse/Funtuner-logs'
```

shahules786 commented 1 year ago

The main branch is not updated; please stick to dev-train for now. To fix this error, please change `log_dir` to your own folder here. You might also want to set `log_wandb=False`. I have run this branch in both single- and multi-GPU settings, although I now use only a single GPU for the redpajama-3B model.
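Alternatively, a more defensive sketch that avoids the `FileNotFoundError` entirely (the path is taken from the traceback above):

```python
import os

log_dir = "/scratch/c.scmse/Funtuner-logs"  # the cfg.log_dir from the traceback
# os.mkdir fails when the parent directory is missing; makedirs creates the
# whole tree, and exist_ok=True makes reruns a no-op.
os.makedirs(log_dir, exist_ok=True)
```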

shahules786 commented 1 year ago

> With one GPU there won't be any sharding of the optimizer states and gradients, so it will be the same as DDP. So I'm a bit confused there.

I think that with a single GPU + DeepSpeed ZeRO 2, I can still benefit from ZeRO offloading and smarter GPU memory management, allowing me to fit larger models/batch sizes.
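For reference, a sketch of the kind of config I mean (field names follow the DeepSpeed config schema; the values are illustrative, not copied from my repo):

```python
# ZeRO stage 2 with optimizer-state offload to CPU: optimizer states live in
# host RAM, freeing GPU memory for larger models/batches even on one GPU
# (at the cost of extra PCIe traffic).
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}
```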

pacman100 commented 1 year ago

The above PR should resolve the DeepSpeed issue.

shahules786 commented 1 year ago

I'll try it out once merged.