Closed: shahules786 closed this issue 1 year ago.
Hello, could you try the latest release and let us know if that resolves the issues?
Getting ModuleNotFoundError: No module named 'funtuner'
when trying to run python3 funtuner/trainer.py
Hi @pacman100, can you add the repo path to PYTHONPATH and try again?
export PYTHONPATH="${PYTHONPATH}:/your-path/Funtuner"
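For reference, a minimal sketch of an alternative that avoids exporting PYTHONPATH, assuming the usual layout where trainer.py lives in your-path/Funtuner/funtuner/: prepend the repository root to sys.path at the top of funtuner/trainer.py, before the funtuner imports.

```python
# Hedged alternative to exporting PYTHONPATH (assumes trainer.py sits in <repo>/funtuner/):
# prepend the repository root to sys.path so "import funtuner" resolves when the
# script is launched as "python3 funtuner/trainer.py" from the repo root.
import sys
from pathlib import Path

sys.path.insert(0, str(Path(__file__).resolve().parent.parent))
```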
Also check out the dev-train branch.
The issue remains the same with the latest version. I tried that.
Also, on how many GPUs are you running this?
One V100 16GB.
With one GPU, there won't be any sharding of the optimizer states and gradients, so it will be the same as DDP. I'm a bit confused there.
Also, getting various issues when running with 2 GPUs:
main branch:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
dev-train branch:
Traceback (most recent call last):
File "/home/sourab/Funtuner/funtuner/trainer.py", line 28, in train
os.mkdir(cfg.log_dir)
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/c.scmse/Funtuner-logs'
The main branch is not updated, please stick to dev-train for now. To fix this error, please change the log_dir here to your own folder; you might also want to set log_wandb=False.
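A minimal sketch of how that directory setup could be made more forgiving, assuming cfg.log_dir is the Hydra-configured path from the traceback above; prepare_log_dir is a hypothetical helper, not part of Funtuner:

```python
import os


def prepare_log_dir(log_dir: str) -> str:
    # Unlike the bare os.mkdir(cfg.log_dir) call in trainer.py, os.makedirs creates
    # any missing parent directories, and exist_ok=True keeps re-runs from failing.
    os.makedirs(log_dir, exist_ok=True)
    return log_dir
```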
I have run this branch in both single- and multi-GPU settings, although now I use only a single GPU for the redpajama-3B model.
> With one GPU, there won't be any sharding of the optimizer states and gradients, so it will be the same as DDP. I'm a bit confused there.
I think that with a single GPU + DeepSpeed ZeRO stage 2 I can still benefit from ZeRO offloading and smarter GPU memory management, allowing me to fit larger models/batch sizes.
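For context, a rough sketch of what such a single-GPU ZeRO stage 2 + CPU offload configuration could look like, written as the Python dict form that the transformers integration accepts; the values are illustrative, not Funtuner's actual settings:

```python
# Sketch of a ZeRO stage 2 config with optimizer offloading to CPU for a single GPU.
# "auto" lets the transformers/DeepSpeed integration fill values from TrainingArguments.
ds_config_zero2_offload = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "fp16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
}
```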
The above PR should resolve the DS issue.
I'll try it out once merged.
System Info
transformers v4.30.0, Python 3.8
Training using DeepSpeed ZeRO stage 2
I hit an error in the evaluation/prediction loop. Both prediction and evaluation initiate [deepspeed with inference=True](https://github.com/huggingface/transformers/blob/6793f0cfe0006d7cedfb9b6081f55d9d38eae18a/src/transformers/trainer.py#L3045), and hence inference can't be run for anything other than stage 3 (inference is not supported for ZeRO 1/2). So my question is: how can I run evaluation with DeepSpeed ZeRO stage 2? My code is here.
Error stack
Traceback (most recent call last):
  File "funtuner/trainer.py", line 98, in train
    trainer.train()
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2011, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/trainer.py", line 2312, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3043, in evaluate
    output = eval_loop(
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/trainer.py", line 3769, in prediction_loop
    _, _ = deepspeed_init(self, num_training_steps=0, inference=True)
  File "/nfshome/store03/users/c.scmse/venv/lib/python3.8/site-packages/transformers/deepspeed.py", line 351, in deepspeed_init
    raise ValueError("ZeRO inference only makes sense with ZeRO Stage 3 - please adjust your config")
ValueError: ZeRO inference only makes sense with ZeRO Stage 3 - please adjust your config
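A hedged workaround sketch (not the upstream fix, and all values below are placeholders): since it is the evaluation/prediction loop that re-initializes DeepSpeed with inference=True, keeping evaluation out of the training run (and avoiding explicit trainer.evaluate()/predict() calls) sidesteps this ValueError under ZeRO stage 1/2 until the Trainer-side fix is available.

```python
from transformers import TrainingArguments

# Minimal ZeRO stage 2 config passed inline; transformers also accepts a path
# to an equivalent JSON file through the same `deepspeed` argument.
ds_config = {
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

training_args = TrainingArguments(
    output_dir="outputs",            # placeholder path
    deepspeed=ds_config,
    evaluation_strategy="no",        # skip the eval loop that calls deepspeed_init(..., inference=True)
    per_device_train_batch_size=4,   # placeholder value
    num_train_epochs=1,              # placeholder value
)
```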
Who can help?
@pacman100
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
My code is here. Run:
python3 funtuner/trainer.py
Expected behavior
Run the evaluation loop without any error using DeepSpeed ZeRO stages 1 and 2.