Open le153234 opened 1 year ago
@le153234 there should be an output log at /content/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/125m/training.log. Can you share the contents of that file?
Attached the output log: 125m-training.log
I got the same error with GPT-J 6B
I got the same error, but with return code=-7
I'm getting the same error code. I'm trying to use the demo setup with cuda=11.6, torch=1.12, cudnn=8.4.0, python=3.8.
Same error here, with no detailed hints.
wsl cat /proc/version
Linux version 5.15.90.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Jan 27 02:56:13 UTC 2023
DeepSpeed general environment info:
torch install path ............... ['/home/dev/.local/lib/python3.8/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/dev/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
any updates?
Hi,
Same issue for me: there is no detailed information about the error in my output. Is there any reference link for this error, or any update on this issue?
same here
Same error
same error
same error
same error
same error
[2023-04-14 13:11:27,879] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 13266
[2023-04-14 13:11:27,885] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-125m', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', '/content/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/125m'] exits with return code = -9
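For anyone trying to read the "return code = -9" (or -7) above: assuming the launcher reports the child's status the way Python's `subprocess` module does, a negative value -N means the process was terminated by signal N. So -9 would be SIGKILL, and on Linux -7 would typically be SIGBUS. Here is a minimal sketch for decoding such codes. The helper name `describe_return_code` is made up for illustration and is not part of DeepSpeed.

```python
import signal


def describe_return_code(code: int) -> str:
    """Translate a subprocess-style return code into a readable message.

    Assumes the Python subprocess convention: a negative code -N means
    the child was terminated by signal N (POSIX only).
    """
    if code == 0:
        return "exited normally"
    if code > 0:
        return f"exited with status {code}"
    signum = -code
    try:
        # Signal numbers are platform dependent; look the name up locally.
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return f"killed by signal {signum} ({name})"


if __name__ == "__main__":
    for rc in (-9, -7, 1, 0):
        print(rc, "->", describe_return_code(rc))
```

A SIGKILL usually arrives from outside the process, which would explain why training.log has no traceback. One common cause (though not the only one) is the kernel's out-of-memory killer, so checking host RAM usage or `dmesg` output may be a reasonable next step.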