microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] training [ERROR] [launch.py:434:sigkill_handler] exits with return code = -9 #3232

Open le153234 opened 1 year ago

le153234 commented 1 year ago

[2023-04-14 13:11:27,879] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 13266
[2023-04-14 13:11:27,885] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-125m', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', '/content/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/125m'] exits with return code = -9
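A note for readers hitting this: a negative return code from the DeepSpeed launcher is the number of the signal that killed the worker (standard Python subprocess behaviour), so -9 is SIGKILL, which most commonly means the Linux OOM killer terminated the process when host RAM ran out (dataset preprocessing and model loading are frequent triggers). The -7 variant reported further down decodes to SIGBUS. The sketch below is not from the thread; it only shows how to decode the code and check available memory on Linux/WSL before relaunching.

```python
# Minimal sketch (not from the thread): decode the launcher's negative return code and
# check available host memory on Linux/WSL. Python's subprocess reports a child killed
# by signal N as return code -N, which is the number launch.py's sigkill_handler prints.
import signal

def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a human-readable cause."""
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return f"worker killed by {sig.name} (signal {-returncode})"
    return f"worker exited with code {returncode}"

def mem_available_gib() -> float:
    """Read MemAvailable from /proc/meminfo (Linux/WSL only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)  # kB -> GiB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

print(describe_exit(-9))   # worker killed by SIGKILL (signal 9) -- usually the OOM killer
print(describe_exit(-7))   # worker killed by SIGBUS (signal 7)
print(f"{mem_available_gib():.1f} GiB available before relaunching")
```

If memory turns out to be the culprit, reducing --per_device_train_batch_size, picking a smaller model, or giving the machine (or the WSL2 VM) more RAM are the usual first steps.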

mrwyattii commented 1 year ago

@le153234 there should be an output log at /content/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/125m/training.log - can you share the contents of that file?

le153234 commented 1 year ago

Attached the output log: 125m-training.log

le153234 commented 1 year ago

125m-training.log

puyuanOT commented 1 year ago

I got the same error with GPT-J 6B

stainswei commented 1 year ago

I got the same error, but with return code = -7

MickeyJson commented 1 year ago

I'm getting the same error code. I'm trying to run the demo step with CUDA 11.6, torch 1.12, cuDNN 8.4.0, Python 3.8.

afeilulu commented 1 year ago

Same error here, with no detailed error message to go on.

WSL, cat /proc/version:
Linux version 5.15.90.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Jan 27 02:56:13 UTC 2023

ds_report

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meets the required dependencies to JIT install the op.

JIT compiled ops require ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/dev/.local/lib/python3.8/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/dev/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
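One observation on the report above (not from the original commenter): the DeepSpeed wheel was compiled against torch 2.0 / CUDA 11.7 while the system nvcc is 12.1. That mismatch only matters for JIT-compiled ops and is unlikely to cause the -9 exit itself, but it is easy to confirm with a few lines, assuming torch is importable and nvcc is on the PATH:

```python
# Small sketch (not from the thread): compare the CUDA toolkit torch was built with
# against the nvcc found on PATH, mirroring the 11.7 vs 12.1 discrepancy in ds_report.
import re
import subprocess

import torch

torch_cuda = torch.version.cuda or "none"   # CUDA version torch was compiled with
nvcc_out = subprocess.run(["nvcc", "--version"],
                          capture_output=True, text=True).stdout
m = re.search(r"release (\d+\.\d+)", nvcc_out)
nvcc_cuda = m.group(1) if m else "unknown"

print(f"torch {torch.__version__} built with CUDA {torch_cuda}")
print(f"system nvcc reports CUDA {nvcc_cuda}")
if torch_cuda.split(".")[0] != nvcc_cuda.split(".")[0]:
    print("WARNING: major CUDA versions differ; JIT builds of DeepSpeed ops may fail")
```

Aligning the toolkit (or relying purely on the pre-built wheel and JIT ops that do compile) removes one variable while debugging the crash.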

iFocusing commented 1 year ago

any updates?

dineshreddy221 commented 1 year ago

Hi,

Same issue for me; there is no detailed information about the error in my output. Is there a reference link for this error, or any update on this issue?

hongyix commented 1 year ago

same here

yingying123321 commented 1 year ago

Same error

Khachdallak02 commented 1 year ago

same error

RanchiZhao commented 1 year ago

same error

naginoa commented 1 year ago

same error

dsn01 commented 2 months ago

same error