microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] training [ERROR] [launch.py:434:sigkill_handler] exits with return code = -9 #3232

Open le153234 opened 1 year ago

le153234 commented 1 year ago

[2023-04-14 13:11:27,879] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 13266
[2023-04-14 13:11:27,885] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', 'openai/webgpt_comparisons', 'stanfordnlp/SHP', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-125m', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '8', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.1', '--num_train_epochs', '2', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', '/content/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/125m'] exits with return code = -9
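A note for readers hitting this: a negative return code from the DeepSpeed launcher is the number of the signal that killed the worker (standard Python subprocess behaviour), so -9 is SIGKILL, which most commonly means the Linux OOM killer terminated the process when host RAM ran out (dataset preprocessing and model loading are frequent triggers). The -7 variant reported further down decodes to SIGBUS. The sketch below is not from the thread; it only shows how to decode the code and check available memory on Linux/WSL before relaunching.

```python
# Minimal sketch (not from the thread): decode the launcher's negative return code and
# check available host memory on Linux/WSL. Python's subprocess reports a child killed
# by signal N as return code -N, which is the number launch.py's sigkill_handler prints.
import signal

def describe_exit(returncode: int) -> str:
    """Translate a subprocess return code into a human-readable cause."""
    if returncode < 0:
        sig = signal.Signals(-returncode)
        return f"worker killed by {sig.name} (signal {-returncode})"
    return f"worker exited with code {returncode}"

def mem_available_gib() -> float:
    """Read MemAvailable from /proc/meminfo (Linux/WSL only)."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) / (1024 ** 2)  # kB -> GiB
    raise RuntimeError("MemAvailable not found in /proc/meminfo")

print(describe_exit(-9))   # worker killed by SIGKILL (signal 9) -- usually the OOM killer
print(describe_exit(-7))   # worker killed by SIGBUS (signal 7)
print(f"{mem_available_gib():.1f} GiB available before relaunching")
```

If memory turns out to be the culprit, reducing --per_device_train_batch_size, picking a smaller model, or giving the machine (or the WSL2 VM) more RAM are the usual first steps.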

mrwyattii commented 1 year ago

@le153234 there should be an output log at /content/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/125m/training.log - can you share the contents of that file?

le153234 commented 1 year ago

Attached the output log: 125m-training.log

le153234 commented 1 year ago

125m-training.log

puyuanOT commented 1 year ago

I got the same error with GPT-J 6B

stainswei commented 1 year ago

I got the same error, but with return code = -7

MickeyJson commented 1 year ago

I'm getting the same error code. I'm trying to run the demo step with CUDA 11.6, torch 1.12, cuDNN 8.4.0, Python 3.8.

afeilulu commented 1 year ago

Same error here, with no detailed error message to go on.

WSL, cat /proc/version:
Linux version 5.15.90.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Jan 27 02:56:13 UTC 2023

ds_report

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meets the required dependencies to JIT install the op.

JIT compiled ops require ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/dev/.local/lib/python3.8/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/dev/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.1, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
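One observation on the report above (not from the original commenter): the DeepSpeed wheel was compiled against torch 2.0 / CUDA 11.7 while the system nvcc is 12.1. That mismatch only matters for JIT-compiled ops and is unlikely to cause the -9 exit itself, but it is easy to confirm with a few lines, assuming torch is importable and nvcc is on the PATH:

```python
# Small sketch (not from the thread): compare the CUDA toolkit torch was built with
# against the nvcc found on PATH, mirroring the 11.7 vs 12.1 discrepancy in ds_report.
import re
import subprocess

import torch

torch_cuda = torch.version.cuda or "none"   # CUDA version torch was compiled with
nvcc_out = subprocess.run(["nvcc", "--version"],
                          capture_output=True, text=True).stdout
m = re.search(r"release (\d+\.\d+)", nvcc_out)
nvcc_cuda = m.group(1) if m else "unknown"

print(f"torch {torch.__version__} built with CUDA {torch_cuda}")
print(f"system nvcc reports CUDA {nvcc_cuda}")
if torch_cuda.split(".")[0] != nvcc_cuda.split(".")[0]:
    print("WARNING: major CUDA versions differ; JIT builds of DeepSpeed ops may fail")
```

Aligning the toolkit (or relying purely on the pre-built wheel and JIT ops that do compile) removes one variable while debugging the crash.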

iFocusing commented 1 year ago

any updates?

dineshreddy221 commented 1 year ago

Hi,

Same issue for me; there is no detailed information about the error in my output. Is there a reference link for this error, or any update on this issue?

hongyix commented 1 year ago

same here

yingying123321 commented 1 year ago

Same error

Khachdallak02 commented 1 year ago

same error

RanchiZhao commented 1 year ago

same error

naginoa commented 1 year ago

same error

dsn01 commented 2 months ago

same error