microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] "Deepspeed: command not found" when I run shell to train my model #3308

Closed: forestbat closed this issue 1 year ago

forestbat commented 1 year ago

Describe the bug

I want to launch training with the deepspeed command from a shell script. I installed DeepSpeed with pip:

(base) forestbat@vm-jupyterhub-server:~/BELLE/train$ pip install deepspeed
Defaulting to user installation because normal site-packages is not writeable
Collecting deepspeed
Using cached deepspeed-0.9.0-py3-none-any.whl
……
Successfully installed deepspeed-0.9.0

But when I run my shell script, I get this error:

(base) forestbat@vm-jupyterhub-server:~/BELLE/train$ bash training_scripts/single_node/run_FT.sh
training_scripts/single_node/run_FT.sh: line 17: deepspeed: command not found

There is also no deepspeed entry in my conda list output. This is my training script:

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
OUTPUT=$1
ZERO_STAGE=$2
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
    ZERO_STAGE=3
fi
mkdir -p $OUTPUT
#bigscience/bloomz-1b7

deepspeed main.py \
   --sft_only_data_path BELLE/train_2M_CN.json \
   --model_name_or_path dalai/alpaca/models/7B/ggml-model-q4_0.bin \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 2 \
   --max_seq_len 1024 \
   --learning_rate 5e-6 \
   --weight_decay 0.0001 \
   --num_train_epochs 2  \
   --gradient_accumulation_steps 8 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 100 \
   --seed 1234 \
   --gradient_checkpointing \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
#    &> $OUTPUT/training.log

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/forestbat/.local/lib/python3.9/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/forestbat/.local/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.9.0, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0

mrwyattii commented 1 year ago

@forestbat thanks for reporting this. Could you please run which deepspeed to find the location of the DeepSpeed executable and share the output? It looks like the executable has not been added to your PATH. However, you were able to run ds_report, which is another executable script that DeepSpeed installs. Do you use bash, zsh, csh, or some other shell?
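
One way to gather the information requested here is sketched below. It assumes a bash shell and that python3 is the interpreter pip installed into; the helper names path_contains and locate_cmd are illustrative and not part of DeepSpeed. Because the pip log above says "Defaulting to user installation", pip would normally place console scripts such as deepspeed in the bin directory under the user base (typically ~/.local/bin on Linux), so the sketch also checks whether that directory is on PATH.

```shell
#!/bin/bash
# Diagnostic sketch; helper names are illustrative, not part of DeepSpeed.

# Succeed if directory $1 is an entry of the colon-separated list $2.
path_contains() {
    case ":$2:" in
        *":$1:"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the resolved location of command $1, or a note if the shell cannot find it.
locate_cmd() {
    command -v "$1" || echo "$1: not on PATH"
}

locate_cmd deepspeed
locate_cmd ds_report

# Assumption: python3 is the interpreter that pip installed into.
if user_base=$(python3 -m site --user-base 2>/dev/null); then
    user_bin="$user_base/bin"
    if path_contains "$user_bin" "$PATH"; then
        echo "$user_bin is on PATH"
    else
        echo "$user_bin is NOT on PATH; try: export PATH=\"$user_bin:\$PATH\""
    fi
fi
```

If the user bin directory turns out to be missing from PATH, adding the suggested export line to ~/.bashrc is one possible fix. Using a separate environment, as the thread ends up doing, is another.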

forestbat commented 1 year ago

In fact, I can't run ds_report either. The report I posted above was generated by python -m deepspeed.env_report. And which deepspeed prints nothing:

(base) forestbat@vm-jupyterhub-server:~$ which deepspeed
(base) forestbat@vm-jupyterhub-server:~$ 

forestbat commented 1 year ago

I switched to a new conda environment, and now it works correctly.

sohaibsoussi commented 9 months ago

I would like to know how you did it on Ubuntu.

forestbat commented 9 months ago

conda init bash
source ~/.bashrc
conda activate xxx
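
The reply above only shows how to activate an environment. For readers who also need to create one, here is a minimal sketch. The environment name (ds-env), the Python version, and the run and setup_env helpers are assumptions for illustration; none of them come from the thread. By default the script only prints the commands it would run; set DRY_RUN=0 to execute them.

```shell
#!/bin/bash
# Sketch: create a fresh conda environment and install DeepSpeed into it.
# ENV_NAME and the Python version are placeholders, not values from the thread.
ENV_NAME=${ENV_NAME:-ds-env}
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to actually execute the commands

# Echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

setup_env() {
    run conda create -y -n "$ENV_NAME" python=3.9
    # Run pip through the environment's own Python so the package and its
    # deepspeed launcher land inside the environment rather than in ~/.local.
    run conda run -n "$ENV_NAME" python -m pip install deepspeed
    # Confirm that the launcher now resolves inside the environment.
    run conda run -n "$ENV_NAME" which deepspeed
}

setup_env
```

After that, running conda activate with the chosen name in an interactive shell, as in the reply above, puts the environment's bin directory at the front of PATH, so the training script can find deepspeed.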