microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE #3410

Closed janglichao closed 1 year ago

janglichao commented 1 year ago

Describe the bug: running step 2 with this script:

deepspeed DeepSpeedExamples/applications/DeepSpeed-Chat/training/step2_reward_model_finetuning/main.py \
    --data_split 2,4,4 \
    --model_name_or_path facebook/opt-350m \
    --num_padding_at_beginning 1 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --max_seq_len 512 \
    --learning_rate 5e-5 \
    --weight_decay 0.1 \
    --num_train_epochs 1 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --num_warmup_steps 0 \
    --gradient_checkpointing \
    --seed 1234 \
    --zero_stage 0 \
    --deepspeed \
    --output_dir /home/kidd/projects/llms/chatGLM-6B/ChatGLM-6B/chatglm_efficient_tuning/DeepSpeedExamples/output \
    &> /home/kidd/projects/llms/chatGLM-6B/ChatGLM-6B/chatglm_efficient_tuning/DeepSpeedExamples/output/rm_training.log

then got these errors:

CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.

ImportError: /root/.cache/torch_extensions/py310_cu117/fused_adam/fused_adam.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE
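As an aside, the missing symbol is just a mangled C++ name from torch's c10 library, which shows the cached .so was linked against a different torch build than the one installed. A throwaway sketch (my own helper, handling only this plain nested-name form with no templates or argument lists) decodes it:

```python
def demangle_simple(mangled: str) -> str:
    """Decode a simple Itanium-mangled nested name of the form
    _ZN<len><name><len><name>...E into 'a::b::c'. Covers only plain
    nested names, which is enough for the symbol in this error."""
    assert mangled.startswith("_ZN") and mangled.endswith("E")
    body, parts, i = mangled[3:-1], [], 0
    while i < len(body):
        j = i
        while body[j].isdigit():       # read the length prefix
            j += 1
        n = int(body[i:j])
        parts.append(body[j:j + n])    # read exactly n name characters
        i = j + n
    return "::".join(parts)

print(demangle_simple("_ZN3c104cuda20CUDACachingAllocator9allocatorE"))
# c10::cuda::CUDACachingAllocator::allocator
```

`c10::cuda::CUDACachingAllocator::allocator` lives inside torch itself, so the fix is to rebuild the extension (or reinstall matching torch), not to hunt for a missing system library. Note also the separate `which c++` failure above: a C++ compiler must be on PATH for the JIT rebuild to work.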

emilankerwiik commented 1 year ago

@janglichao having the same issue :)

usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE

Did you find a solution?

janglichao commented 1 year ago

> @janglichao having the same issue :)
>
> usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: '/usr/local/lib/python3.10/dist-packages/torchvision/image.so: undefined symbol: _ZN3c104cuda20CUDACachingAllocator9allocatorE
>
> Did you find a solution?

Try running "ds_report"; you may see that some ops are not installed on your system. The fused_adam op should be installed.

niuhuluzhihao commented 1 year ago

@emilankerwiik @janglichao I have the same question. Did you find a solution? My ds_report output is:

(myenv) algo@algogpu:~/mzh/vicuna_0605/scripts$ ds_report
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/algo/.local/lib/python3.10/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/algo/.local/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.3+unknown, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
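One thing the report above surfaces is that torch was built for CUDA 11.7 (`1.13.1+cu117`) while the system nvcc is 11.8, which is exactly the kind of mismatch that breaks JIT-compiled ops. A minimal sketch of that version check (my own helper, parsing the wheel's `+cuXYZ` suffix, not a DeepSpeed API):

```python
import re

def cuda_mismatch(torch_version: str, nvcc_version: str) -> bool:
    """Return True if torch's bundled CUDA differs from the system nvcc.
    torch_version looks like '1.13.1+cu117'; nvcc_version like '11.8'."""
    m = re.search(r"\+cu(\d+)", torch_version)
    if not m:
        return False               # CPU-only build or no local tag
    digits = m.group(1)            # '117' -> major '11', minor '7'
    torch_cuda = f"{digits[:-1]}.{digits[-1]}"
    return torch_cuda != nvcc_version

print(cuda_mismatch("1.13.1+cu117", "11.8"))  # True: 11.7 != 11.8
```

In practice the two versions come from `torch.__version__` and `nvcc --version`; aligning them (reinstalling torch for the system CUDA, or vice versa) is usually the first thing to try.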

emilankerwiik commented 1 year ago

@summer-silence

If I am not mixing up my dependency issues I solved it with this. Best of luck!

!pip install -q xformers==0.0.19 torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchtext==0.15.1 torchaudio==2.0.1 torchdata==0.6.0 --extra-index-url https://download.pytorch.org/whl/cu117

Solution to package problems and compile errors with different CUDA versions: updated with latest torch==2.0.0-compatible package versions. From https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/9341 and https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4629

Unicorncosmos commented 11 months ago

@emilankerwiik hi

> @summer-silence
>
> If I am not mixing up my dependency issues I solved it with this. Best of luck!
>
> !pip install -q xformers==0.0.19 torch==2.0.0+cu117 torchvision==0.15.1+cu117 torchtext==0.15.1 torchaudio==2.0.1 torchdata==0.6.0 --extra-index-url https://download.pytorch.org/whl/cu117
>
> Solution to package problems and compile errors with different CUDA versions: updated with latest torch==2.0.0-compatible package versions. From AUTOMATIC1111/stable-diffusion-webui#9341 and AUTOMATIC1111/stable-diffusion-webui#4629

What Python version is used here?

Justinfungi commented 5 months ago

Clear the cache with rm -rf ~/.cache so the cached extension files are reset and rebuilt. It happens because you accidentally removed the source files the linked .so was built from, I think.
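Deleting all of ~/.cache is heavy-handed; the stale artifact lives under ~/.cache/torch_extensions (the path in the original error), and removing just that directory forces a clean JIT rebuild. A minimal sketch, written as a function over an arbitrary cache root so it is easy to test (the helper name is my own):

```python
import shutil
from pathlib import Path

def clear_torch_extensions(cache_root: Path) -> bool:
    """Remove the torch_extensions JIT cache under cache_root
    (normally ~/.cache) so stale .so files are rebuilt on the next
    run. Returns True if a cache directory was actually removed."""
    target = cache_root / "torch_extensions"
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Typical use would be `clear_torch_extensions(Path.home() / ".cache")`, after which the next deepspeed launch recompiles fused_adam against the currently installed torch.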

zhzihao commented 4 months ago

I solved it by changing transformers to 4.37.2 and flash-attn to 2.4.2.