microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Training failed on Huawei Ascend 910 Platform #4973

Closed ZhuoranLyu closed 7 months ago

ZhuoranLyu commented 8 months ago

Describe the bug
Training the Baichuan-13B model fails on the Huawei Ascend platform when using DeepSpeed ZeRO Stage 3 with CPU offload.

To Reproduce
Steps to reproduce the behavior:

  1. Clone this repo: https://gitee.com/ascend/ModelZoo-PyTorch/tree/master/PyTorch/built-in/foundation/Baichuan-13B
  2. Use 1 node with 8 NPUs.
  3. JSON config:
    "zero_optimization": {
      "stage": 3,
      "offload_optimizer": {
        "device": "cpu",
        "pin_memory": false
      },
      "offload_param": {
        "device": "cpu",
        "pin_memory": false
      }
    },
  4. Run the training script.

Expected behavior
Training runs normally.

ds_report output

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
deepspeed_not_implemented  [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/js_lvzhuoran/anaconda3/envs/py310/lib/python3.10/site-packages/torch']
torch version .................... 2.1.0
deepspeed install path ........... ['/home/js_lvzhuoran/anaconda3/envs/py310/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.12.6, unknown, unknown
deepspeed wheel compiled w. ...... torch 2.1
torch_npu install path ........... ['/home/js_lvzhuoran/anaconda3/envs/py310/lib/python3.10/site-packages/torch_npu']
torch_npu version ................ 2.1.0
ascend_cann version .............. 7.0.0
shared memory (/dev/shm) size .... 1005.80 GB

Screenshots: two images attached showing the error raised during training.

It looks like it's trying to compare a tensor on CPU with another tensor on GPU/NPU?

System info (please complete the following information):

Launcher context

HCCL_CONNECT_TIMEOUT=1200 deepspeed  --num_gpus ${NUM_GPUS_PER_NODE} src/train_bash.py \
    --stage sft \
    --model_name_or_path /home/js_lvzhuoran/NLP/LLM/LLaMA-Factory/baichuan-inc/Baichuan2-13B-Chat \
    --deepspeed ./ds_config_zero2.json \
    --do_train \
    --dataset kf_sample \
    --template default \
    --finetuning_type full \
    --output_dir ./output_sft \
    --overwrite_cache \
    --overwrite_output_dir \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type cosine \
    --gradient_checkpointing True \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 1e-6 \
    --num_train_epochs 2.0 \
    --plot_loss \
    --max_samples 200 \
    --fp16 

Docker context: no

Additional context: no

ZhuoranLyu commented 8 months ago

Also tested on Nvidia GPUs; I'm wondering why this line compares a tensor on CPU with a tensor on GPU, which is confusing :( https://github.com/microsoft/DeepSpeed/blob/870ae041d42190be8139afc12bef51d6ed7719f3/deepspeed/runtime/zero/stage3.py#L2081C37-L2081C37

ZhuoranLyu commented 8 months ago

If I move the second tensor to NPU using .to('npu'), it works smoothly.
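For illustration, here is a minimal sketch of the pattern being described, with hypothetical tensor names standing in for the ones in stage3.py (this is not the literal DeepSpeed code):

import torch
import torch_npu  # registers the "npu" device (Ascend backend)

# Hypothetical setup: the overflow tracker lives on the NPU, while the
# gradient buffer has been offloaded to the CPU by ZeRO Stage 3.
inf_or_nan_tracker = torch.zeros(1, dtype=torch.bool, device="npu")
grad_buffer = torch.zeros(1024, device="cpu")

# Mixing the two devices in one in-place op is what raised the error on torch_npu:
# inf_or_nan_tracker.logical_or_(torch.isinf(grad_buffer).any())

# The workaround described above: move the CPU-side operand onto the NPU first.
inf_or_nan_tracker.logical_or_(torch.isinf(grad_buffer).any().to("npu"))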

wangshuai09 commented 8 months ago

If I move the second tensor to NPU using .to('npu'), it works smoothly.

Tensors in one op should be on the same device. Both GPU and NPU could raise an error, because the second tensor is on CPU.

ZhuoranLyu commented 8 months ago

If I move the second tensor to NPU using .to('npu'), it works smoothly.

Tensors in one op should be on the same device. Both GPU and NPU could raise an error, because the second tensor is on CPU.

However, it works fine on Nvidia GPU. Really strange.

wangshuai09 commented 8 months ago

On GPU:

>>> import torch 
>>> a = torch.zeros(1, device="cpu")
>>> b = torch.zeros(1, device="cuda")
>>> b.logical_or_(torch.isinf(a))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
ZhuoranLyu commented 8 months ago

On GPU:

>>> import torch 
>>> a = torch.zeros(1, device="cpu")
>>> b = torch.zeros(1, device="cuda")
>>> b.logical_or_(torch.isinf(a))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

I totally understand, but if you check the tensors at https://github.com/microsoft/DeepSpeed/blob/870ae041d42190be8139afc12bef51d6ed7719f3/deepspeed/runtime/zero/stage3.py#L2081C37-L2081C37 you'll find two tensors on different devices.

wangshuai09 commented 8 months ago

This also confuses me. I don't know why it can run on the GPU.
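One plausible explanation (an assumption on my part, not confirmed in this thread): PyTorch's binary ops treat a 0-dim CPU tensor as a scalar operand and accept it next to a CUDA tensor, so the mixed-device error only appears when the CPU-side operand has at least one dimension, as in the repro above; torch_npu at this version apparently did not implement that CPU-scalar promotion. A small sketch:

import torch

b = torch.zeros(1, dtype=torch.bool, device="cuda")

# 1-element CPU tensor (ndim == 1): raises the mixed-device RuntimeError, as in the repro above.
# b.logical_or_(torch.isinf(torch.zeros(1, device="cpu")))

# 0-dim CPU tensor: treated as a CPU scalar and accepted alongside the CUDA operand.
a = torch.zeros((), device="cpu")
b.logical_or_(torch.isinf(a))  # no error on CUDA
print(b)                       # tensor([False], device='cuda:0')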

misstek commented 8 months ago

Hi, @ZhuoranLyu, we have supported this operation on different devices in the latest version of torch_npu; please update the torch_npu package as follows. If this issue still exists, please provide more detailed information such as the versions of CANN, torch and torch_npu.

master/v2.2.0
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/master/20231225.1/pytorch_master_py38.tar.gz
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/master/20231225.1/pytorch_master_py39.tar.gz
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/master/20231225.1/pytorch_master_py310.tar.gz

v2.0.1
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.0.1/20231225.2/pytorch_v2.0.1_py38.tar.gz
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.0.1/20231225.2/pytorch_v2.0.1_py39.tar.gz
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.0.1/20231225.2/pytorch_v2.0.1_py310.tar.gz

v1.11.0
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20231225.2/pytorch_v1.11.0_py37.tar.gz
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20231225.2/pytorch_v1.11.0_py38.tar.gz
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20231225.2/pytorch_v1.11.0_py39.tar.gz
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v1.11.0/20231225.2/pytorch_v1.11.0_py310.tar.gz

v2.1.0
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.1.0/20231225.2/pytorch_v2.1.0_py38.tar.gz
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.1.0/20231225.2/pytorch_v2.1.0_py39.tar.gz
https://pytorch-package.obs.cn-north-4.myhuaweicloud.com/pta/Daily/v2.1.0/20231225.2/pytorch_v2.1.0_py310.tar.gz

minchao-sun commented 8 months ago

Hi, @ZhuoranLyu, we have supported this operation on different devices in the latest version of torch_npu; please update the torch_npu package as follows. [...]

@ZhuoranLyu Please note that these are nightly build versions that may provide a workaround for your issue. Remember to switch to a stable release version later.

ZhuoranLyu commented 7 months ago

Hi, @ZhuoranLyu, we have supported this operation on different devices in the latest version of torch_npu; please update the torch_npu package as follows. [...]

Thanks a lot!