microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
35.34k stars 4.1k forks

[BUG] DeepSpeedTransformerInference throws kernel execution error. #4296

Open. ramyaprabhu-alt opened this issue 1 year ago

ramyaprabhu-alt commented 1 year ago

Describe the bug

[image: screenshot of the script]

I was trying to run the script above and ran into this error:

[image: screenshot of the error output]

I don't even know how to start debugging to understand where the problem is.

To Reproduce

Steps to reproduce the behavior: just run the code in the first screenshot. No changes were made to DeepSpeed.

System info (please complete the following information):

mrwyattii commented 1 year ago

@ramyaprabhu-alt can you please copy and paste the code you are running to reproduce the bug, so I don't have to rewrite it? Thanks

ramyaprabhu-alt commented 1 year ago

from deepspeed.ops.transformer.inference.config import DeepSpeedInferenceConfig
from deepspeed.model_implementations.transformers.ds_transformer import DeepSpeedTransformerInference
import torch
import deepspeed
from numpy import random

# Deliberately tiny layer: hidden size 5, a single attention head, fp32.
config = DeepSpeedInferenceConfig(
    hidden_size=5,
    intermediate_size=20,
    heads=1,
    dtype=torch.float32,
    pre_layer_norm=False,
)
model = DeepSpeedTransformerInference(config=config)

# Random integer input of shape (batch=1, seq_len=1, hidden_size=5).
x = random.randint(100, size=(1, 1, 5))
print(x)
model(torch.Tensor(x))
print(deepspeed.__version__)

mrwyattii commented 1 year ago

I just ran the reproducer you shared and I'm unable to replicate this error (using the latest DeepSpeed, CUDA 11.8, Torch 2.0, A6000 GPU). Could you share the output of ds_report? Thank you

cupertank commented 1 year ago

I have the same issue on an A100 80GB (driver version 535.104.12, CUDA 11.7, Torch 1.13.1, DeepSpeed built from master). I ran the same script like this:

CUDA_LAUNCH_BLOCKING=1 python test.py

My ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ilya_vologin/llama_deepspeed/venv/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/home/ilya_vologin/llama_deepspeed/venv/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.10.4+f8d3ec7f, f8d3ec7f, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
shared memory (/dev/shm) size .... 83.53 GB

Output:

[2023-09-25 17:46:41,742] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 5, 'intermediate_size': 20, 'heads': 1, 'num_hidden_layers': -1, 'dtype': torch.float32, 'pre_layer_norm': False, 'norm_type': <NormType.LayerNorm: 1>, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-12, 'mp_size': 1, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False, 'use_triton': False, 'triton_autotune': False, 'num_kv': -1}
[[[59 99 69 82 61]]]
------------------------------------------------------
Free memory : 78.584106 (GigaBytes)  
Total memory: 79.151001 (GigaBytes)  
Requested memory: 0.005371 (GigaBytes) 
Setting maximum total tokens (input + output) to 1024 
WorkSpace: 0x7f3631c00000 
------------------------------------------------------
!!!! kernel execution error. (m: 15, n: 1, k: 5, error: 13) 
!!!! kernel execution error. (batch: 1, m: 1, n: 1, k: 5, error: 13) 
!!!! kernel execution error. (batch: 1, m: 5, n: 1, k: 1, error: 13) 
!!!! kernel execution error. (m: 5, n: 1, k: 5, error: 13) 
!!!! kernel execution error. (m: 20, n: 1, k: 5, error: 13) 
!!!! kernel execution error. (m: 5, n: 1, k: 20, error: 13) 
0.10.4+f8d3ec7f
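
For anyone reading the log above: a minimal sketch for decoding the "kernel execution error" lines, assuming the trailing error number is a cuBLAS cublasStatus_t value (under that assumption, 13 corresponds to CUBLAS_STATUS_EXECUTION_FAILED). The helper name decode_line and the regular expression are hypothetical and not part of DeepSpeed; the status table is a subset of the values in the cuBLAS headers.

```python
import re

# Subset of cublasStatus_t values from the cuBLAS headers.
# Assumption: the "error" field in the log is one of these codes.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
}

# Matches the parenthesised "key: value" list of one error line.
LINE_RE = re.compile(r"kernel execution error\. \((?P<fields>[^)]*)\)")


def decode_line(line):
    """Parse one error line into a dict of its integer fields plus a status name.

    Returns None if the line is not a kernel execution error line.
    """
    match = LINE_RE.search(line)
    if match is None:
        return None
    fields = {}
    for part in match.group("fields").split(","):
        key, _, value = part.partition(":")
        fields[key.strip()] = int(value.strip())
    fields["status"] = CUBLAS_STATUS.get(fields.get("error"), "unknown")
    return fields


if __name__ == "__main__":
    sample = "!!!! kernel execution error. (m: 15, n: 1, k: 5, error: 13) "
    print(decode_line(sample))
```

Every line in the log carries the same code, so under this reading all six GEMM calls (m, n, k are the matrix dimensions) failed at execution time. That would point at the environment or the kernel launch rather than at one particular shape.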

mrwyattii commented 1 year ago

I am able to replicate the error now. I needed to add CUDA_LAUNCH_BLOCKING=1; without it, I did not see the kernel execution error. The error seems to occur in ds_linear_layer: https://github.com/microsoft/DeepSpeed/blob/0636c74c5e27757d48f64f33f330d7bb975fc5a8/csrc/transformer/inference/csrc/pt_binding.cpp#L1097

@RezaYazdaniAminabadi any ideas?
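
For others trying to reproduce this: the CUDA_LAUNCH_BLOCKING=1 setting mentioned above can also be applied from inside the script instead of on the command line. The sketch below assumes the variable has to be in the environment before CUDA is initialised, so it is set before torch or deepspeed are imported. The helper name enable_blocking_launches is hypothetical.

```python
import os


def enable_blocking_launches(env=None):
    """Set CUDA_LAUNCH_BLOCKING=1 in the given mapping (default: os.environ).

    With this set, CUDA kernel launches run synchronously, so a failure is
    reported at the call that caused it rather than at some later point.
    """
    env = os.environ if env is None else env
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Call this at the very top of the script, before importing torch or deepspeed.
enable_blocking_launches()

# The reproducer from earlier in the thread would follow here, unchanged.
```

This is equivalent to prefixing the command with CUDA_LAUNCH_BLOCKING=1. It slows execution down, so it is only useful while debugging.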

haoranlll commented 4 months ago

I encountered the same issue on a V100. What should I do to solve the problem? Thank you for your help.