Closed: hyhuang00 closed this issue 2 years ago.
The problem seems to be rooted in the ds_qkv_gemm implementation under FP16. The kernel works fine when handling FP32 inputs; however, when running under FP16, only the inp_norm output is returned correctly. I would appreciate it if anyone could look at the implementation under DeepSpeed/deepspeed/ops/csrc/transformer/inference/csrc/pt_binding.cpp.
Here is a screenshot produced by the same script at different precisions. On the left are the results of a dense layer given FP32 inputs, and on the right are the results of the same layer given FP16 inputs, with --ds-inference enabled. The qkv values calculated by ds_qkv_gemm are incorrectly masked to 0s.
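For anyone trying to reproduce the precision gap outside the full MoE script, below is a minimal sketch of the same FP32-vs-FP16 comparison. It is not the reporter's code: it assumes a Hugging Face GPT-2 model as a stand-in for the Megatron MoE checkpoint, uses only the public deepspeed.init_inference API, and simply checks whether the kernel-injected FP16 path returns zeroed/NaN outputs where the FP32 path does not. Run on a single GPU (plain python or the deepspeed launcher).

# Minimal FP32-vs-FP16 comparison sketch.
# Assumption: HF GPT-2 as a stand-in model; the original report uses a
# Megatron-DeepSpeed MoE checkpoint driven by generate_samples_gpt.py instead.
import torch
import deepspeed
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
inputs = tokenizer("DeepSpeed inference test", return_tensors="pt").to("cuda")

outputs = {}
for dtype in (torch.float32, torch.float16):
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    engine = deepspeed.init_inference(model,
                                      mp_size=1,
                                      dtype=dtype,
                                      replace_with_kernel_inject=True)
    with torch.no_grad():
        logits = engine(**inputs).logits.float()
    outputs[dtype] = logits
    print(dtype, "any NaN:", torch.isnan(logits).any().item(),
          "all zero:", (logits == 0).all().item())

# Under the reported bug, the FP16 logits diverge from FP32 (zeros/NaNs)
# instead of agreeing within half-precision tolerance.
print("max abs diff:",
      (outputs[torch.float32] - outputs[torch.float16]).abs().max().item())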
Hi @hyhuang00, thanks for catching this. I will look into it. Can you please provide me with the script you are using to run this? -Reza
Sure, here is the script I'm using. I made some modifications to deepspeed/module_inject/replace_module.py to ensure the args and flags are respected by the deepspeed.init_inference() function. Besides the FP16 kernel execution error reported above, I also ran into failing data parallelism and an all-to-all halting problem, as reported in the discussion.
#!/bin/bash
DIR=/home/hyhuang/moe_output/deepspeedlm_15B
###############################################################################
### Main configs
## GPT-3 models use 2K sequence length/context window
SEQ_LEN=2048
NUM_GPUS=8
NUM_NODES=1
EP_SIZE=8 # EP=1 is normal TFMR
### The "GPT-3 XXX" below are configs from GPT-3 paper
### https://arxiv.org/abs/2005.14165, choose based on
### your desired model size or build your own configs
## GPT-3 Small 125M
MODEL_SIZE=0.125
NUM_LAYERS=12
HIDDEN_SIZE=768
NUM_ATTN_HEADS=12
GLOBAL_BATCH_SIZE=256
# LR=6.0e-4
# MIN_LR=6.0e-5
###############################################################################
### Training duration configs
## The main termination condition, original GPT-3 paper trains for 300B tokens
## For MoE model, we found sometimes training a bit more to 330B tokens helps
TRAIN_TOKENS=300000000000
# TRAIN_TOKENS=330000000000
## TRAIN_ITERS is another termination condition and also affects the number of
## data samples to be indexed. Since we want to reach the TRAIN_TOKENS above,
## and techniques like curriculum learning use fewer tokens in some steps, we
## just set this config large enough to make sure we have enough processed
## data and don't terminate because of TRAIN_ITERS.
TRAIN_ITERS=$(( ${TRAIN_TOKENS} * 3 / ${GLOBAL_BATCH_SIZE} / ${SEQ_LEN} ))
## Another termination condition in minutes. Set it large enough to avoid
## undesired early termination.
EXIT_DURATION=30000000
###############################################################################
### LR configs
## LR warmup and decay duration; this token-based config is preferable since
## it needs no readjustment when the batch size/seqlen is changed.
## Original GPT-3 paper uses 375M warmup tokens and 260B decay tokens.
## For MoE model, we found that setting the decay token to 300B helps.
WARMUP_TOKENS=375000000
# LR_DECAY_TOKENS=260000000000
LR_DECAY_TOKENS=300000000000
###############################################################################
### Parallelism configs
## Micro batch size per GPU
## Make sure that BATCH_SIZE <= GLOBAL_BATCH_SIZE*PP_SIZE*MP_SIZE/NUM_GPUS
BATCH_SIZE=1
## Model parallelism, 1 is no MP
## Currently MoE models have divergence issue when MP > 1.
MP_SIZE=1
## Pipeline parallelism
## Currently we don't support PP for MoE. To disable PP, set PP_SIZE
## to 1 and use the "--no-pipeline-parallel" arg.
PP_SIZE=1
###############################################################################
### MoE configs
## Number of experts. EP_SIZE 1 means dense model without MoE
# EP_SIZE=1
if [[ $EP_SIZE -gt $NUM_GPUS ]]; then
EP_PARALLEL_SIZE=$NUM_GPUS
else
EP_PARALLEL_SIZE=$EP_SIZE
fi
## Original GPT-3 model always set min LR at 10% of max LR. For MoE model, we
## found that lower LR and min LR (than the base dense model) helps.
## For 1.3B MoE-128 model we used LR=1.2e-4 and MIN_LR=1.0e-6.
## For 350M MoE-128 model we used LR=2.0e-4 and MIN_LR=2.0e-6, but they are not
## heavily tuned.
LR=2.0e-4
MIN_LR=2e-06
## Coefficient for MoE loss. We find that 0.01 is a good value at least for
## 1.3B MoE-128 model
MLC=0.01
## Below configs adjust the MoE expert token capacity limit during training and
## eval. To completely disable capacity limit, set MOE_DROP_TOKEN to false.
## Larger capacity factor or disabling capacity limit could improve training
## convergence, but will also reduce training throughput.
MOE_TRAIN_CAP_FACTOR=2.0 # Temporary fix to 10
MOE_EVAL_CAP_FACTOR=2.0 # Temporary fix to 10
MOE_MIN_CAP=0 # for small batchsize
MOE_DROP_TOKEN="true"
# MOE_DROP_TOKEN="false" # Tried it to see if it solves the problem -- nope -- previously this option would get overridden by default
###############################################################################
### Curriculum learning (CL) configs
## Enable/disable CL
CL_ENABLED="false"
## Consult the tutorial https://www.deepspeed.ai/tutorials/curriculum-learning/
## for tuning the following configs
CL_START_SEQLEN=80
CL_AVG_SEQLEN=$(( (${CL_START_SEQLEN} + ${SEQ_LEN}) / 2 ))
CL_TOKENS=60
CL_TOKENS=$((${CL_TOKENS} * 1000000000))
CL_STEP=$(( ${CL_TOKENS} / (${GLOBAL_BATCH_SIZE} * ${CL_AVG_SEQLEN}) ))
###############################################################################
### Misc configs
LOG_INTERVAL=10
EVAL_ITERS=10
EVAL_INTERVAL=100
SAVE_INTERVAL=10000
## Standard deviation for weight initialization
## We used 0.014 for 350M/1.3B dense/MoE models, and used 0.01 for 6.7B
## dense model. Usually larger model needs lower std.
INIT_STD=0.014
# INIT_STD=0.01
## Activation checkpointing saves GPU memory, but reduces training speed
# ACTIVATION_CHECKPOINT="true"
ACTIVATION_CHECKPOINT="false"
###############################################################################
### Output and data configs
current_time=$(date +"%y%m%d%H%M%S")
host="${HOSTNAME}"
NAME="gpt-${MODEL_SIZE}B-lr-${LR}-minlr-${MIN_LR}-bs-${GLOBAL_BATCH_SIZE}-gpus-${NUM_GPUS}-mp-${MP_SIZE}-pp-${PP_SIZE}"
if [[ $EP_SIZE -gt 1 ]]; then
NAME="${NAME}-ep-${EP_SIZE}-mlc-${MLC}-cap-${MOE_TRAIN_CAP_FACTOR}-drop-${MOE_DROP_TOKEN}"
fi
if [ "${CL_ENABLED}" = "true" ]; then
NAME="${NAME}-cl-${CL_START_SEQLEN}-${CL_STEP}"
fi
OUTPUT_BASEPATH=$DIR/trial/$current_time
mkdir -p ${OUTPUT_BASEPATH}
mkdir -p "${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p "${OUTPUT_BASEPATH}/checkpoint/"
# mkdir -p "${OUTPUT_BASEPATH}/log/"
TENSORBOARD_DIR="${OUTPUT_BASEPATH}/tensorboard/"
mkdir -p ${TENSORBOARD_DIR}
## Note that for MoE model with billion-scale base model, the checkpoint can be
## as large as TB-scale which normal NFS cannot handle efficiently.
# USE_INTERNAL_DATA="true"
USE_INTERNAL_DATA="false"
if [ "${USE_INTERNAL_DATA}" = "true" ]; then
## The internal data is only accessible within Microsoft
## For cluster Azure-EastUS-V100-32GB-4, Azure-WestUS3-A100
# BASE_DATA_PATH=/vc_data/Megatron-LM/data
# DATA_HOME="/vc_data/pile-cc1-cc2-shuf"
## For cluster Lab-RR1-V100
BASE_DATA_PATH=/data/Megatron-LM/data
DATA_HOME="/turing-ssd/users/conglli/data/pile-cc1-cc2-shuf"
## For cluster Azure-CentralUS-A100
# BASE_DATA_PATH=/data/Megatron-LM/data
# DATA_HOME=/vc_data_1/users/amawa/blended
VOCAB_PATH=${BASE_DATA_PATH}/gpt2-vocab.json
MERGE_PATH=${BASE_DATA_PATH}/gpt2-merges.txt
ARX="${DATA_HOME}/ArXiv_ftfy_cleaned_id_shuf_text_document"
BC2="${DATA_HOME}/BookCorpus2_ftfy_cleaned_id_shuf_text_document"
B3="${DATA_HOME}/Books3_ftfy_cleaned_id_shuf_text_document"
CC2020="${DATA_HOME}/CC-2020-50_id_cleaned_shuf_text_document"
CC2021="${DATA_HOME}/CC-2021-04_id_cleaned_shuf_text_document"
GIT="${DATA_HOME}/Github_ftfy_id_shuf_text_document"
GUT="${DATA_HOME}/Gutenberg_PG-19_ftfy_cleaned_id_cleaned_shuf_text_document"
NIH="${DATA_HOME}/NIH_ExPorter_ftfy_id_shuf_text_document"
OWT2="${DATA_HOME}/OpenWebText2_ftfy_cleaned_id_shuf_text_document"
PCC="${DATA_HOME}/Pile-CC_id_cleaned_shuf_text_document"
PM="${DATA_HOME}/PubMed_Abstracts_ftfy_id_shuf_text_document"
RN="${DATA_HOME}/rn_dedup_shuf_cleaned_0.7_cleaned_shuf_text_document"
SE="${DATA_HOME}/StackExchange_ftfy_id_shuf_text_document"
ST="${DATA_HOME}/stories_dedup0.7_shuf_cleaned_shuf_text_document"
WIK="${DATA_HOME}/Wikipedia_en_ftfy_id_shuf_text_document"
DATA_BLEND="0.14336 ${B3} 0.08962 ${RN} 0.19336 ${OWT2} 0.05689 ${SE} \
0.00859 ${ST} 0.02897 ${PM} 0.04771 ${WIK} 0.00873 ${GUT} 0.01007 ${BC2} \
0.00208 ${NIH} 0.13017 ${CC2020} 0.09446 ${PCC} 0.15652 ${CC2021} \
0.01359 ${ARX} 0.01588 ${GIT}"
else
VOCAB_PATH=/home/hyhuang/moe_custom/deepspeed/checkpoint_creation/gpt2_hf_vocab.json
MERGE_PATH=/home/hyhuang/moe_custom/deepspeed/checkpoint_creation/gpt2_hf_merges.txt
# The public Pile dataset can be downloaded at https://mystic.the-eye.eu/public/AI/pile_neox/
# DATA_BLEND=/datasets01/wikitext/060817/wikitext-2/wiki.test.tokens
DATA_BLEND=/home/hyhuang/moe_custom/deepspeed/inference/sample.txt
fi
###############################################################################
data_options=" \
--vocab-file ${VOCAB_PATH} \
--merge-file ${MERGE_PATH} \
--data-path ${DATA_BLEND} \
--data-impl mmap"
megatron_options=" \
--override-lr-scheduler \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--tensor-model-parallel-size ${MP_SIZE} \
--moe-expert-parallel-size ${EP_PARALLEL_SIZE} \
--num-experts ${EP_SIZE} \
--moe-loss-coeff ${MLC} \
--moe-train-capacity-factor ${MOE_TRAIN_CAP_FACTOR} \
--moe-eval-capacity-factor ${MOE_EVAL_CAP_FACTOR} \
--moe-min-capacity ${MOE_MIN_CAP} \
--init-method-std ${INIT_STD} \
--lr-decay-tokens ${LR_DECAY_TOKENS} \
--lr-warmup-tokens ${WARMUP_TOKENS} \
--micro-batch-size ${BATCH_SIZE} \
--exit-duration-in-mins ${EXIT_DURATION} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTN_HEADS} \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-tokens ${TRAIN_TOKENS} \
--train-iters ${TRAIN_ITERS} \
--lr ${LR} \
--min-lr ${MIN_LR} \
--lr-decay-style cosine \
--split 98,2,0 \
--log-interval ${LOG_INTERVAL} \
--eval-interval ${EVAL_INTERVAL} \
--eval-iters ${EVAL_ITERS} \
--save-interval ${SAVE_INTERVAL} \
--weight-decay 0.1 \
--clip-grad 1.0 \
--hysteresis 2 \
--num-workers 0 \
--tensorboard-queue-size 1 \
--log-timers-to-tensorboard \
--log-batch-size-to-tensorboard \
--log-validation-ppl-to-tensorboard \
--tensorboard-dir ${TENSORBOARD_DIR}"
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
megatron_options="${megatron_options} \
--checkpoint-activations"
fi
if [[ $EP_SIZE -gt 1 ]]; then
megatron_options="${megatron_options} \
--create-moe-param-group"
fi
if [ "${MOE_DROP_TOKEN}" = "false" ]; then
megatron_options="${megatron_options} \
--disable-moe-token-dropping"
fi
# CHECK FP16
template_json="ds_config_gpt_TEMPLATE.json"
config_json="ds_config_gpt_${NAME}.json"
sed "s/CONFIG_BATCH_SIZE/${GLOBAL_BATCH_SIZE}/" ${template_json} \
| sed "s/CONFIG_MBSIZE/${BATCH_SIZE}/" \
| sed "s/LOG_INTERVAL/${LOG_INTERVAL}/" \
| sed "s/ZERO_STAGE/0/" \
| sed "s/PRESCALE_GRAD/true/" \
| sed "s/CONFIG_FP16_ENABLED/true/" \
| sed "s/CONFIG_BF16_ENABLED/false/" \
| sed "s/CONFIG_CL_ENABLED/${CL_ENABLED}/" \
| sed "s/CONFIG_CL_MIN/${CL_START_SEQLEN}/" \
| sed "s/CONFIG_CL_MAX/${SEQ_LEN}/" \
| sed "s/CONFIG_CL_DURATION/${CL_STEP}/" \
> ${config_json}
deepspeed_options=" \
--deepspeed \
--deepspeed_config ${config_json} \
--pipeline-model-parallel-size ${PP_SIZE}"
# Currently MoE is not compatible with pipeline parallel
if [[ $EP_SIZE -gt 1 ]]; then
deepspeed_options="${deepspeed_options} \
--no-pipeline-parallel"
fi
if [ "${ACTIVATION_CHECKPOINT}" = "true" ]; then
deepspeed_options="${deepspeed_options} \
--deepspeed-activation-checkpointing"
fi
custom_options=" \
--expert-interval 2 \
--topk 1 \
--out-seq-length 10 \
--ds-inference \
--fp16
"
# --use-tutel \
# Generate 10 tokens only
# --fp16 \ # FP16 is numerically unstable
# Disable ds inference to resolve nan problems
export CUDA_LAUNCH_BLOCKING=1 # set launch blocking
echo "CUDA_BLOCKING=$CUDA_LAUNCH_BLOCKING"
echo $CUDA_HOME
SUB_DIR="deepspeedlm_15B/inference_trial"
TIMESTAMP=$(date +"%y%m%d%H%M%S")
OUT_DIR=/home/hyhuang/moe_output/$SUB_DIR/$TIMESTAMP
mkdir -p $OUT_DIR
echo $OUT_DIR
run_cmd="deepspeed /home/hyhuang/moe_custom/deepspeed/inference/generate_samples_gpt.py ${megatron_options} ${data_options} ${deepspeed_options} ${custom_options} > ${OUT_DIR}/eval.out 2> ${OUT_DIR}/eval.err"
echo ${run_cmd}
eval ${run_cmd}
set +x
The same kind of problem occurs when I run generate_text.sh here. I've also tried the same thing with the nvcr.io/nvidia/pytorch:20.12-py3 Docker image, but the same error occurred.
Error log:
...
> DeepSpeed Inference engine initialized
** On entry to GEMM_EX parameter number 18 had an illegal value
!!!! kernel execution error. (m: 0, n: 1, k: 128, error: 7)
** On entry to GEMM_EX parameter number 13 had an illegal value
!!!! kernel execution error. (m: 128, n: 1, k: 0, error: 7)
(the two GEMM_EX messages above repeat many more times in the log)
Traceback (most recent call last):
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 168, in <module>
main()
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/tools/generate_samples_gpt.py", line 144, in main
generate_and_write_samples_unconditional(model, latencies, single_token_latency, model_latencies)
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 378, in generate_and_write_samples_unconditional
for datum in generate_samples_unconditional(model, latencies=latencies, model_latencies=model_latencies, single_token_latency=single_token_latency):
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 338, in generate_samples_unconditional
for token_stream in get_token_stream(model,
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 423, in get_token_stream
for tokens, lengths in batch_token_iterator:
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 543, in sample_sequence_batch
output, layer_past = forward_step(model, tokens2use,
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/text_generation_utils.py", line 467, in forward_step
output_tensor = model(tokens, position_ids, attention_mask,
File "/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1120, in _call_impl
result = forward_call(*input, **kwargs)
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/model/distributed.py", line 71, in forward
return self.module(*inputs, **kwargs)
File "/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/model/module.py", line 172, in forward
outputs = self.module(*inputs, **kwargs)
File "/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/model/gpt_model.py", line 120, in forward
lm_output, *moe_losses = self.language_model(
File "/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/model/language_model.py", line 389, in forward
encoder_output, *moe_losses = self.encoder(encoder_input,
File "/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/frameworks/Megatron-DeepSpeed/megatron/model/transformer.py", line 784, in forward
hidden_states = layer(hidden_states,
File "/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/moe_inference.py", line 434, in forward
dispatched_attention, combined_weights = self.moe_gate_einsum(attention_output)
File "/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/deepspeed/ops/transformer/inference/moe_inference.py", line 320, in moe_gate_einsum
_, combined_weights, dispatch_mask, _ = self.moe_gate(
File "/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/deepspeed/moe/sharded_moe.py", line 417, in forward
gate_output = top1gating(
File "/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/deepspeed/moe/sharded_moe.py", line 212, in top1gating
exp_counts = torch.sum(mask1, dim=0).detach().to('cpu')
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: an illegal memory access was encountered
System info (please complete the following information):
OS: Ubuntu 22.04
GPU count and types: one machine with one RTX 3090 GPU
Python version: 3.9
Below is the result of ds_report:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/torch']
torch version .................... 1.10.2
torch cuda version ............... 11.1
torch hip version ................ None
nvcc version ..................... 11.1
deepspeed install path ........... ['/home/ubuntu/miniconda3/envs/megatron/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.6.7, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6
Hi @hyhuang00,
I believe this issue may have been resolved with PR 2212 in the DeepSpeed repo.
Can you please try installing the latest version of DeepSpeed and running again to see if that resolves the issue?
Thanks, Lev
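For reference, here is a minimal sketch (not an official DeepSpeed procedure) of how one might confirm the environment actually picked up a build newer than the 0.6.7 shown in the ds_report above before re-running the FP16 path; the exact release that carries the fix is not stated in this thread, so treat "newer than 0.6.7" as a necessary rather than sufficient condition.

# Version sanity check before re-testing (sketch; assumes the fix shipped
# in a release after the 0.6.7 reported above).
import deepspeed
from packaging import version  # packaging is available in most Python environments

installed = version.parse(deepspeed.__version__)
print("deepspeed version:", deepspeed.__version__)
if installed <= version.parse("0.6.7"):
    print("Still on the old build; upgrade with `pip install -U deepspeed` and re-run.")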
@hyhuang00 please re-open if you are still having an issue.
Describe the bug I tried to use the DeepSpeed package to run inference with MoE model checkpoints, but running the model leads to a CUDA illegal memory access error. The error always happens when execution reaches the 8th/9th layer. At the same time, any activation after the 3rd layer becomes NaN.
To Reproduce I substituted the command "run_cmd" with "deepspeed generate_samples_gpt.py ${megatron_options} ${data_options} ${deepspeed_options}" in the script Megatron-DeepSpeed/examples/MoE/ds_pretrain_gpt_125M_MoE64.sh. I also reduced the number of experts to 2 to save memory, and changed other configuration options to match the number of nodes (1) and GPUs (1) I am using.
Expected behavior The generate_samples_gpt.py script should be able to finish.
Launcher context With deepspeed launcher.