triton-inference-server / fastertransformer_backend


GPT-J streaming: getting garbage response #91

Open · vax-dev opened this issue 1 year ago

vax-dev commented 1 year ago

Description

branch: main
fastertransformer docker: 22.12

Reproduced Steps

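# WORKSPACE and TRITON_DOCKER_IMAGE must already be exported in the host shell before this command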
docker run -it --rm --gpus=all --shm-size=1g --ulimit memlock=-1 -v ${WORKSPACE}:${WORKSPACE} -w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash
# now in docker

export WORKSPACE=$(pwd)
export SRC_MODELS_DIR=${WORKSPACE}/models
git clone https://github.com/NVIDIA/FasterTransformer.git # used for converting the checkpoint and checking the Triton output
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P models
wget https://mystic.the-eye.eu/public/AI/GPT-J-6B/step_383500_slim.tar.zstd
mkdir ${SRC_MODELS_DIR}/gptj/ -p
tar -axf step_383500_slim.tar.zstd -C ${SRC_MODELS_DIR}/gptj/
pip install scipy
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptj/utils/gptj_ckpt_convert.py \
        --output-dir ${WORKSPACE}/all_models/gptj/fastertransformer/1 \
        --ckpt-dir ${SRC_MODELS_DIR}/gptj/step_383500/ \
        --n-inference-gpus 2
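
If the conversion succeeds, the weights should land in a 2-gpu subdirectory matching --n-inference-gpus 2; a quick sanity check (the directory layout is assumed from the converter's convention):

ls ${WORKSPACE}/all_models/gptj/fastertransformer/1/2-gpu/  # expect config.ini plus the converted weight files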

Enabled decoupled mode in config.pbtxt
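
For reference, decoupled mode is switched on through the model transaction policy in config.pbtxt, and tensor_para_size has to match --n-inference-gpus from the conversion step; a minimal sketch, assuming the GPT-J example config shipped with this backend:

model_transaction_policy {
  decoupled: True
}
parameters {
  key: "tensor_para_size"
  value: { string_value: "2" }
}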

Streaming works, but the response is garbage and the context is missing from the text. The model works fine when streaming is disabled. Is there a special step or parameter I am missing that causes this issue in streaming?

@byshiue

byshiue commented 1 year ago

Please provide the script showing how you run streaming on GPT-J.
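
For reference, a minimal decoupled-mode streaming client built on tritonclient's gRPC streaming API might look like the sketch below; the tensor names (input_ids, input_lengths, request_output_len, output_ids) follow the GPT-J example config, and the server address and prompt token ids are placeholders:

import numpy as np
import tritonclient.grpc as grpcclient

def stream_callback(result, error):
    # Called once per streamed response; in decoupled mode the server
    # can send many responses for a single request.
    if error is not None:
        print("stream error:", error)
    else:
        print("partial output_ids:", result.as_numpy("output_ids"))

client = grpcclient.InferenceServerClient("localhost:8001")  # placeholder address

input_ids = np.array([[818, 262, 1110]], dtype=np.uint32)          # placeholder prompt tokens
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)  # actual prompt length
request_output_len = np.array([[32]], dtype=np.uint32)             # tokens to generate

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", request_output_len)]:
    t = grpcclient.InferInput(name, list(data.shape), "UINT32")
    t.set_data_from_numpy(data)
    inputs.append(t)

client.start_stream(callback=stream_callback)           # open the bidirectional stream
client.async_stream_infer("fastertransformer", inputs)  # responses arrive in the callback
client.stop_stream()                                    # close the stream when done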