triton-inference-server / fastertransformer_backend


GPT-J streaming: getting garbage response #91

Open · vax-dev opened this issue 1 year ago

vax-dev commented 1 year ago

Description

branch: main
fastertransformer docker: 22.12

Reproduced Steps

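# WORKSPACE and TRITON_DOCKER_IMAGE must already be exported in the host shell before this command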
docker run -it --rm --gpus=all --shm-size=1g --ulimit memlock=-1 -v ${WORKSPACE}:${WORKSPACE} -w ${WORKSPACE} ${TRITON_DOCKER_IMAGE} bash
# now in docker

export WORKSPACE=$(pwd)
export SRC_MODELS_DIR=${WORKSPACE}/models
git clone https://github.com/NVIDIA/FasterTransformer.git # used for converting the checkpoint and checking the Triton output
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P models
wget https://mystic.the-eye.eu/public/AI/GPT-J-6B/step_383500_slim.tar.zstd
mkdir ${SRC_MODELS_DIR}/gptj/ -p
tar -axf step_383500_slim.tar.zstd -C ${SRC_MODELS_DIR}/gptj/
pip install scipy
python3 ${WORKSPACE}/FasterTransformer/examples/pytorch/gptj/utils/gptj_ckpt_convert.py \
        --output-dir ${WORKSPACE}/all_models/gptj/fastertransformer/1 \
        --ckpt-dir ${SRC_MODELS_DIR}/gptj/step_383500/ \
        --n-inference-gpus 2
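
If the conversion succeeds, the weights should land in a 2-gpu subdirectory matching --n-inference-gpus 2; a quick sanity check (the directory layout is assumed from the converter's convention):

ls ${WORKSPACE}/all_models/gptj/fastertransformer/1/2-gpu/  # expect config.ini plus the converted weight files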

Enabled decoupled mode in config.pbtxt
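
For reference, decoupled mode is switched on through the model transaction policy in config.pbtxt, and tensor_para_size has to match --n-inference-gpus from the conversion step; a minimal sketch, assuming the GPT-J example config shipped with this backend:

model_transaction_policy {
  decoupled: True
}
parameters {
  key: "tensor_para_size"
  value: { string_value: "2" }
}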

Streaming works, but the response is garbage and the context is missing from the text. The model works fine when streaming is disabled. Is there a special step or parameter I am missing that causes this issue in streaming?

@byshiue

byshiue commented 1 year ago

Please provide the script showing how you run streaming on GPT-J.
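
For reference, a minimal decoupled-mode streaming client built on tritonclient's gRPC streaming API might look like the sketch below; the tensor names (input_ids, input_lengths, request_output_len, output_ids) follow the GPT-J example config, and the server address and prompt token ids are placeholders:

import numpy as np
import tritonclient.grpc as grpcclient

def stream_callback(result, error):
    # Called once per streamed response; in decoupled mode the server
    # can send many responses for a single request.
    if error is not None:
        print("stream error:", error)
    else:
        print("partial output_ids:", result.as_numpy("output_ids"))

client = grpcclient.InferenceServerClient("localhost:8001")  # placeholder address

input_ids = np.array([[818, 262, 1110]], dtype=np.uint32)          # placeholder prompt tokens
input_lengths = np.array([[input_ids.shape[1]]], dtype=np.uint32)  # actual prompt length
request_output_len = np.array([[32]], dtype=np.uint32)             # tokens to generate

inputs = []
for name, data in [("input_ids", input_ids),
                   ("input_lengths", input_lengths),
                   ("request_output_len", request_output_len)]:
    t = grpcclient.InferInput(name, list(data.shape), "UINT32")
    t.set_data_from_numpy(data)
    inputs.append(t)

client.start_stream(callback=stream_callback)           # open the bidirectional stream
client.async_stream_infer("fastertransformer", inputs)  # responses arrive in the callback
client.stop_stream()                                    # close the stream when done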