Lzhang-hub opened this issue 8 months ago
For `run.py`, do you mean running the tensorrt_llm Python runtime directly, without using the tensorrt_llm Triton backend?
Yes, the `run.py` in the TensorRT-LLM repo: https://github.com/NVIDIA/TensorRT-LLM/blob/release/0.5.0/examples/llama/run.py
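For reference, an invocation looks roughly like this (a sketch, not my exact command; the prompt and `--max_output_len` value are illustrative, and `--engine_dir` points at the engine built in the steps below):

```bash
# Run the engine directly through the TensorRT-LLM Python runtime,
# bypassing the Triton backend entirely.
python run.py --engine_dir ./codellama_7b-fp16 \
    --tokenizer_dir codellama/CodeLlama-7b-hf \
    --max_output_len 20 \
    --input_text "write a quick sort"
```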
Can you share end-to-end reproduction steps on the tensorrt_llm side? We have lots of issues and limited resources, so clear reproduction steps help us find the problem.
1. Build the engine:

```bash
python ../build.py --model_dir codellama/CodeLlama-7b-hf --dtype float16 \
    --remove_input_padding --use_gpt_attention_plugin float16 --use_gemm_plugin float16 \
    --output_dir ./codellama_7b-fp16 --rotary_base 1000000 --vocab_size 32016 \
    --use_weight_only --use_inflight_batching --paged_kv_cache
```
2. Set up the model repository:

```bash
cp -r all_models/inflight_batcher_llm/* triton_model_repo/
cp ./codellama_7b-fp16/* triton_model_repo/tensorrt_llm/1
```
Modified `tokenizer_dir` and `tokenizer_type` in triton_model_repo/preprocessing/config.pbtxt and triton_model_repo/postprocessing/config.pbtxt; set `decoupled` to `True`, `gpt_model_type` to `inflight_fused_batching`, and `gpt_model_path` to triton_model_repo/tensorrt_llm/1 (see the sketch below).
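The edits in the tensorrt_llm model's config.pbtxt look roughly like this (a sketch based on the all_models/inflight_batcher_llm template; only the changed fields are shown):

```
# Decoupled transaction policy: the backend may send zero or more
# responses per request (required for streaming).
model_transaction_policy {
  decoupled: True
}
parameters: {
  key: "gpt_model_type"
  value: { string_value: "inflight_fused_batching" }
}
parameters: {
  key: "gpt_model_path"
  value: { string_value: "triton_model_repo/tensorrt_llm/1" }
}
```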
3. Launch tritonserver:

```bash
tritonserver --model-repository=triton_model_repo
```
4. Send a request to the server over HTTP:

```bash
curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "write a quick sort", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
```
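Since `decoupled` is set to `True`, the streaming endpoint may also be relevant (an assumption on my part, based on Triton's generate extension; the `stream` field is the one defined by the inflight_batcher_llm models):

```bash
# Streaming variant of the request above; responses arrive as
# server-sent events instead of a single JSON body.
curl -X POST localhost:8000/v2/models/ensemble/generate_stream \
    -d '{"text_input": "write a quick sort", "max_tokens": 20, "bad_words": "", "stop_words": "", "stream": true}'
```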
What `tokenizer_type` do you use?
llama
Thanks for reporting this @Lzhang-hub. Investigating it. Will provide an update once I have a fix.
I launched tritonserver following the README with codellama-7b-hf, sent a request over HTTP, and get this result:
I launched `run.py` and get this result: