triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

The stop_words still does not work with the latest tensorrtllm_backend and TensorRT-LLM #128

Open activezhao opened 10 months ago

activezhao commented 10 months ago

First:

I download the latest tensorrtllm_backend from the main branch.

git clone -b main  https://github.com/triton-inference-server/tensorrtllm_backend.git

Second:

I execute the following commands to build a Docker image using the latest main branch of tensorrtllm_backend.

# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .

Third:

I get a docker image like this:

# docker images

REPOSITORY                    TAG                         IMAGE ID       CREATED         SIZE
triton_trt_llm               latest                      cc73de886a6d   5 hours ago     36GB

Fourth:

I launch a container from the triton_trt_llm image:

docker run -idt -p 8250:8000 -p 8251:8001 -p 8252:8002 --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /tensorrt:/tensorrtllm_backend triton_trt_llm /bin/sh

Fifth:

In the container, I execute the commands from the build-tensorrt-llm guide: https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/installation.md#build-tensorrt-llm

Sixth:

I build engines for CodeLlama-7b:

python build.py --model_dir /tensorrtllm_backend/tensorrtllm_backend/CodeLlama-7b-hf/  \
                --dtype float16 \
                --remove_input_padding \
                --use_gpt_attention_plugin float16 \
                --paged_kv_cache \
                --use_inflight_batching \
                --enable_context_fmha \
                --use_gemm_plugin float16 \
                --output_dir /tensorrtllm_backend/tensorrtllm_backend/trt_llama_7b_fp16_kv_cache_inflight_batching_stop/4-gpu/  \
                --vocab_size 32016  \
                --rotary_base 1000000  \
                --max_batch_size 32  \
                --world_size 4 \
                --tp_size 4

Finally:

I call the endpoint like this, and as we can see, the stop_words does not work.

curl --noproxy '*' -X POST localhost:8250/v2/models/ensemble/generate -d '{"text_input": "def quickSort", "max_tokens": 100, "bad_words": "", "stop_words": "quickSort"}'

{
    "model_name":"ensemble",
    "model_version":"1",
    "sequence_end":false,
    "sequence_id":0,
    "sequence_start":false,
    "text_output":"<s> def quickSort(arr):\n    if len(arr) <= 1:\n        return arr\n    else:\n        pivot = arr[0]\n        lesser = [x for x in arr[1:] if x <= pivot]\n        greater = [x for x in arr[1:] if x > pivot]\n        return quickSort(lesser) + [pivot] + quickSort(greater)\n\n\ndef quickSort2(arr):\n   "
}

I add log prints in preprocessing's model.py:

    def _to_word_list_format(self, word_dict: List[List[str]]):
        '''
        format of word_dict
            len(word_dict) should be same to batch_size
            word_dict[i] means the words for batch i
            len(word_dict[i]) must be 1, which means it only contains 1 string
            This string can contain several sentences, separated by ",".
            For example, if word_dict[2] = " I am happy, I am sad", then this function will return
            the ids for two short sentences " I am happy" and " I am sad".
        '''
        assert self.tokenizer != None, "need to set tokenizer"

        if word_dict is None:
            # Return an empty array of shape (1,2,0)
            return np.empty([1, 2, 0], dtype="int32")

        flat_ids = []
        offsets = []
        for word_dict_item in word_dict:
            item_flat_ids = []
            item_offsets = []

            if isinstance(word_dict_item[0], bytes):
                word_dict_item = [word_dict_item[0].decode()]

            words = list(csv.reader(word_dict_item))[0]
            for word in words:
                self.logger.log_info(f"================== preprocessing _to_word_list_format word: {word}")
                ids = self.tokenizer.encode(word)
                self.logger.log_info(f"================== preprocessing _to_word_list_format ids: {ids}")
                if len(ids) == 0:
                    continue

                item_flat_ids += ids
                item_offsets.append(len(ids))

And here are the preprocessing _to_word_list_format ids:

I1114 08:45:41.055726 24910 python_be.cc:1307] model preprocessing, instance preprocessing_0_0, executing 1 requests
I1114 08:45:41.084479 24910 model.py:255] ================== preprocessing _to_word_list_format word: quickSort
I1114 08:45:41.084553 24910 model.py:257] ================== preprocessing _to_word_list_format ids: [1, 4996, 13685]
I1114 08:45:41.084808 24910 infer_response.cc:167] add response output: output: INPUT_ID, type: INT32, shape: [1,4]
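
The leading id 1 in the encoded stop word is the llama BOS token. The generated tokens never contain it mid-sequence, so the stop sequence [1, 4996, 13685] can never match. A minimal sketch (assuming the Hugging Face CodeLlama tokenizer used above) of the difference:

from transformers import AutoTokenizer

# encode() prepends the BOS token (id 1) by default, so the encoded stop
# word picks up a token that never appears in the generated output.
tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

print(tokenizer.encode("quickSort"))                            # [1, 4996, 13685]
print(tokenizer.encode("quickSort", add_special_tokens=False))  # [4996, 13685]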

I add log prints in postprocessing's model.py:

    def _postprocessing(self, tokens_batch, sequence_lengths):
        outputs = []
        for batch_idx, beam_tokens in enumerate(tokens_batch):
            for beam_idx, tokens in enumerate(beam_tokens):
                self.logger.log_info(f"================== postprocessing _postprocessing tokens: {tokens}")
                seq_len = sequence_lengths[batch_idx][beam_idx]
                output = self.tokenizer.decode(tokens[:seq_len])
                self.logger.log_info(f"================== postprocessing _postprocessing tokens[:seq_len]: {tokens[:seq_len]}")
                self.logger.log_info(f"================== postprocessing _postprocessing output: {output}")
                outputs.append(output.encode('utf8'))
        return outputs

And here is the postprocessing _postprocessing output:

I1114 08:45:42.255417 24910 model.py:156] ================== postprocessing _postprocessing tokens: [    1   822  4996 13685 29898  2749  1125    13  1678   565  7431 29898
  2749 29897  5277 29871 29896 29901    13  4706   736  3948    13  1678
  1683 29901    13  4706 24438   353  3948 29961 29900 29962    13  4706
  3109   261   353   518 29916   363   921   297  3948 29961 29896 17531
   565   921  5277 24438 29962    13  4706  7621   353   518 29916   363
   921   297  3948 29961 29896 17531   565   921  1405 24438 29962    13
  4706   736  4996 13685 29898  2222   261 29897   718   518 29886 11002
 29962   718  4996 13685 29898  7979  1008 29897    13    13    13  1753
  4996 13685 29906 29898  2749  1125    13  1678]
I1114 08:45:42.255774 24910 model.py:159] ================== postprocessing _postprocessing tokens[:seq_len]: [    1   822  4996 13685 29898  2749  1125    13  1678   565  7431 29898
  2749 29897  5277 29871 29896 29901    13  4706   736  3948    13  1678
  1683 29901    13  4706 24438   353  3948 29961 29900 29962    13  4706
  3109   261   353   518 29916   363   921   297  3948 29961 29896 17531
   565   921  5277 24438 29962    13  4706  7621   353   518 29916   363
   921   297  3948 29961 29896 17531   565   921  1405 24438 29962    13
  4706   736  4996 13685 29898  2222   261 29897   718   518 29886 11002
 29962   718  4996 13685 29898  7979  1008 29897    13    13    13  1753
  4996 13685 29906 29898  2749  1125    13  1678]
I1114 08:45:42.255801 24910 model.py:160] ================== postprocessing _postprocessing output: <s> def quickSort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        lesser = [x for x in arr[1:] if x <= pivot]
        greater = [x for x in arr[1:] if x > pivot]
        return quickSort(lesser) + [pivot] + quickSort(greater)

def quickSort2(arr):

As we can see, the stop_words tokens [4996, 13685] appear in the postprocessing's output tokens, but the inference does not stop early.
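
For reference, the tensor that _to_word_list_format builds has shape (batch_size, 2, max_len): row 0 holds the concatenated token ids of all stop sequences for a batch item, and row 1 holds the cumulative end offset of each sequence, padded with 0 and -1 respectively. A small sketch using the ids from the logs above:

import numpy as np

# Stop-word tensor layout, shape (batch_size, 2, max_len).
# Row 0: concatenated token ids of all stop sequences, padded with 0.
# Row 1: cumulative end offsets of the sequences, padded with -1.
# For the single (mis-encoded) stop word "quickSort" -> [1, 4996, 13685]:
stop_words = np.array([[[1, 4996, 13685],
                        [3, -1, -1]]], dtype="int32")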

activezhao commented 10 months ago

Update:

I add add_special_tokens=False to the tokenizer calls.

I change ids = self.tokenizer.encode(word) to ids = self.tokenizer.encode(word, add_special_tokens=False)

And I change output = self.tokenizer.decode(tokens[:seq_len]) to output = self.tokenizer.decode(tokens[:seq_len], add_special_tokens=False)

Now, it works!!

I1114 09:26:38.329280 27908 model.py:275] ================== preprocessing _to_word_list_format flat_ids: [array([0.])]
I1114 09:26:38.329402 27908 model.py:276] ================== preprocessing _to_word_list_format offsets: [array([-1.])]
I1114 09:26:38.329457 27908 model.py:256] ================== preprocessing _to_word_list_format word: greater
I1114 09:26:38.329624 27908 model.py:258] ================== preprocessing _to_word_list_format ids: [7621]
I1114 09:26:38.329814 27908 model.py:275] ================== preprocessing _to_word_list_format flat_ids: [array([7621])]
I1114 09:26:38.329909 27908 model.py:276] ================== preprocessing _to_word_list_format offsets: [array([1])]
I1114 09:26:38.330031 27908 model.py:169] ================== preprocessing execute stop_words: [[[7621]
  [   1]]]
curl --noproxy '*' -X POST localhost:8250/v2/models/ensemble/generate -d '{"text_input": "def quickSort", "max_tokens": 150, "bad_words": "", "stop_words": "greater"}'

{
    "model_name":"ensemble",
    "model_version":"1",
    "sequence_end":false,
    "sequence_id":0,
    "sequence_start":false,
    "text_output":"<s> def quickSort(arr):\n    if len(arr) <= 1:\n        return arr\n    else:\n        pivot = arr[0]\n        lesser = [x for x in arr[1:] if x <= pivot]\n        greater"
}
curl --noproxy '*' -X POST localhost:8250/v2/models/ensemble/generate -d '{"text_input": "def quickSort", "max_tokens": 150, "bad_words": "", "stop_words": "lesser"}'

{
    "model_name":"ensemble",
    "model_version":"1",
    "sequence_end":false,
    "sequence_id":0,
    "sequence_start":false,
    "text_output":"<s> def quickSort(arr):\n    if len(arr) <= 1:\n        return arr\n    else:\n        pivot = arr[0]\n        lesser"
}
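
Note that the responses above still begin with "<s>". In the Hugging Face API, decode() strips special tokens via skip_special_tokens rather than add_special_tokens, so a sketch of what would actually remove the BOS marker (my assumption, not what was tested above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

tokens = [1, 822, 4996, 13685]  # "<s> def quickSort", ids from the logs above
print(tokenizer.decode(tokens))                            # <s> def quickSort
print(tokenizer.decode(tokens, skip_special_tokens=True))  # def quickSort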
activezhao commented 10 months ago

But there is still a problem: when the stop_words is "\n", it does not work; the inference does not stop early.

curl --noproxy '*' -X POST localhost:8250/v2/models/ensemble/generate -d '{"text_input": "def quickSort", "max_tokens": 150, "bad_words": "", "stop_words": "\n"}'

We can see from the log and the code that self.logger.log_info(f"================== preprocessing _to_word_list_format word: {word}") is never executed. I guess that after words = list(csv.reader(word_dict_item))[0], words is empty.

And the final stop_words is [[[ 0] [-1]]]. The first array [0] should hold the token of "\n" (such as [13] in llama), and the [-1] should be the offset [1]. Because of this, the stop_words does not work.

I1115 01:31:19.956018 41237 model.py:275] ================== preprocessing _to_word_list_format flat_ids: [array([0.])]
I1115 01:31:19.956226 41237 model.py:276] ================== preprocessing _to_word_list_format offsets: [array([-1.])]
I1115 01:31:19.956481 41237 model.py:275] ================== preprocessing _to_word_list_format flat_ids: [array([0.])]
I1115 01:31:19.956650 41237 model.py:276] ================== preprocessing _to_word_list_format offsets: [array([-1.])]
I1115 01:31:19.956804 41237 model.py:169] ================== preprocessing execute stop_words: [[[ 0]
  [-1]]]
        for word_dict_item in word_dict:
            item_flat_ids = []
            item_offsets = []

            if isinstance(word_dict_item[0], bytes):
                word_dict_item = [word_dict_item[0].decode()]

            words = list(csv.reader(word_dict_item))[0]
            for word in words:
                self.logger.log_info(f"================== preprocessing _to_word_list_format word: {word}")
                ids = self.tokenizer.encode(word)
                self.logger.log_info(f"================== preprocessing _to_word_list_format ids: {ids}")
                if len(ids) == 0:
                    continue

How to resolve this?

mickaelseznec commented 10 months ago

In your query, it looks like \n isn't escaped with quotes for CSV reader to parse it correctly.

It's a bit cumbersome to have all the special characters parsed correctly with bash -> json -> csv. But wouldn't something like '[...] "stop_words": "\"\\n\""}' work?
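
For what it's worth, a quick check in plain Python (my own sketch) shows the parsing behavior: a bare newline is read as a blank CSV record, so the word loop never runs, while a quoted newline survives as a real one-character word:

import csv

# A bare "\n" is parsed as a blank record -> zero words extracted,
# which matches the [[[ 0] [-1]]] padding seen in the logs above.
print(list(csv.reader(["\n"])))    # [[]]
# Quoting the field keeps the newline as an actual one-character word.
print(list(csv.reader(['"\n"'])))  # [['\n']]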

activezhao commented 10 months ago

> In your query, it looks like \n isn't escaped with quotes for CSV reader to parse it correctly.
>
> It's a bit cumbersome to have all the special characters parsed correctly with bash -> json -> csv. But wouldn't something like '[...] "stop_words": "\"\\n\""}' work?

Hi @mickaelseznec, I have a question: why do we choose to use words = list(csv.reader(word_dict_item))[0]?

I wonder if we can use numpy directly instead?

Here is the code I changed:

    def _to_word_list_format(self, word_dict: List[List[str]]):
        assert self.tokenizer is not None, "need to set tokenizer"
        if word_dict.size == 0:
            # Return an empty array of shape (1, 2, 0)
            return np.empty([1, 2, 0], dtype="int32")

        flat_ids = []
        offsets = []
        for word_dict_item in word_dict:
            item_flat_ids = []
            item_offsets = []

            if isinstance(word_dict_item[0], bytes):
                word_dict_item = [item.decode() for item in word_dict_item]

            for word in word_dict_item:
                # No csv parsing: each array element is already one stop word.
                # add_special_tokens=False keeps the BOS token out of the ids.
                ids = self.tokenizer.encode(word, add_special_tokens=False)
                # The llama tokenizer may prepend a spurious space token
                # (29871) when encoding a bare word; drop it.
                if "llama" in str(type(self.tokenizer)) and len(ids) > 0 and ids[0] == 29871:
                    ids = ids[1:]

                if len(ids) == 0:
                    continue

                item_flat_ids += ids
                item_offsets.append(len(ids))

            flat_ids.append(np.array(item_flat_ids))
            offsets.append(np.cumsum(np.array(item_offsets)))

        pad_to = max(1, max(len(ids) for ids in flat_ids))

        # Pad every batch item to the same length: ids with 0, offsets with -1.
        for i, (ids, offs) in enumerate(zip(flat_ids, offsets)):
            flat_ids[i] = np.pad(ids, (0, pad_to - len(ids)),
                                 constant_values=0)
            offsets[i] = np.pad(offs, (0, pad_to - len(offs)),
                                constant_values=-1)

        # Stack and transpose to shape (batch_size, 2, pad_to).
        return np.array([flat_ids, offsets], dtype="int32").transpose(
            (1, 0, 2))

I remove the words = list(csv.reader(word_dict_item))[0] line and add special handling for the llama tokenizer.
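
As a hypothetical spot-check of the rewrite (the id for "greater" comes from the logs above; the id for "pivot" is inferred from the multi-word test below), two one-token stop words in one batch item should give a (1, 2, 2) tensor:

import numpy as np

# Input as the ensemble delivers it: one batch item, two byte strings.
word_dict = np.array([[b"greater", b"pivot"]], dtype=object)

# Expected _to_word_list_format(word_dict) result:
# row 0 = concatenated token ids, row 1 = cumulative end offsets.
expected = np.array([[[7621, 24438],
                      [1, 2]]], dtype="int32")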

And I tested some cases.

stop_words is an empty string:

curl --location 'http://localhost:8000/v2/models/ensemble/generate' --header 'Content-Type: application/json' --data '{
    "text_input": "def quickSort", 
    "max_tokens": 100, 
    "bad_words": "", 
    "stop_words": ""
}'

{
    "model_name":"ensemble",
    "model_version":"1",
    "sequence_end":false,
    "sequence_id":0,
    "sequence_start":false,
    "text_output":"{
        "id":"cmpl-72f80645-a4fd-4746-bc92-d27f9bdbe821",
        "object":"text_completion",
        "created":1700134036,
        "model":"ensemble",
        "choices":[
            {
                "index":0,
                "text":"(arr):\n    if len(arr) <= 1:\n        return arr\n    else:\n        pivot = arr[0]\n        lesser = [x for x in arr[1:] if x <= pivot]\n        greater = [x for x in arr[1:] if x > pivot]\n        return quickSort(lesser) + [pivot] + quickSort(greater)\n\n\ndef quickSort2(arr):\n   ",
                "logprobs":{
                    "text_offset":[

                    ],
                    "token_logprobs":[

                    ],
                    "tokens":[
                        "29898",
                        "2749",
                        "1125",
                        "13",
                        "1678",
                        "565",
                        "7431",
                        "29898",
                        "2749",
                        "29897",
                        "5277",
                        "29871",
                        "29896",
                        "29901",
                        "13",
                        "4706",
                        "736",
                        "3948",
                        "13",
                        "1678",
                        "1683",
                        "29901",
                        "13",
                        "4706",
                        "24438",
                        "353",
                        "3948",
                        "29961",
                        "29900",
                        "29962",
                        "13",
                        "4706",
                        "3109",
                        "261",
                        "353",
                        "518",
                        "29916",
                        "363",
                        "921",
                        "297",
                        "3948",
                        "29961",
                        "29896",
                        "17531",
                        "565",
                        "921",
                        "5277",
                        "24438",
                        "29962",
                        "13",
                        "4706",
                        "7621",
                        "353",
                        "518",
                        "29916",
                        "363",
                        "921",
                        "297",
                        "3948",
                        "29961",
                        "29896",
                        "17531",
                        "565",
                        "921",
                        "1405",
                        "24438",
                        "29962",
                        "13",
                        "4706",
                        "736",
                        "4996",
                        "13685",
                        "29898",
                        "2222",
                        "261",
                        "29897",
                        "718",
                        "518",
                        "29886",
                        "11002",
                        "29962",
                        "718",
                        "4996",
                        "13685",
                        "29898",
                        "7979",
                        "1008",
                        "29897",
                        "13",
                        "13",
                        "13",
                        "1753",
                        "4996",
                        "13685",
                        "29906",
                        "29898",
                        "2749",
                        "1125",
                        "13",
                        "1678"
                    ],
                    "top_logprobs":[

                    ]
                },
                "finish_reason":"length"
            }
        ],
        "usage":{
            "prompt_tokens":4,
            "total_tokens":104,
            "completion_tokens":100
        }
    }"
}

stop_words is "\n":

curl --location 'http://localhost:8000/v2/models/ensemble/generate' --header 'Content-Type: application/json' --data '{
    "text_input": "def quickSort", 
    "max_tokens": 100, 
    "bad_words": "", 
    "stop_words": "\n"
}'

{
    "model_name":"ensemble",
    "model_version":"1",
    "sequence_end":false,
    "sequence_id":0,
    "sequence_start":false,
    "text_output":"{
        "id":"cmpl-081ebbea-fd5e-4604-8a09-d19a561776d1",
        "object":"text_completion",
        "created":1700134187,
        "model":"ensemble",
        "choices":[
            {
                "index":0,
                "text":"(arr):\n",
                "logprobs":{
                    "text_offset":[

                    ],
                    "token_logprobs":[

                    ],
                    "tokens":[
                        "29898",
                        "2749",
                        "1125",
                        "13"
                    ],
                    "top_logprobs":[

                    ]
                },
                "finish_reason":"length"
            }
        ],
        "usage":{
            "prompt_tokens":4,
            "total_tokens":8,
            "completion_tokens":4
        }
    }"
}

More than one word:

If stop_words has more than one word, I use an array:

curl --location 'http://localhost:8000/v2/models/ensemble/generate' --header 'Content-Type: application/json' --data '{
    "text_input": "def quickSort", 
    "max_tokens": 100, 
    "bad_words": "", 
    "stop_words": ["greater", "pivot"]
}'

{
    "model_name":"ensemble",
    "model_version":"1",
    "sequence_end":false,
    "sequence_id":0,
    "sequence_start":false,
    "text_output":"{
        "id":"cmpl-2ca50844-b6f4-413f-b986-f8cb400e2428",
        "object":"text_completion",
        "created":1700134388,
        "model":"ensemble",
        "choices":[
            {
                "index":0,
                "text":"(arr):\n    if len(arr) <= 1:\n        return arr\n    else:\n        pivot",
                "logprobs":{
                    "text_offset":[
                    ],
                    "token_logprobs":[

                    ],
                    "tokens":[
                        "29898",
                        "2749",
                        "1125",
                        "13",
                        "1678",
                        "565",
                        "7431",
                        "29898",
                        "2749",
                        "29897",
                        "5277",
                        "29871",
                        "29896",
                        "29901",
                        "13",
                        "4706",
                        "736",
                        "3948",
                        "13",
                        "1678",
                        "1683",
                        "29901",
                        "13",
                        "4706",
                        "24438"
                    ],
                    "top_logprobs":[

                    ]
                },
                "finish_reason":"length"
            }
        ],
        "usage":{
            "prompt_tokens":4,
            "total_tokens":29,
            "completion_tokens":25
        }
    }"
}

What do you think?

Thanks.

mickaelseznec commented 10 months ago

Sure, that makes sense. We'll add similar behavior in an upcoming update.

And keep in mind, the ensemble model is basically an example for people to build upon. You can customize it at will to suit your needs 🙂

activezhao commented 10 months ago

> Sure, that makes sense. We'll add similar behavior in an upcoming update.
>
> And keep in mind, the ensemble model is basically an example for people to build upon. You can customize it at will to suit your needs 🙂

@mickaelseznec OK, hope it gets better.😎

shatealaboxiaowang commented 9 months ago

> Sure, that makes sense. We'll add similar behavior in an upcoming update.
>
> And keep in mind, the ensemble model is basically an example for people to build upon. You can customize it at will to suit your needs 🙂

Hi, is there a solution for stop_words="\n"? I tried it on the latest version and it still didn't work. Thank you.

activezhao commented 9 months ago

> Sure, that makes sense. We'll add similar behavior in an upcoming update.
>
> And keep in mind, the ensemble model is basically an example for people to build upon. You can customize it at will to suit your needs 🙂
>
> Hi, is there a solution for stop_words="\n"? I tried it on the latest version and it still didn't work. Thank you.

Hi @shatealaboxiaowang, you can just try this reply above:

https://github.com/triton-inference-server/tensorrtllm_backend/issues/128#issuecomment-1814276748

shatealaboxiaowang commented 9 months ago

> Sure, that makes sense. We'll add similar behavior in an upcoming update. And keep in mind, the ensemble model is basically an example for people to build upon. You can customize it at will to suit your needs 🙂
>
> Hi, is there a solution for stop_words="\n"? I tried it on the latest version and it still didn't work. Thank you.
>
> Hi @shatealaboxiaowang, you can just try this reply above:
>
> #128 (comment)

Thank you, great!

MrD005 commented 9 months ago

@activezhao @shatealaboxiaowang are you getting this same issue?

#233

shatealaboxiaowang commented 8 months ago

> In your query, it looks like \n isn't escaped with quotes for CSV reader to parse it correctly.
>
> It's a bit cumbersome to have all the special characters parsed correctly with bash -> json -> csv. But wouldn't something like '[...] "stop_words": "\"\\n\""}' work?

I am wondering why your response contains the following fields: "finish_reason":"length" and "usage":{"prompt_tokens":4, "total_tokens":8, "completion_tokens":4}. How do you customize the content and format of the returned fields on the server side?

activezhao commented 8 months ago

> In your query, it looks like \n isn't escaped with quotes for CSV reader to parse it correctly.
>
> It's a bit cumbersome to have all the special characters parsed correctly with bash -> json -> csv. But wouldn't something like '[...] "stop_words": "\"\\n\""}' work?
>
> I am wondering why your response contains the following fields: "finish_reason":"length" and "usage":{"prompt_tokens":4, "total_tokens":8, "completion_tokens":4}. How do you customize the content and format of the returned fields on the server side?

@shatealaboxiaowang We just changed the model.py file and added code for the OpenAI response format.
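
For what it's worth, here is a minimal sketch of the kind of change described, assuming a helper added to postprocessing's model.py; the helper name and wiring are illustrative, not from the repo:

import json
import time
import uuid

# Hypothetical helper that wraps the decoded text in an OpenAI-style
# text_completion payload, mirroring the response fields pasted above.
def build_openai_response(text, prompt_tokens, completion_tokens,
                          finish_reason="length"):
    payload = {
        "id": f"cmpl-{uuid.uuid4()}",
        "object": "text_completion",
        "created": int(time.time()),
        "model": "ensemble",
        "choices": [{
            "index": 0,
            "text": text,
            "finish_reason": finish_reason,
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
            "completion_tokens": completion_tokens,
        },
    }
    return json.dumps(payload).encode("utf8")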

shatealaboxiaowang commented 8 months ago

> In your query, it looks like \n isn't escaped with quotes for CSV reader to parse it correctly. It's a bit cumbersome to have all the special characters parsed correctly with bash -> json -> csv. But wouldn't something like '[...] "stop_words": "\"\\n\""}' work?
>
> I am wondering why your response contains the following fields: "finish_reason":"length" and "usage":{"prompt_tokens":4, "total_tokens":8, "completion_tokens":4}. How do you customize the content and format of the returned fields on the server side?
>
> @shatealaboxiaowang We just changed the model.py file and added code for the OpenAI response format.

Thank you for your reply. In postprocessing/1/model.py I changed the source code like this:

inference_response = pb_utils.InferenceResponse(output_tensors=[output_tensor, out_cum_log_probs, out_output_log_probs, out_sequence_lengths])

I added the out_sequence_lengths field to the inference_response, but it doesn't take effect. Can you tell me how you changed to the OpenAI format? In which model.py file did you make the change, and how did you change the source code?