jfpichlme opened this issue 9 months ago
We are experiencing the same issue.
For now I am using a workaround that is probably not ideal. In the postprocessing script (/postprocessing/1/model.py) I changed the _postprocessing function to return the actual token ids.
def _postprocessing(self, tokens_batch, sequence_lengths):
    outputs = []
    for batch_idx, beam_tokens in enumerate(tokens_batch):
        for beam_idx, tokens in enumerate(beam_tokens):
            seq_len = sequence_lengths[batch_idx][beam_idx]
            output = tokens[:seq_len]
            outputs.append(output)
    return outputs
I collect all the token IDs on the client side and then decode the entire sequence, which produces the correct output.
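Roughly what the client side looks like for me; the output tensor name "OUTPUT" and the tokenizer path are assumptions here, adjust them to your setup:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # example path
collected_ids = []

def on_stream_response(result):
    # Callback for each streamed gRPC response; "OUTPUT" is the tensor name I
    # assume the modified postprocessing model uses for the raw token ids.
    ids = result.as_numpy("OUTPUT")
    if ids is not None:
        collected_ids.extend(int(i) for i in ids.flatten())

# After the stream has finished, decode everything in one call so the
# tokenizer can restore the whitespace between words correctly.
final_text = tokenizer.decode(collected_ids, skip_special_tokens=True)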
The tokenizers in transformers do not handle this automatically when calling the decode function.
The standard way of dealing with this is to hold tokens in a cache until a space is detected, at which point everything after the space is put back into the cache; a rough sketch of this first method is below.
The other suggested method decodes the token-id text instead of the string text and looks for a "▁" symbol.
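A minimal sketch of the first (buffer-until-space) method, assuming a transformers tokenizer; the IncrementalDecoder class and its names are illustrative and not part of the Triton postprocessing model:

from transformers import AutoTokenizer

class IncrementalDecoder:
    """Buffer token ids and only emit text up to the last space, so a word
    split across several tokens is never flushed half-decoded."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.buffer = []  # token ids that have not been emitted as text yet

    def push(self, token_id):
        self.buffer.append(token_id)
        text = self.tokenizer.decode(self.buffer, skip_special_tokens=True)
        last_space = text.rfind(" ")
        if last_space == -1:
            return ""  # no word boundary yet, keep buffering
        emitted = text[: last_space + 1]
        # Everything after the space goes back into the cache (re-encoded).
        self.buffer = self.tokenizer.encode(
            text[last_space + 1:], add_special_tokens=False)
        return emitted

    def flush(self):
        # Emit whatever is left once the stream has ended.
        text = self.tokenizer.decode(self.buffer, skip_special_tokens=True)
        self.buffer = []
        return text

# Usage (model path is just an example):
# decoder = IncrementalDecoder(AutoTokenizer.from_pretrained("<your model>"))
# for tid in streamed_token_ids: print(decoder.push(tid), end="")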
Here is a workaround using the second method:
def _postprocessing(self, tokens_batch, sequence_lengths):
    outputs = []
    for batch_idx, beam_tokens in enumerate(tokens_batch):
        for beam_idx, tokens in enumerate(beam_tokens):
            seq_len = sequence_lengths[batch_idx][beam_idx]
            output = self.tokenizer.decode(
                tokens[:seq_len],
                skip_special_tokens=False)
            # For streaming mode: if the first token carries the SentencePiece
            # whitespace marker, prepend the missing space to the decoded text.
            token_id_string = self.tokenizer.convert_ids_to_tokens(
                tokens[:seq_len], skip_special_tokens=True)[0]
            if token_id_string[0] == "▁":
                output = " " + output
            outputs.append(output.encode('utf8'))
    return outputs
@Shixiaowei02 I can create a PR for this
Have you tried the tensorrt_llm_bls module?
btw @jfpichlme, how did you get TensorRT-LLM working with the new workflow, specifically with the trtllm-build command? Which docker command and version of tensorrtllm_backend did you use?
Hi enochlev,
I have used Option 2 in the tensorrt-llm backend repo to build the docker container:
# Update the submodules
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive
# Use the Dockerfile to build the backend in a container
# For x86_64
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .
The docker version is: "22.04" and the tensorrt_llm git version is TensorRT-LLM backend (#324).
This process now consists of two steps: first a convert_checkpoint step, then a build step.
Perform the Conversion step
The conversion step is done via the following command:
python convert_checkpoint.py --model_dir $Enter huggingface format model dir \
--output_dir ./tllm_checkpoint_1gpu_fp16 \
--dtype float16 \
--tp_size 8
Perform the Build step:
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_fp16 \
             --output_dir $Put your directory here \
             --gpt_attention_plugin float16 \
             --gemm_plugin float16 \
             --remove_input_padding enable \
             --paged_kv_cache enable \
             --enable_xqa enable \
             --max_batch_size 300
The last step is to copy the created engine files to the tensorrt_llm/1/ directory and adapt the config files. You can see the model configs in my initial comment.
I hope this helps you. @byshiue I will test the tensorrt_llm_bls module now.
@jfpichlme any luck with BLS + streaming? I have the same problem and for some reason can't get my gRPC client to work with BLS.
Hi ekarmazin, BLS + streaming did not work for me. At the moment I am sticking to the proposed solution where I modify the postprocessing script and buffer the tokens on the user side. To approximate streaming (displaying word-by-word output), I decode around 6-10 tokens at a time, roughly as sketched below. However, this does not work perfectly all the time.
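Roughly what I mean by that; the chunk size and tokenizer path are just examples:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # example path
CHUNK_SIZE = 8   # decode roughly every 6-10 streamed tokens
pending_ids = []

def on_token(token_id):
    # Collect streamed token ids and emit text one chunk at a time.
    pending_ids.append(token_id)
    if len(pending_ids) >= CHUNK_SIZE:
        # Decoding a whole chunk usually preserves the inner whitespace, but a
        # word split across two chunks can still be glued together wrongly,
        # which is why this does not work perfectly all the time.
        print(tokenizer.decode(pending_ids, skip_special_tokens=True),
              end="", flush=True)
        pending_ids.clear()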
@jfpichlme I kind of got it working with BLS; it produces proper output with whitespaces now. But I faced accuracy problems when enabling --use_paged_context_fmha, though that is a different issue.
@byshiue same issue with the bls model. Spaces are present when accumulate_tokens is true, and missing when false.
@enochlev apologies for the delayed response. Would you still be able to PR the fix you suggested?
Any update on this?
I will find some time around work this week and push an update
mark
Mark
Any update?
Mark
Mark
Mark
@enochlev crash if the last token is EOS, a quick fix:
token_id_string = self.tokenizer.convert_ids_to_tokens(tokens[:seq_len], skip_special_tokens=True)
if len(token_id_string) > 0 and len(token_id_string[0]) > 0 and token_id_string[0][0] == "▁":
    output = " " + output
@elinx Really appreciate catching that...
I just submitted a PR including your suggestion. It worked in my local environment before I submitted the PR, so it has my approval (if that means anything 😁)
System Info
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
1. Set up Llama 2 (7b, 13b, 70b) in streaming mode:
model_config:
preprocessing:
postprocessing:
ensemble:
2. Use the Nvidia client notebook (the install does not work, but downloading langchain_nvidia_trt.llms directly solves the problem)
(I have also written my own grpc client which produces the same output)
3. Send an inference request via gRPC to the Triton server (a rough client sketch is below)
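For reference, a minimal sketch of the gRPC streaming client; the model and tensor names ("ensemble", "text_input", "max_tokens", "stream", "text_output") are what I assume from the default tensorrtllm_backend configs, so adjust them to your config.pbtxt:

import queue
import numpy as np
import tritonclient.grpc as grpcclient

results = queue.Queue()

def callback(result, error):
    # Every streamed response (or error) lands in the queue.
    results.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")

inputs = [
    grpcclient.InferInput("text_input", [1, 1], "BYTES"),
    grpcclient.InferInput("max_tokens", [1, 1], "INT32"),
    grpcclient.InferInput("stream", [1, 1], "BOOL"),
]
inputs[0].set_data_from_numpy(np.array([["What is machine learning?"]], dtype=object))
inputs[1].set_data_from_numpy(np.array([[128]], dtype=np.int32))
inputs[2].set_data_from_numpy(np.array([[True]], dtype=bool))

client.start_stream(callback=callback)
client.async_stream_infer(model_name="ensemble", inputs=inputs)
# In a real client, wait for the final response before closing the stream.
client.stop_stream()

while not results.empty():
    item = results.get()
    if isinstance(item, Exception):
        raise item
    print(item.as_numpy("text_output"))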
Expected behavior
Produce output tokens including whitespace:
actual behavior
Triton produces output tokens without whitespace:
additional notes
I am not sure whether this is a bug or whether I am missing a flag. Any help is highly appreciated.
Model build: