triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

tensorrt_llm_bls disregards top_k / temperature setting #472

Open janpetrov opened 1 month ago

janpetrov commented 1 month ago

System Info

Triton + TRT-LLM 0.9.0, llama2 70b model, fp8 quantization, run on 2x H100 80GB, tp 2, pp 1.

config.pbtxt for tensorrt_llm_bls (otherwise unchanged):

parameters: {
  key: "accumulate_tokens"
  value: {
    string_value: "true"
  }
}
parameters: {
  key: "tensorrt_llm_model_name"
  value: {
    string_value: "tensorrt_llm"
  }
}

Who can help?

No response

Reproduction

curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "temperature": 100.0, "top_k": 100}'

output:

{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"tensorrt_llm_bls","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"text_output":"- Quora\nMachine learning is a field of artificial intelligence which enables machines to learn without being specifically"}
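A quick way to check whether the sampling parameters are actually applied is to send the same request twice: with temperature 100.0 and top_k 100, two completions should almost never be identical, so matching outputs suggest the BLS model is dropping the parameters. Below is a minimal stdlib-only sketch of that check, assuming Triton's HTTP generate endpoint on localhost:8000 as in the curl commands above (the function names are mine, not part of the backend):

```python
import json
import urllib.request


def build_payload(temperature: float = 100.0, top_k: int = 100) -> dict:
    """Request body mirroring the curl reproduction above."""
    return {
        "text_input": "What is machine learning?",
        "max_tokens": 20,
        "bad_words": "",
        "stop_words": "",
        "temperature": temperature,
        "top_k": top_k,
    }


def generate(model: str, payload: dict, host: str = "localhost:8000") -> str:
    """POST to Triton's /v2/models/<model>/generate endpoint, return text_output."""
    req = urllib.request.Request(
        f"http://{host}/v2/models/{model}/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text_output"]


# Example (requires a running server):
#   a = generate("tensorrt_llm_bls", build_payload())
#   b = generate("tensorrt_llm_bls", build_payload())
#   # Identical outputs at temperature 100 / top_k 100 point to the bug.
#   print("identical outputs:", a == b)
```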

Expected behavior

Given temperature 100.0 and top_k 100, one would expect a nonsensical answer, not the canonical one.

actual behavior

See the reproduction output above: the completion is coherent and canonical, as if temperature and top_k had been ignored.

additional notes

The ensemble model works as expected. I sent the following request to the same running engine just a few seconds after the tensorrt_llm_bls request above:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "temperature": 100.0, "top_k": 100}'

output:

{"model_name":"ensemble","model_version":"1","sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"Crussischroo-Nau ™ Technocrord! Evaluuj poddano"}