triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

How to post sample parameters (like top_k, temperature) for triton http server #436

Closed wanzhenchn closed 1 month ago

wanzhenchn commented 2 months ago

System Info

GPU A30 (32GB)

Who can help?

@byshiue @schetlur-nv

Information

Tasks

Reproduction

How do I post sampling parameters such as top_k and temperature to the Triton HTTP server? The response is just the same as the input prompt.

import json
import requests

response = requests.post(
    url="http://localhost:8000/v2/models/ensemble/generate",
    data=json.dumps({
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 3,
        "temperature": 0.7,
        "stream": True,
        "text_input": "what is the machine learning?"
    }),
    stream=False,
)
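Hand-writing JSON with `json.dumps` around a dict works, but a safer route is to let `requests` serialize the dict itself via the `json=` argument of `requests.post`, which sidesteps manual-JSON mistakes such as missing or trailing commas. A minimal sketch (the endpoint URL and field names are taken from the snippet above; only payload construction and validation are executed here, since no server is assumed):

```python
import json

# Sampling parameters as a plain Python dict; with a live server you could
# post it directly via requests.post(url, json=payload) and requests would
# serialize it for you, avoiding hand-written JSON errors.
payload = {
    "text_input": "what is the machine learning?",
    "max_tokens": 128,
    "top_p": 0.95,
    "top_k": 3,
    "temperature": 0.7,
    "stream": True,
}

# Round-trip through the json module to confirm the payload is valid JSON.
body = json.dumps(payload)
parsed = json.loads(body)
print(sorted(parsed.keys()))
```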

Expected behavior

No

actual behavior

No

additional notes

No

byshiue commented 1 month ago

Hi. I don't get your point. Could you explain more? It would be better to share end-to-end reproduction steps.

wanzhenchn commented 1 month ago

> Hi. I don't get your point. Could you explain more? It would be better to share end-to-end reproduction steps.

How do I post the sampling parameters top_k and temperature in the JSON data dict of requests.post() for the Triton server?

I found that the server only accepts top_p and repetition_penalty; if I forcefully pass in top_k and temperature, the output is just the same as the input prompt. @byshiue

import json
import requests

response = requests.post(
    url="http://localhost:8000/v2/models/ensemble/generate",
    data=json.dumps({
        "max_tokens": 128,
        "top_p": 0.95,
        "repetition_penalty": 1.15,
        "stream": True,
        "text_input": "what is the machine learning?"
    }),
    stream=False,
)

byshiue commented 1 month ago

I have tested different parameters on a llama model and got different results:

$ curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms that can learn"}

$ curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_k": 16}'
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"The answer to this question depends on who you ask and when it was asked.\nIt has its"}

$ curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2, "top_k": 16, "repetition_penalty": 2.0}'
{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"Why do we need it and how can you use in your applications\nThe term “machine” often"}
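The curl calls above can also be issued from Python. The sketch below builds the same request body as the last curl example; the actual POST is shown commented out so the snippet runs without a live server (endpoint URL and field names follow the curl commands above):

```python
import json

# Same request body as the last curl example above, built as a Python dict.
payload = {
    "text_input": "What is machine learning?",
    "max_tokens": 20,
    "bad_words": "",
    "stop_words": "",
    "pad_id": 2,
    "end_id": 2,
    "top_k": 16,
    "repetition_penalty": 2.0,
}
body = json.dumps(payload)

# With a running server one would post it like this:
# import requests
# r = requests.post("http://localhost:8000/v2/models/ensemble/generate", data=body)
# print(r.json()["text_output"])

print(body)
```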

You could check whether you are really passing the parameters correctly. We now support the Python backend; you can switch to it by changing https://github.com/triton-inference-server/tensorrtllm_backend/blob/ae52bce3ed8ecea468a16483e0dacd3d156ae4fe/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt#L28 to

backend: "python"

and add debug messages at https://github.com/triton-inference-server/tensorrtllm_backend/blob/ae52bce3ed8ecea468a16483e0dacd3d156ae4fe/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py#L120 to check the input parameters.

wanzhenchn commented 1 month ago

Many thanks for your response.

@byshiue When I pass top_p and repetition_penalty, this error occurred:

{"error":"failed to parse the request JSON buffer: Missing a name for object member. at 196"}
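This class of error typically means the JSON body itself is malformed, for example a trailing comma inside an object, which strict JSON parsers reject. Python's own `json` module behaves the same way, so it can be used to pre-check a request body before posting it. A minimal illustration:

```python
import json

good = '{"top_p": 0.95, "repetition_penalty": 1.15}'
bad = '{"top_p": 0.95, "repetition_penalty": 1.15,}'  # note the trailing comma

parsed = json.loads(good)  # parses fine

err = None
try:
    json.loads(bad)
except json.JSONDecodeError as exc:
    # Strict JSON parsers reject trailing commas: an object member is
    # expected after the comma but none is found.
    err = exc

print("good:", parsed)
print("bad:", err.msg)
```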


wanzhenchn commented 1 month ago

Removing the trailing comma in the JSON object solved the problem.