triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Infer failed: Unable to parse 'data': Shape does not match true shape of 'data' field in generate endpoint #369

Open bprus opened 3 months ago

bprus commented 3 months ago

System Info

Who can help?

No response




I followed the official examples for the Llama model. I'm able to set everything up, and everything runs smoothly when using the ensemble model:

curl -X POST localhost:8000/v2/models/ensemble/generate -d '{"text_input": "What is", "max_tokens": 1000}'

and the response is:

{"context_logits":0.0,"cum_log_probs":0.0,"generation_logits":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,...,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"the purpose of the meeting? What are the key issues to be discussed? What are the desired outcomes or decisions to be made?\n\n2. Identify the key stakeholders: Who are the key people that need to be involved in the meeting? What are their roles and responsibilities? What are their interests and perspectives?\n\n3. Determine the meeting format: Will the meeting be formal or informal? Will it be a presentation-style meeting or a discussion-style meeting? What is the appropriate level of formality and structure for the meeting?\n\n4. Choose a suitable location: Where will the meeting be held? Is the location easily accessible and comfortable for all attendees?\n\n5. Establish a clear agenda: What specific topics will be discussed during the meeting? What are the desired outcomes or decisions to be made? What are the key points to be covered?\n\n6. Set a time limit: How long will the meeting last? What is the appropriate length of time for the meeting?\n\n7. Identify any necessary materials: What materials or information will be needed during the meeting? Will any presentations or handouts be needed?\n\n8. Choose a suitable time: What is the best time for the meeting? Will all attendees be available at that time?\n\n9. Establish a clear communication plan: How will the meeting be conducted? Will it be in person, via video conference, or via phone? What is the appropriate communication method for the meeting?\n\n10. Identify any necessary follow-up actions: What actions need to be taken after the meeting? Who is responsible for taking these actions? What are the timelines for these actions?\n\nBy following these steps, you can ensure that your meetings are well-planned, productive, and effective."}

I'm also able to run the preprocessing model:

curl -X POST localhost:8000/v2/models/preprocessing/generate -d '{"QUERY": "What is", "REQUEST_OUTPUT_LEN": 1000}'

and the response is:


Then, when I try to query the tensorrt_llm model directly:

curl -i -X POST localhost:8000/v2/models/tensorrt_llm/generate -d '{"input_ids": [1724, 338], "input_lengths": 2, "request_output_len": 1000}'

I get an error:

{"error":"Unable to parse 'data': Shape does not match true shape of 'data' field"}

Triton is running with debug logging enabled, but the logs contain no additional information:

I0308 13:45:35.029834 96] HTTP request: 2 /v2/models/tensorrt_llm/generate
I0308 13:45:35.029876 96] GetModel() 'tensorrt_llm' version -1
I0308 13:45:35.029884 96] VersionStates() 'tensorrt_llm'
I0308 13:45:35.029980 96] GetModel() 'tensorrt_llm' version -1
I0308 13:45:35.030004 96] [request id: <id_unknown>] Infer failed: Unable to parse 'data': Shape does not match true shape of 'data' field

I tried many variations of the request, such as wrapping values in lists, without any success.
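To make "wrapping values in lists" concrete: each extra level of nesting adds one leading dimension to the shape implied by a JSON value. The sketch below only illustrates that relationship. The helper name `implied_shape` is hypothetical, it assumes rectangular (non-ragged) lists, and it is not Triton's parsing code, so it does not explain how the generate endpoint derives shapes.

```python
def implied_shape(value):
    """Return the shape implied by a (possibly nested) JSON value.

    Hypothetical helper for illustration only; assumes every nesting
    level is rectangular, so only the first element is inspected.
    """
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break  # an empty list has no further levels to inspect
        value = value[0]
    return shape


if __name__ == "__main__":
    # A scalar, a one-element list, a flat list, and a wrapped list.
    for candidate in (1724, [1724], [1724, 338], [[1724, 338]]):
        print(candidate, "->", implied_shape(candidate))
```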

What I did find is that it works if input_ids contains a single element:

curl -i -X POST localhost:8000/v2/models/tensorrt_llm/generate -d '{"input_ids": [1724], "input_lengths": 1, "request_output_len": 1000}'

with response:


Moreover, I'm able to query the infer endpoint successfully, for example:

curl -i -X POST localhost:8000/v2/models/tensorrt_llm/infer -d \
'{"inputs": [{"name" : "input_ids", "shape" : [ 1, 2 ], "datatype" : "INT32", "data" : [1724,338] }, {"name" : "input_lengths", "shape" : [1, 1], "datatype" : "INT32", "data" : [2] }, {"name" : "request_output_len", "shape" : [1, 1], "datatype" : "INT32", "data" : [1000] }]}'

with response:


I guess it's something simple and I'm querying the endpoint the wrong way, but I really can't find a solution. Any help would be appreciated.
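For anyone who wants to reproduce the working infer request from a script, here is a minimal sketch using only the Python standard library. The function names `build_infer_request` and `post_infer` are hypothetical, the body mirrors the curl payload above (batch size 1, INT32 tensors), and the `Content-Type` header is an assumption rather than something the curl command sets.

```python
import json
import urllib.request


def build_infer_request(input_ids, request_output_len):
    """Build an infer request body for a single request (batch size 1).

    Every tensor gets an explicit [1, N] shape, as in the curl example.
    """
    def tensor(name, values):
        return {
            "name": name,
            "shape": [1, len(values)],
            "datatype": "INT32",
            "data": list(values),
        }

    return {
        "inputs": [
            tensor("input_ids", input_ids),
            tensor("input_lengths", [len(input_ids)]),
            tensor("request_output_len", [request_output_len]),
        ]
    }


def post_infer(body, url="http://localhost:8000/v2/models/tensorrt_llm/infer"):
    """POST the body to the infer endpoint and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption, see above
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Only prints the payload; call post_infer(body) against a running server.
    body = build_infer_request([1724, 338], 1000)
    print(json.dumps(body))
```

Unlike the generate endpoint, this request states each tensor's shape explicitly, so nothing has to be derived from the JSON nesting.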

Expected behavior

The generate endpoint returns correct results, and when a request is malformed, the error message is more meaningful.

Actual behavior

The generate endpoint returns an error.

Additional notes
