Open ephraimrothschild opened 2 months ago
I just took a quick look and I'm not seeing any code in our vllm gRPC backend to pass through token counts.
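For reference, vLLM already exposes the token IDs on its `RequestOutput` objects, so the backend could derive usage counts from them. A minimal sketch follows; how those counts would be attached to LocalAI's gRPC reply is an assumption here, not the actual backend schema:

```python
# Sketch: derive OpenAI-style usage counts from a finished vLLM request.
# Only the vLLM side is real API; copying the counts onto LocalAI's gRPC
# Reply message is hypothetical and depends on the backend proto.
from vllm import RequestOutput


def usage_from_output(output: RequestOutput) -> tuple[int, int, int]:
    """Return (prompt_tokens, completion_tokens, total_tokens)."""
    prompt_tokens = len(output.prompt_token_ids or [])
    completion_tokens = sum(len(o.token_ids) for o in output.outputs)
    return prompt_tokens, completion_tokens, prompt_tokens + completion_tokens
```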
Thank you for the heads up - this is a bug we'll want to patch
this is not implemented indeed. definitely something we want to add :+1:
LocalAI version:
localai/localai:v2.20.1-cublas-cuda12
Environment, CPU architecture, OS, and Version:
Linux dev-box 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug 2 20:41:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Describe the bug
When making calls to both the `/chat/completions` and `/completions` endpoints, models backed with vLLM do not count tokens correctly and report that no tokens were used, despite correctly completing the prompt. This is not an issue with vLLM itself, since running the exact same model using vLLM's provided OpenAI server Docker image correctly returns the actual token counts of the response.

To Reproduce
What Works (vLLM direct)
First, we can show the correct behavior coming from vLLM:
Run vLLM with the model:
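For example, using vLLM's OpenAI-compatible server image (the model name below is a placeholder, not necessarily the model from the original report):

```bash
# Serve a model with vLLM's OpenAI-compatible API server on port 8000
docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3
```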
Then send a request to `http://localhost:8000/v1/chat/completions` with a chat completions body; an example request and the response it produces are sketched below.
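A minimal request body, assuming a standard OpenAI-style chat completion call (model name and prompt are placeholders):

```json
{
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "messages": [
    {"role": "user", "content": "Say hello in one sentence."}
  ]
}
```

vLLM answers with a response along these lines (the exact token counts are illustrative):

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "mistralai/Mistral-7B-Instruct-v0.3",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello there, nice to meet you!"},
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 14,
    "completion_tokens": 9,
    "total_tokens": 23
  }
}
```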
Which contains correct usage data about the response.
What Doesn't Work (vLLM via LocalAI)
Now we'll try the same model, with the same configuration, but running through LocalAI instead of directly through vLLM.
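For reference, a minimal LocalAI model definition using the vLLM backend looks roughly like this (a sketch only; the file name and model are placeholders, and any options from the original setup are omitted):

```yaml
# models/vllm-model.yaml (hypothetical path)
name: vllm-model
backend: vllm
parameters:
  model: "mistralai/Mistral-7B-Instruct-v0.3"
```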
Send the same request to the same endpoint (`http://localhost:8000/v1/chat/completions`). However now, notice the response contains all 0s for usage data:
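The completion itself comes back fine; only the usage block is zeroed out, roughly like this (response shape illustrative):

```json
{
  "id": "chatcmpl-456",
  "object": "chat.completion",
  "model": "vllm-model",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello there, nice to meet you!"},
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}
```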
Expected behavior
The response from the vLLM server and the LocalAI server running a vLLM backend should be identical, and in particular LocalAI's usage data should be correct. Instead, it reports all 0s for usage even though the response itself is not empty.
Logs
Additional context
This issue only happens with vLLM-backed models. It does not happen when, for example, we run the same model on LocalAI with a llama.cpp backend.