neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

Correct token usage reporting for OpenAI server #1496

Closed mgoin closed 8 months ago

mgoin commented 8 months ago

The OpenAI server always returned "usage":{"prompt_tokens":2,"total_tokens":4,"completion_tokens":2} because it took the length of the token_ids dictionary, which has two keys, rather than the length of one of its members. For example:

final_res.prompt_token_ids {'input_ids': [1639, 389, 257, 7613], 'attention_mask': [1, 1, 1, 1]}
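The miscount can be reproduced in isolation. The following is a minimal sketch rather than the server's actual code: `count_prompt_tokens` is a hypothetical helper introduced only for illustration, and the dictionary is the value printed above.

```python
# Tokenizer output as printed above: a dict with two keys, not a flat list of ids.
prompt_token_ids = {
    "input_ids": [1639, 389, 257, 7613],
    "attention_mask": [1, 1, 1, 1],
}


def count_prompt_tokens(token_ids):
    """Hypothetical helper: count tokens for either a dict or a plain list of ids."""
    if isinstance(token_ids, dict):
        # Count the ids themselves, not the dict's keys.
        return len(token_ids["input_ids"])
    return len(token_ids)


buggy_count = len(prompt_token_ids)                  # number of keys in the dict
fixed_count = count_prompt_tokens(prompt_token_ids)  # number of token ids

print(buggy_count, fixed_count)  # prints "2 4"
```

Calling `len()` on a dict returns its number of keys, which is why the reported count stayed at 2 regardless of prompt length.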

Test

Server command:

deepsparse.server --integration openai --task text-generation --model_path hf:mgoin/TinyStories-1M-ds

Client command:

curl http://localhost:5543/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer dummy" \
  -d '{
    "model": "hf:mgoin/TinyStories-1M-ds",
    "messages": "You are a helpful assistant."
  }'

Before:

{"id":"cmpl-10b937d72d2b4cf382da0c6e54241814","object":"chat.completion","created":1703173842,"model":"hf:mgoin/TinyStories-1M-ds","choices":[{"message":{"role":"assistant","content":" You can make a big house for the house and the house. You can make"},"finish_reason":"length"}],"usage":{"prompt_tokens":2,"total_tokens":4,"completion_tokens":2}}

After:

{"id":"cmpl-85104f94d3294b88b2aae34095cf204d","object":"chat.completion","created":1703173678,"model":"hf:mgoin/TinyStories-1M-ds","choices":[{"message":{"role":"assistant","content":" You can make a big house for the house and the house. You can make"},"finish_reason":"length"}],"usage":{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16}}
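The two usage blocks can be compared with a short script. This is only a sketch: `usage_is_consistent` is a hypothetical helper, and the JSON strings are copied from the responses above.

```python
import json

# Usage blocks copied from the "Before" and "After" responses.
before = json.loads('{"prompt_tokens":2,"total_tokens":4,"completion_tokens":2}')
after = json.loads('{"prompt_tokens":6,"total_tokens":22,"completion_tokens":16}')


def usage_is_consistent(usage):
    """Hypothetical check: total_tokens should equal prompt plus completion tokens."""
    return usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]


for label, usage in (("before", before), ("after", after)):
    print(label, usage, usage_is_consistent(usage))
```

Both blocks pass the sum check, so internal consistency alone would not have caught this bug. What changes with the fix is the counts themselves.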

dsikka commented 8 months ago

Good catch