neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[OpenAI] Add logprob support #1522

Closed dsikka closed 5 months ago

dsikka commented 6 months ago

Summary

Testing / Question?

Local Testing

num_cores: 2
num_workers: 2
integration: openai
endpoints:
  - task: text_generation
    model: "hf:mgoin/TinyStories-1M-ds"
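Assuming the YAML above is saved as config.yaml, the server can presumably be launched with something like the following (the flag name is an assumption; check deepsparse.server --help):

```shell
# Launch the deepsparse server with the config above.
# Assumes the YAML is saved as config.yaml; flag name per deepsparse.server --help.
deepsparse.server --config-file config.yaml
```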

Client Code:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:5543/v1", api_key="EMPTY")

models = client.models.list()

model = "hf:mgoin/TinyStories-1M-ds"
print(f"Accessing model API '{model}'")

# Completion API
stream = True
completion = client.completions.create(
    prompt="The dog",
    max_tokens=10,
    stream=stream,
    model=model,
    logprobs=True
)

for c in completion:
    print(c)

Output:

Accessing model API 'hf:mgoin/TinyStories-1M-ds'
Completion(id='cmpl-67b04e6e35a84efd9950438d47c1794c', choices=[CompletionChoice(finish_reason=None, index=None, logprobs=Logprobs(text_offset=[], token_logprobs=[-1.5180178880691528], tokens=[' very'], top_logprobs=None), text=' very')], created=1705011569, model='/home/dsikka/.cache/huggingface/hub/models--mgoin--TinyStories-1M-ds/snapshots/eba4f1f16d00041d78c8f8a3bba5d3fa5afe5513/model.onnx', object='text_completion', system_fingerprint=None, usage=None)
Completion(id='cmpl-67b04e6e35a84efd9950438d47c1794c', choices=[CompletionChoice(finish_reason=None, index=None, logprobs=Logprobs(text_offset=[], token_logprobs=[-1.396243691444397], tokens=[' scared'], top_logprobs=None), text=' scared')], created=1705011569, model='/home/dsikka/.cache/huggingface/hub/models--mgoin--TinyStories-1M-ds/snapshots/eba4f1f16d00041d78c8f8a3bba5d3fa5afe5513/model.onnx', object='text_completion', system_fingerprint=None, usage=None)
Completion(id='cmpl-67b04e6e35a84efd9950438d47c1794c', choices=[CompletionChoice(finish_reason=None, index=None, logprobs=Logprobs(text_offset=[], token_logprobs=[-0.6142183542251587], tokens=['.'], top_logprobs=None), text='.')], created=1705011569, model='/home/dsikka/.cache/huggingface/hub/models--mgoin--TinyStories-1M-ds/snapshots/eba4f1f16d00041d78c8f8a3bba5d3fa5afe5513/model.onnx', object='text_completion', system_fingerprint=None, usage=None)
Completion(id='cmpl-67b04e6e35a84efd9950438d47c1794c', choices=[CompletionChoice(finish_reason=None, index=None, logprobs=Logprobs(text_offset=[], token_logprobs=[-0.7276340126991272], tokens=[' He'], top_logprobs=None), text=' He')], created=1705011569, model='/home/dsikka/.cache/huggingface/hub/models--mgoin--TinyStories-1M-ds/snapshots/eba4f1f16d00041d78c8f8a3bba5d3fa5afe5513/model.onnx', object='text_completion', system_fingerprint=None, usage=None)
Completion(id='cmpl-67b04e6e35a84efd9950438d47c1794c', choices=[CompletionChoice(finish_reason=None, index=None, logprobs=Logprobs(text_offset=[], token_logprobs=[-2.1691243648529053], tokens=[' wanted'], top_logprobs=None), text=' wanted')], created=1705011569, model='/home/dsikka/.cache/huggingface/hub/models--mgoin--TinyStories-1M-ds/snapshots/eba4f1f16d00041d78c8f8a3bba5d3fa5afe5513/model.onnx', object='text_completion', system_fingerprint=None, usage=None)
Completion(id='cmpl-67b04e6e35a84efd9950438d47c1794c', choices=[CompletionChoice(finish_reason=None, index=None, logprobs=Logprobs(text_offset=[], token_logprobs=[-0.06524639576673508], tokens=[' to'], top_logprobs=None), text=' to')], created=1705011569, model='/home/dsikka/.cache/huggingface/hub/models--mgoin--TinyStories-1M-ds/snapshots/eba4f1f16d00041d78c8f8a3bba5d3fa5afe5513/model.onnx', object='text_completion', system_fingerprint=None, usage=None)
Completion(id='cmpl-67b04e6e35a84efd9950438d47c1794c', choices=[CompletionChoice(finish_reason=None, index=None, logprobs=Logprobs(text_offset=[], token_logprobs=[-2.3006293773651123], tokens=[' go'], top_logprobs=None), text=' go')], created=1705011569, model='/home/dsikka/.cache/huggingface/hub/models--mgoin--TinyStories-1M-ds/snapshots/eba4f1f16d00041d78c8f8a3bba5d3fa5afe5513/model.onnx', object='text_completion', system_fingerprint=None, usage=None)
Completion(id='cmpl-67b04e6e35a84efd9950438d47c1794c', choices=[CompletionChoice(finish_reason=None, index=None, logprobs=Logprobs(text_offset=[], token_logprobs=[-1.681646466255188], tokens=[' back'], top_logprobs=None), text=' back')], created=1705011569, model='/home/dsikka/.cache/huggingface/hub/models--mgoin--TinyStories-1M-ds/snapshots/eba4f1f16d00041d78c8f8a3bba5d3fa5afe5513/model.onnx', object='text_completion', system_fingerprint=None, usage=None)
Completion(id='cmpl-67b04e6e35a84efd9950438d47c1794c', choices=[CompletionChoice(finish_reason=None, index=None, logprobs=Logprobs(text_offset=[], token_logprobs=[-0.7840849757194519], tokens=[' to'], top_logprobs=None), text=' to')], created=1705011569, model='/home/dsikka/.cache/huggingface/hub/models--mgoin--TinyStories-1M-ds/snapshots/eba4f1f16d00041d78c8f8a3bba5d3fa5afe5513/model.onnx', object='text_completion', system_fingerprint=None, usage=None)
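Each streamed chunk above carries exactly one token and its logprob, so a client can stitch the stream back together. A minimal pure-Python sketch, using the tokens and token_logprobs copied from the output above:

```python
# Each streamed chunk carries one (token, logprob) pair; accumulate them into
# the full completion text and a total log-probability for the sequence.
# Values copied from the streamed output above.
chunks = [
    (" very", -1.5180178880691528),
    (" scared", -1.396243691444397),
    (".", -0.6142183542251587),
    (" He", -0.7276340126991272),
    (" wanted", -2.1691243648529053),
    (" to", -0.06524639576673508),
    (" go", -2.3006293773651123),
    (" back", -1.681646466255188),
    (" to", -0.7840849757194519),
]

text = "".join(tok for tok, _ in chunks)          # -> " very scared. He wanted to go back to"
total_logprob = sum(lp for _, lp in chunks)       # log P(completion | prompt)

print(text)
print(total_logprob)
```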
mgoin commented 6 months ago

In between some concurrent work between deepsparse and lm-eval to test the OpenAI server, I ran into an issue: top_logprobs isn't implemented. Here is the vLLM create_logprobs function that I tried hacking into your diff, but I got stuck without the top_logprobs input: https://github.com/vllm-project/vllm/blob/827cbcd37c464452b79956fa4a564199e6c0ab6a/vllm/entrypoints/openai/api_server.py#L208-L240C20
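For context, the shape that a create_logprobs-style helper produces is parallel lists: tokens, their logprobs, text offsets, and (when requested) one dict per position mapping the most likely candidate tokens to their logprobs. A hypothetical sketch of that transformation, assuming the engine can supply a per-position {token: logprob} candidate mapping (the missing top_logprobs input described above); function name and inputs are illustrative, not deepsparse's API:

```python
# Hypothetical sketch of building an OpenAI-style logprobs payload. Assumes the
# engine supplies, per generated position, a dict mapping candidate tokens to
# their log-probabilities (the missing "top_logprobs input").
def create_logprobs_sketch(chosen_tokens, per_position_candidates, num_top=None):
    tokens, token_logprobs, top_logprobs, text_offset = [], [], [], []
    offset = 0
    for tok, candidates in zip(chosen_tokens, per_position_candidates):
        tokens.append(tok)
        token_logprobs.append(candidates[tok])  # logprob of the sampled token
        text_offset.append(offset)              # char offset of tok in the text
        offset += len(tok)
        if num_top is not None:
            # keep only the num_top most likely candidates at this position
            ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
            top_logprobs.append(dict(ranked[:num_top]))
    return {
        "tokens": tokens,
        "token_logprobs": token_logprobs,
        "top_logprobs": top_logprobs if num_top is not None else None,
        "text_offset": text_offset,
    }

# Toy example with made-up candidate distributions:
out = create_logprobs_sketch(
    [" He", " ran"],
    [{" He": -0.5, " She": -1.2, " It": -2.0},
     {" ran": -0.7, " walked": -1.1}],
    num_top=2,
)
```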

Here is the request, response, and traceback I got within lm-eval-harness. Commands:

deepsparse.server --integration openai --task text-generation --model_path hf:mgoin/TinyStories-1M-ds
lm_eval --model local-completions --model_args base_url=http://localhost:5543/v1,model=hf:mgoin/TinyStories-1M-ds,tokenizer_backend=huggingface,tokenizer=mgoin/TinyStories-1M-ds --tasks hellaswag --num_fewshot 0
REQUEST = {'model': 'hf:mgoin/TinyStories-1M-ds', 'prompt': [[41183, 290, 14620, 25, 1374, 284, 910, 23748, 287, 410, 1155, 22678, 13, 13816, 366, 2124, 259, 442, 24247, 78, 366, 355, 257, 2276, 31933, 13, 1002, 345, 691, 2193, 530, 410, 1155, 22678, 31933, 11, 366, 2124, 259, 442, 24247, 78, 366, 561, 1884, 307, 262, 1266, 31933, 284, 3853, 13, 350, 1313, 8652, 366, 2124, 259, 442, 24247, 78, 366, 355, 25, 7813, 474, 322, 262, 1573, 366, 442, 24247, 78, 366, 1724, 366, 23748, 366, 287, 46932, 11, 475, 345, 561, 8365, 779, 340, 3436, 13, 220, 17106, 11, 12581, 6428, 366, 627, 2634, 366, 1724, 366, 23748, 366, 287, 410, 1155, 22678, 11, 475, 345, 743, 635, 779, 340, 355, 366, 627, 2634, 13, 366, 329, 46932, 11636, 11, 340, 318, 16293, 366, 1575, 64, 2356, 64, 442, 24247, 78, 12, 421, 2634, 13]], 'echo': True, 'max_tokens': 0, 'temperature': 0.0, 'logprobs': 10, 'seed': 1234}

2024-01-12:21:52:50,368 INFO     [_client.py:1027] HTTP Request: POST http://localhost:5543/v1/completions "HTTP/1.1 200 OK"

RESPONSE = CompletionChoice(finish_reason='stop', index=None, logprobs=Logprobs(text_offset=[], token_logprobs=[-1.3718655109405518, -1.6813075542449951, -0.5835205912590027, -2.3857421875, -1.1850147247314453, -1.4327609539031982, -1.0862425565719604, -1.3265399932861328, -0.01401725597679615, -0.7671912908554077, -1.8408502340316772, -0.02035493403673172, -0.805853545665741, -1.3963234424591064, -0.6728101372718811, -0.28462377190589905, -0.18176312744617462, -1.6125370264053345, -0.8710795640945435, -1.1399714946746826, -0.4823414087295532, -1.6222418546676636, -1.6073075532913208, -0.9253568649291992, -0.28968146443367004, -2.509056568145752, -1.207929015159607, -1.9377094507217407, -0.39531823992729187, -0.5303062200546265, -1.6694589853286743, -0.24369390308856964, -1.382930040359497, -0.42184001207351685, -2.2233667373657227, -0.8875962495803833, -1.6202012300491333, -1.9755644798278809, -2.335638999938965, -1.3909152746200562, -2.493443489074707, -0.8141875267028809, -1.8022541999816895, -1.3976320028305054, -2.528510570526123, -0.4395048916339874, -0.8453584313392639, -0.008289567194879055, -0.012714286334812641, -1.3391444683074951, -0.009850701317191124, -1.1681102514266968, -1.4828898906707764, -1.0213128328323364, -0.8255200982093811, -1.1556187868118286, -2.5673587322235107, -0.07395735383033752, -2.212928295135498, -0.856827974319458, -1.0892446041107178, -0.6976301074028015, -0.41025859117507935, -2.5871167182922363, -0.6402055025100708, -1.857731580734253, -0.3513781726360321, -0.28716686367988586, -0.880348801612854, -2.105755090713501, -0.255537211894989, -2.720140218734741, -0.19674457609653473, -1.0769606828689575, -1.0204941034317017, -0.6762701869010925, -0.1618189811706543, -1.2297614812850952, -1.4207383394241333, -1.2299003601074219, -0.49282366037368774, -1.246532917022705, -1.6099696159362793, -2.2601776123046875, -1.465539813041687, -0.001021907082758844, -2.3649849891662598, -0.3639565706253052, -0.93106609582901, -0.9325934052467346, 
-1.520829439163208, -0.8304876089096069, -0.01390259712934494, -0.010546673089265823, -1.4068495035171509, -0.8410376310348511, -0.032891422510147095, -0.011724268086254597], tokens=[' You', ' can', "'t", ' take', ' it', ' away', '".', ' ', '\n', '\n', 'Sam', 'my', ' was', ' sad', ',', ' but', ' he', ' knew', ' he', ' had', ' to', ' be', ' careful', '.', ' He', ' said', ' "', 'No', ',', ' I', ' can', "'t", ' do', ' it', '.', ' I', ' will', ' be', ' careful', ' and', ' try', ' to', ' make', ' it', ' better', '."', ' ', '\n', '\n', 'Sam', 'my', ' was', ' very', ' sad', ' and', ' he', ' decided', ' to', ' take', ' a', ' break', '.', ' He', ' put', ' the', ' band', 'age', ' on', ' the', ' floor', ' and', ' put', ' it', ' in', ' his', ' pocket', '.', ' He', ' was', ' so', ' happy', ' and', ' he', ' was', ' able', ' to', ' play', ' with', ' the', ' band', '.', ' ', '\n', '\n', 'The', ' end', '.', '\n'], top_logprobs=None), text=' You can\'t take it away". \n\nSammy was sad, but he knew he had to be careful. He said "No, I can\'t do it. I will be careful and try to make it better." \n\nSammy was very sad and he decided to take a break. He put the bandage on the floor and put it in his pocket. He was so happy and he was able to play with the band. \n\nThe end.\n')

Traceback (most recent call last):
  File "/home/mgoin/venvs/clip-ret/bin/lm_eval", line 8, in <module>
    sys.exit(cli_evaluate())
  File "/home/mgoin/code/lm-evaluation-harness-mgoin/lm_eval/__main__.py", line 231, in cli_evaluate
    results = evaluator.simple_evaluate(
  File "/home/mgoin/code/lm-evaluation-harness-mgoin/lm_eval/utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "/home/mgoin/code/lm-evaluation-harness-mgoin/lm_eval/evaluator.py", line 150, in simple_evaluate
    results = evaluate(
  File "/home/mgoin/code/lm-evaluation-harness-mgoin/lm_eval/utils.py", line 415, in _wrapper
    return fn(*args, **kwargs)
  File "/home/mgoin/code/lm-evaluation-harness-mgoin/lm_eval/evaluator.py", line 325, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/mgoin/code/lm-evaluation-harness-mgoin/lm_eval/models/openai_completions.py", line 201, in loglikelihood
    return self._loglikelihood_tokens(new_reqs)
  File "/home/mgoin/code/lm-evaluation-harness-mgoin/lm_eval/models/openai_completions.py", line 248, in _loglikelihood_tokens
    answer = get_result(resp, ctxlen)
  File "/home/mgoin/code/lm-evaluation-harness-mgoin/lm_eval/models/openai_completions.py", line 35, in get_result
    top_tokens = response.logprobs.top_logprobs[i]
TypeError: 'NoneType' object is not subscriptable
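For reference, lm-eval's get_result (the frame at the bottom of the traceback) needs top_logprobs to be a list with one dict per token position: it sums the continuation logprobs and checks whether each continuation token was the argmax at its position. A simplified sketch of that logic with hypothetical data, showing why top_logprobs=None blows up:

```python
# Simplified sketch of lm-eval's get_result logic. It needs top_logprobs to be
# a per-token list of {token: logprob} dicts; with top_logprobs=None, the
# subscript raises TypeError exactly as in the traceback above.
def get_result_sketch(token_logprobs, tokens, top_logprobs, ctxlen):
    # sum of logprobs over the continuation (everything past the context)
    continuation_logprob = sum(token_logprobs[ctxlen:])
    is_greedy = True
    for i in range(ctxlen, len(tokens)):
        top_choices = top_logprobs[i]  # TypeError here when top_logprobs is None
        top_token = max(top_choices, key=top_choices.get)
        if top_token != tokens[i]:
            is_greedy = False
            break
    return continuation_logprob, is_greedy

# Hypothetical 3-token response with a 1-token context:
lp = [-0.1, -0.5, -2.0]
toks = ["A", "B", "C"]
tops = [{"A": -0.1}, {"B": -0.5, "X": -0.9}, {"Y": -1.0, "C": -2.0}]
score, greedy = get_result_sketch(lp, toks, tops, ctxlen=1)
```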
dsikka commented 6 months ago

This is ready for merge apart from support for the top_logprobs field. We need clarification on what we're actually returning for this field.
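For what it's worth, in the legacy OpenAI Completions API the integer logprobs=N request parameter (lm-eval sends 'logprobs': 10 above) asks for, per returned token, a dict of roughly the N most likely tokens at that position mapped to their logprobs, and clients index the resulting list by token position. A minimal illustration of the expected shape (all token strings and values hypothetical):

```python
# Hypothetical illustration of the shape consumers expect for top_logprobs:
# one dict per token position, mapping candidate token -> logprob.
tokens = [" You", " can"]
top_logprobs = [
    {" You": -1.37, " The": -1.60, " I": -2.10},
    {" can": -1.68, " are": -1.90, " will": -2.40},
]
# Invariant clients rely on: one top_logprobs entry per generated token.
assert len(top_logprobs) == len(tokens)
```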

dsikka commented 6 months ago

@mgoin When you're happy with this and it passes any tests you have, let me know and we can merge this in. Also, if the tests you're running locally are worth adding to the openai_server tests, I can do that as part of this PR as well.