neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[server] Update OpenAI endpoints #1445

Closed: dsikka closed this 9 months ago

dsikka commented 9 months ago

Summary

Testing

from openai import OpenAI

# Point the OpenAI client at the locally running deepsparse server.
client = OpenAI(base_url="http://localhost:5543/v1", api_key="EMPTY")

# List the models served by the endpoint.
models = client.models.list()
print(f"Available models: {[m.id for m in models.data]}")

model = "hf:neuralmagic/mpt-7b-chat-pruned50-quant"
print(f"Accessing model API '{model}'")

# Chat Completions API
stream = True
completion = client.chat.completions.create(
    messages=[{"role": "user", "content": "Talk about the Toronto Raptors."}],
    stream=stream,
    max_tokens=100,
    model=model,
)

print("Chat results:")
if stream:
    text = ""
    for c in completion:
        print(c)
        text += c.choices[0].delta.content or ""
    print(text)
else:
    print(completion)

# Completions API
stream = True
completion = client.completions.create(
    prompt="How are you today?",
    stream=stream,
    max_tokens=100,
    model=model,
)

print("Completion results:")
if stream:
    text = ""
    for c in completion:
        print(c)
        text += c.choices[0].text
    print(text)
else:
    print(completion)
mgoin commented 9 months ago

I ran through the script using hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds as the model, after installing pip install fschat accelerate.

It looks like the final handshake of the stream goes wrong: the client receives the last chunk (finish_reason='length'), but the connection is closed before the stream terminates cleanly.

client.txt

ChatCompletionChunk(id='cmpl-c735b32f15c043b49893cd6a0ac7ab96', choices=[Choice(delta=ChoiceDelta(content='', function_call=None, role=None, tool_calls=None), finish_reason='length', index=0)], created=1701898636, model='hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds', object='chat.completion.chunk', system_fingerprint=None)
httpx.RemoteProtocolError: peer closed connection without sending complete message body (incomplete chunked read)
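For reference, the disconnect can be tolerated on the client side while the server behavior is sorted out. The snippet below is a hypothetical guard, not part of this PR; it assumes the same local endpoint and the TinyLlama model above, and simply catches httpx.RemoteProtocolError so the chunks that did arrive are still printed:

import httpx
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5543/v1", api_key="EMPTY")

# Consume the chat stream, tolerating the server dropping the connection
# after the final chunk (the failure shown in client.txt above).
stream = client.chat.completions.create(
    messages=[{"role": "user", "content": "Talk about the Toronto Raptors."}],
    stream=True,
    max_tokens=100,
    model="hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds",
)
try:
    for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")
except httpx.RemoteProtocolError:
    # The server closed the connection without a clean end-of-stream.
    print("\n[stream ended without a clean close]")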

server.txt

  File "/Users/mgoin/code/deepsparse/src/deepsparse/server/openai_server.py", line 159, in abort_request
    await pipeline.abort(request_id)
AttributeError: 'TextGenerationPipeline' object has no attribute 'abort'
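A possible server-side mitigation (just a sketch, not the fix in this PR) would be to guard the abort call for pipelines that do not implement abort; the signature below is assumed, since the traceback only shows the failing line:

# openai_server.py, abort_request (hypothetical guarded version)
async def abort_request(pipeline, request_id: str):
    abort = getattr(pipeline, "abort", None)
    if abort is None:
        # TextGenerationPipeline exposes no abort(); nothing to cancel.
        return
    await abort(request_id)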
dsikka commented 9 months ago

What script did you use? The example script in the PR description? That seems to work for me. If you send me your code/example, I can investigate.