neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[Pipeline Refactor][server][OpenAI] Enable OpenAI to use new text gen pipeline #1477

Closed dsikka closed 8 months ago

dsikka commented 8 months ago

Summary

Testing

With this change, the OpenAI server integration supports continuous batching. Example server config:


num_cores: 2
num_workers: 2
endpoints:
  - task: text_generation
    model: hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
    kwargs:
      {"continuous_batch_sizes": [2], "internal_kv_cache": False}

Start the server:

deepsparse.server --config_file new_sample_config.yaml --integration openai

Using the API:


from openai import OpenAI

client = OpenAI(base_url="http://localhost:5543/v1", api_key="EMPTY")

# List the models the server is exposing
models = client.models.list()

model = "hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds"
print(f"Accessing model API '{model}'")

# Completion API: n=8 parallel generations exercises continuous batching
stream = False
completion = client.completions.create(
    prompt="The sun shined",
    stream=stream,
    n=8,
    max_tokens=10,
    model=model,
)

print(completion)
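With `stream=True`, `client.completions.create` instead returns an iterable of chunks, each carrying the next piece of generated text in `choices[0].text`. A sketch of the consumption pattern, using stand-in chunk dicts so it runs without a live server (real chunks are objects with `.choices[0].text` attributes, not dicts):

```python
# Stand-in for the chunk stream that
# client.completions.create(..., stream=True) would yield from the server.
chunks = [{"choices": [{"text": t}]} for t in ["The", " sun", " shined", " on"]]

# Concatenate the streamed pieces as they arrive
pieces = [chunk["choices"][0]["text"] for chunk in chunks]
print("".join(pieces))  # The sun shined on
```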