neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[Pipeline Refactor][server][OpenAI] Enable OpenAI to use new text gen pipeline #1477

Closed dsikka closed 8 months ago

dsikka commented 8 months ago

Summary

Testing

With this change, the OpenAI server integration supports continuous batching. Example server config:


num_cores: 2
num_workers: 2
endpoints:
  - task: text_generation
    model: hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
    kwargs:
      {"continuous_batch_sizes": [2], "internal_kv_cache": False}

Start the server:

deepsparse.server --config_file new_sample_config.yaml --integration openai

Using the API:


from openai import OpenAI

client = OpenAI(base_url="http://localhost:5543/v1", api_key="EMPTY")

# List the models the server is exposing
models = client.models.list()

model = "hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds"
print(f"Accessing model API '{model}'")

# Completion API: n=8 parallel generations exercises continuous batching
stream = False
completion = client.completions.create(
    prompt="The sun shined",
    stream=stream,
    n=8,
    max_tokens=10,
    model=model,
)

print(completion)
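With `stream=True`, `client.completions.create` instead returns an iterable of chunks, each carrying the next piece of generated text in `choices[0].text`. A sketch of the consumption pattern, using stand-in chunk dicts so it runs without a live server (real chunks are objects with `.choices[0].text` attributes, not dicts):

```python
# Stand-in for the chunk stream that
# client.completions.create(..., stream=True) would yield from the server.
chunks = [{"choices": [{"text": t}]} for t in ["The", " sun", " shined", " on"]]

# Concatenate the streamed pieces as they arrive
pieces = [chunk["choices"][0]["text"] for chunk in chunks]
print("".join(pieces))  # The sun shined on
```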