neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[Pipeline Refactor][Text Generation] Updating Pipeline Execution and Enable Streaming #1484

Closed dsikka closed 8 months ago

dsikka commented 8 months ago

Summary

This PR restructures pipeline.py and subgraph execution so that async and non-async functions are separated. Each of these can now also return a Generator or AsyncGenerator when the pipeline has streaming set in its inference_state. With this restructuring, streaming is now enabled.
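As a rough illustration of the dispatch pattern described above (a minimal sketch, not DeepSparse's actual pipeline code): the sync and async paths live in separate functions, and each returns either a drained final result or a generator of intermediate results depending on a streaming flag.

```python
# Sketch only: hypothetical run()/run_async() helpers standing in for the
# separated sync/async pipeline execution paths this PR introduces.
import asyncio
from typing import AsyncGenerator, Generator, List, Union


def _run_subgraphs(tokens: List[str]) -> Generator[str, None, None]:
    # Stand-in for per-token subgraph execution.
    for tok in tokens:
        yield tok.upper()


def run(tokens: List[str], streaming: bool) -> Union[str, Generator[str, None, None]]:
    gen = _run_subgraphs(tokens)
    if streaming:
        return gen  # caller iterates token by token
    return " ".join(gen)  # non-streaming: drain into one final result


async def run_async(tokens: List[str], streaming: bool) -> Union[str, AsyncGenerator[str, None]]:
    async def agen() -> AsyncGenerator[str, None]:
        for out in _run_subgraphs(tokens):
            await asyncio.sleep(0)  # yield control between tokens
            yield out

    if streaming:
        return agen()
    return " ".join([out async for out in agen()])
```

The point of the split is that callers always know which of the four return shapes (result, coroutine, generator, async generator) they get from the function they called plus the streaming flag, rather than one function branching internally on both.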

Important Changes

Other changes

Testing:

Server Testing:

deepsparse.server --config_file sample_config.yaml

sample_config.yaml:

num_cores: 2
num_workers: 2
endpoints:
  - task: text_generation
    model: hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
    kwargs:
      {"continuous_batch_sizes": [2, 4]}
  - task: question_answering
    model: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni
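The `continuous_batch_sizes: [2, 4]` kwarg suggests the engine is compiled at a few fixed batch sizes and the scheduler picks among them at run time. A hedged sketch of that bucketing idea (illustration only, not the actual DeepSparse scheduler):

```python
# Hypothetical batch-size selection: choose the largest compiled batch size
# that does not exceed the number of pending requests, falling back to the
# smallest compiled size when the queue is short.
from typing import List


def pick_batch_size(pending: int, sizes: List[int]) -> int:
    fitting = [s for s in sorted(sizes) if s <= pending]
    return fitting[-1] if fitting else min(sizes)
```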

We can get streaming responses through the following request:


import requests

url = "http://localhost:5543/v2/models/text_generation-0/infer"

obj = {
    "prompt": ["The sun shined", "Oh hello!"],
    "streaming": True,
    "generation_kwargs": {
        "max_length": 20,
    },
}

response = requests.post(url, json=obj, stream=True)
for chunk in response.iter_lines():
    if chunk:
        print(chunk)
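Each non-empty chunk from `iter_lines()` can be decoded as it arrives. A small sketch of that client-side handling, using simulated chunks in place of a live server (the exact response schema here is an assumption, not the documented DeepSparse output format):

```python
# Sketch: decode streamed lines one at a time, assuming the server emits one
# JSON object per line. The payload shape below is hypothetical.
import json

chunks = [
    b'{"generations": [{"text": " brightly"}]}',
    b"",  # blank keep-alive lines are skipped below
    b'{"generations": [{"text": " today"}]}',
]

texts = []
for chunk in chunks:
    if chunk:  # skip empty keep-alive lines
        payload = json.loads(chunk)
        texts.append(payload["generations"][0]["text"])

print("".join(texts))
```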

Outside of the server:

from deepsparse import Pipeline

model_path = "hf:mgoin/TinyStories-1M-ds"

pipeline = Pipeline.create(
    task="text_generation",
    model_path=model_path,
    engine_type="onnxruntime",
    internal_kv_cache=False,
    continuous_batch_sizes=[2,4]
)
output = pipeline(
    ["the dog barked", "the sun shined"],
    streaming=True,
    generation_kwargs={"max_length": 20, "num_return_sequences": 4},
    do_sample=True,
)
for o in output:
    print(o)
dsikka commented 8 months ago

The quality check is failing because the license was removed from the base README.md on main.