neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[Pipeline Refactor][Text Generation] Updating Pipeline Execution and Enable Streaming #1484

Closed dsikka closed 8 months ago

dsikka commented 8 months ago

Summary

This PR restructures pipeline.py and subgraph execution so that async and non-async functions are separated. Each of these can now also return a Generator or AsyncGenerator when the pipeline has streaming set in its inference_state. With this restructuring, streaming is now enabled.
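As a rough illustration of the dispatch pattern described above (a minimal sketch, not DeepSparse's actual pipeline code): the sync and async paths live in separate functions, and each returns either a drained final result or a generator of intermediate results depending on a streaming flag.

```python
# Sketch only: hypothetical run()/run_async() helpers standing in for the
# separated sync/async pipeline execution paths this PR introduces.
import asyncio
from typing import AsyncGenerator, Generator, List, Union


def _run_subgraphs(tokens: List[str]) -> Generator[str, None, None]:
    # Stand-in for per-token subgraph execution.
    for tok in tokens:
        yield tok.upper()


def run(tokens: List[str], streaming: bool) -> Union[str, Generator[str, None, None]]:
    gen = _run_subgraphs(tokens)
    if streaming:
        return gen  # caller iterates token by token
    return " ".join(gen)  # non-streaming: drain into one final result


async def run_async(tokens: List[str], streaming: bool) -> Union[str, AsyncGenerator[str, None]]:
    async def agen() -> AsyncGenerator[str, None]:
        for out in _run_subgraphs(tokens):
            await asyncio.sleep(0)  # yield control between tokens
            yield out

    if streaming:
        return agen()
    return " ".join([out async for out in agen()])
```

The point of the split is that callers always know which of the four return shapes (result, coroutine, generator, async generator) they get from the function they called plus the streaming flag, rather than one function branching internally on both.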

Important Changes

Other changes

Testing:

Server Testing:

deepsparse.server --config_file sample_config.yaml

sample_config.yaml:

num_cores: 2
num_workers: 2
endpoints:
  - task: text_generation
    model: hf:neuralmagic/TinyLlama-1.1B-Chat-v0.4-pruned50-quant-ds
    kwargs:
      {"continuous_batch_sizes": [2, 4]}
  - task: question_answering
    model: zoo:nlp/question_answering/bert-base/pytorch/huggingface/squad/12layer_pruned80_quant-none-vnni
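The `continuous_batch_sizes: [2, 4]` kwarg suggests the engine is compiled at a few fixed batch sizes and the scheduler picks among them at run time. A hedged sketch of that bucketing idea (illustration only, not the actual DeepSparse scheduler):

```python
# Hypothetical batch-size selection: choose the largest compiled batch size
# that does not exceed the number of pending requests, falling back to the
# smallest compiled size when the queue is short.
from typing import List


def pick_batch_size(pending: int, sizes: List[int]) -> int:
    fitting = [s for s in sorted(sizes) if s <= pending]
    return fitting[-1] if fitting else min(sizes)
```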

We can get streaming responses through the following request:


import requests

url = "http://localhost:5543/v2/models/text_generation-0/infer"

obj = {
    "prompt": ["The sun shined", "Oh hello!"],
    "streaming": True,
    "generation_kwargs": {
        "max_length": 20,
    },
}

response = requests.post(url, json=obj, stream=True)
for chunk in response.iter_lines():
    if chunk:
        print(chunk)
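Each non-empty chunk from `iter_lines()` can be decoded as it arrives. A small sketch of that client-side handling, using simulated chunks in place of a live server (the exact response schema here is an assumption, not the documented DeepSparse output format):

```python
# Sketch: decode streamed lines one at a time, assuming the server emits one
# JSON object per line. The payload shape below is hypothetical.
import json

chunks = [
    b'{"generations": [{"text": " brightly"}]}',
    b"",  # blank keep-alive lines are skipped below
    b'{"generations": [{"text": " today"}]}',
]

texts = []
for chunk in chunks:
    if chunk:  # skip empty keep-alive lines
        payload = json.loads(chunk)
        texts.append(payload["generations"][0]["text"])

print("".join(texts))
```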

Outside of the server:

from deepsparse import Pipeline

model_path = "hf:mgoin/TinyStories-1M-ds"

pipeline = Pipeline.create(
    task="text_generation",
    model_path=model_path,
    engine_type="onnxruntime",
    internal_kv_cache=False,
    continuous_batch_sizes=[2,4]
)
output = pipeline(
    ["the dog barked", "the sun shined"],
    streaming=True,
    generation_kwargs={"max_length": 20, "num_return_sequences": 4},
    do_sample=True,
)
for o in output:
    print(o)
dsikka commented 8 months ago

The quality check is failing because the license was removed from the base README.md on main.