neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[Pipeline Refactor] Fix Operator scheduling to fix issue with slow execution #1453

Closed: dsikka closed 8 months ago

dsikka commented 9 months ago

Summary

Testing

Time Comparisons

Multiple Prompts (3) with Multiple Generations (4 per prompt); times are total wall-clock seconds for a single pipeline call:

| Pipeline | ORT | Continuous Batching, ORT | Deepsparse with External KV Cache | Continuous Batching, Deepsparse with External KV Cache | Deepsparse with Internal KV Cache |
|----------|-------|-------|-------|-------|-------|
| v1       | 58.44 | x     | 46.41 | x     | 15.44 |
| v2       | 60.33 | 47.34 | 46.24 | 32.00 | 14.23 |
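
The two deepsparse KV cache columns for v2 can be collected with a small sweep over the same pipeline constructor used in the script below. A minimal sketch (the ORT columns would presumably only swap the engine_type passed in engine_kwargs):

import time

from deepsparse.transformers.pipelines.text_generation import TextGenerationInput
from deepsparse.v2.text_generation.pipeline import TextGenerationPipeline

model_path = "hf:neuralmagic/mpt-7b-chat-pruned50-quant"

input_value = TextGenerationInput(
    prompt=["Hello there!", "The sun shined bright", "The dog barked"],
    generation_kwargs={
        "num_return_sequences": 4,
        "max_new_tokens": 20,
        "do_sample": True,
    },
)

# time one full pipeline call per KV cache mode (deepsparse engine only;
# the ORT columns of the table would use a different engine_type)
for internal_kv_cache in (False, True):
    pipeline = TextGenerationPipeline(
        model_path=model_path,
        engine_kwargs={"engine_type": "deepsparse"},
        internal_kv_cache=internal_kv_cache,
    )
    start = time.time()
    pipeline(input_value)
    print("internal_kv_cache =", internal_kv_cache, "->", time.time() - start, "s")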

Example test script:

import time

from deepsparse.transformers.pipelines.text_generation import TextGenerationInput
from deepsparse.v2.text_generation.pipeline import TextGenerationPipeline

# same model as in the v1 comparison below
model_path = "hf:neuralmagic/mpt-7b-chat-pruned50-quant"

pipeline = TextGenerationPipeline(
    model_path=model_path,
    engine_kwargs={"engine_type": "deepsparse"},
    internal_kv_cache=False,
)

prompts = [["Hello there!", "The sun shined bright", "The dog barked"]]

input_value = TextGenerationInput(
    prompt=prompts[0],
    generation_kwargs={
        "num_return_sequences": 4,
        "max_new_tokens": 20,
        "do_sample": True,
    },
)
s = time.time()
output = pipeline(input_value)
e = time.time()
print("Total Time", e-s)
for i in output.generations:
    print(i)
    print("\n")

Sample Output:

Total Time 46.2393000125885
[GeneratedText(text=" I'm happy to announce that we've launched a new app for the Android device. Have you heard about", score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' Let me know if you have any questions that I can help with.\nHi and welcome to the website', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' Let me tell you about a website I stumbled upon that I think you will enjoy, http://www.', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' I am writing this blog post to help you make the most out of your next camping trip. You will', score=None, finished=True, finished_reason='max_new_tokens')]

[GeneratedText(text=' and beautiful this past Sunday afternoon, illuminating the world around me.\nI sat down on my porch,', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=', illuminating the dusty dirt roads that led to the small village. The people there had never seen such a', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' upon our faces, the light caressing every inch of our bodies. We walked barefooted through the', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' through the clouds, making everything seem lighter.\nThe sky was a vibrant shade of blue, with wis', score=None, finished=True, finished_reason='max_new_tokens')]

[GeneratedText(text=', and the children laughed and giggled.\nShe laughed and giggled throughout the whole conversation.\n', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' so loudly and repeatedly that the neighbors called 911 to report a disturbance at the home.\n\n“Someone', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=" loud.\nBarking is a common sound in the house, and it's often associated with joy", score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' wildly and the woman cried; "I don\'t know how to live!"\nSomewhere in the wilderness', score=None, finished=True, finished_reason='max_new_tokens')]

Comparing with v1:


import time

from deepsparse import Pipeline
from deepsparse.transformers.pipelines.text_generation import TextGenerationInput

model_path = "hf:neuralmagic/mpt-7b-chat-pruned50-quant"
pipeline = Pipeline.create(
    task="text_generation",
    model_path=model_path,
    engine_type="deepsparse",
    internal_kv_cache=False,
)

prompts = [["Hello there!", "The sun shined bright", "The dog barked"]]
input_value = TextGenerationInput(
    prompt=prompts[0],
    generation_kwargs={
        "num_return_sequences": 4,
        "max_new_tokens": 20,
        "do_sample": True,
    },
)
s = time.time()
output = pipeline(input_value)
e = time.time()
print("Total Time", e-s)
for i in output.generations:
    print(i)
    print("\n")

Sample Output:

Total Time 46.41077995300293
[GeneratedText(text=' If you’re looking for some help with your marketing efforts, I’m here to help! Let', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' Welcome to my homepage!\nHi there! Thanks for dropping by my page, and apologies for the initial', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=" Welcome to the Rhythm Collective's website. We are an independent artist collective based in New York City", score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' We welcome you to visit our web site!\nWe have some exciting news for you, we are going', score=None, finished=True, finished_reason='max_new_tokens')]

[GeneratedText(text=' upon the earth, casting shadows across the landscape. The wind whistled through the trees and carried away leaves', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' on the ocean, and casting its light all over me.\nI thought about life and what it was', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' at the park this morning.\nI went for my first run in the morning and then I met with', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' today. The flowers blossomed, and the trees were painted with an array of colors. The birds chir', score=None, finished=True, finished_reason='max_new_tokens')]

[GeneratedText(text=' at her fiercely.\n“She couldn’t help but be mesmerized by the sight of him', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' at the mailman, which startled him and made him drop the package.\nI was surprised when I', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' at them.\nMeaning #1: The dog gave a loud bark.\nMeaning #2', score=None, finished=True, finished_reason='max_new_tokens'), GeneratedText(text=' for attention and ran for his life. The squirrel leapt away from the fox that came too close', score=None, finished=True, finished_reason='max_new_tokens')]
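
For a quick side-by-side check, the two scripts above can also be folded into one harness that builds both pipelines and times them on the same input. A sketch that only reuses the calls already shown above:

import time

from deepsparse import Pipeline
from deepsparse.transformers.pipelines.text_generation import TextGenerationInput
from deepsparse.v2.text_generation.pipeline import TextGenerationPipeline

model_path = "hf:neuralmagic/mpt-7b-chat-pruned50-quant"

input_value = TextGenerationInput(
    prompt=["Hello there!", "The sun shined bright", "The dog barked"],
    generation_kwargs={
        "num_return_sequences": 4,
        "max_new_tokens": 20,
        "do_sample": True,
    },
)

# external KV cache and the deepsparse engine for both, matching the scripts above
pipelines = {
    "v1": Pipeline.create(
        task="text_generation",
        model_path=model_path,
        engine_type="deepsparse",
        internal_kv_cache=False,
    ),
    "v2": TextGenerationPipeline(
        model_path=model_path,
        engine_kwargs={"engine_type": "deepsparse"},
        internal_kv_cache=False,
    ),
}

for name, pipeline in pipelines.items():
    start = time.time()
    pipeline(input_value)
    print(name, "Total Time", time.time() - start)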