neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[Fix] Appropriate whitespace missing in streaming output for Llama2, Mistral models #1431

Closed · dbogunowicz closed this 9 months ago

dbogunowicz commented 9 months ago

Fix for: https://app.asana.com/0/1205229323407165/1205993428418769/f

Testing

from deepsparse import Pipeline

# model_path = "zoo:opt-1.3b-opt_pretrain-pruned50_quantW8A8"
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned80_quantized"
pipeline = Pipeline.create(task="text-generation", model_path=model_path, sequence_length=64)
generations = pipeline(prompt="Hi, my name is Slim", streaming=True)
print("".join(g.generations[0].text for g in generations))

Before:

.everyoneisamemberofthesamegroup,andthereare10membersinthegroup.
Ifyouareamemberofthegroup,andthereare10membersinthegroup,thenthegroupisdividedinto

Now:

. everyone is a member of the same group, and there are 10 members in the group.
If you are a member of the group, and there are 10 members in the group, then the group is divided into
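For context, a minimal self-contained sketch of why the whitespace goes missing and how a prefix-diff style fix restores it. The `pieces` list and `decode` helper below are hypothetical stand-ins for SentencePiece-style detokenization (as used by Llama2 and Mistral tokenizers), not the actual DeepSparse implementation: decoding each new token in isolation strips the leading word-boundary marker, so spaces between words are lost; decoding the growing sequence and emitting only the new suffix keeps them.

```python
# Hypothetical sketch: "▁" marks a word boundary in SentencePiece pieces.
pieces = ["▁everyone", "▁is", "▁a", "▁member"]

def decode(piece_seq):
    # Stand-in detokenizer: join pieces, turn "▁" into spaces,
    # and strip the leading space, as typical decode() calls do.
    return "".join(piece_seq).replace("▁", " ").lstrip(" ")

# Buggy pattern: decode each new token on its own -> the leading
# "▁" is stripped every time, so all spaces disappear.
buggy = "".join(decode([p]) for p in pieces)
print(buggy)  # everyoneisamember

# Fixed pattern: decode the full sequence so far and emit only the
# suffix that was not yet streamed; interior spaces survive.
prev, fixed, seq = "", "", []
for p in pieces:
    seq.append(p)
    full = decode(seq)
    fixed += full[len(prev):]
    prev = full
print(fixed)  # everyone is a member
```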

Making sure that streaming with multiple prompts at once also works:

from deepsparse import Pipeline
prompts=["Hi, my name is Slim Shady", "Napoleon was"]
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned80_quantized"
pipeline = Pipeline.create(task="text-generation", model_path=model_path, sequence_length=64)

generations_first_prompt_only = list(pipeline(prompt=prompts[0], streaming=True))
generations_second_prompt_only = list(pipeline(prompt=prompts[1], streaming=True))
text_generated_first_prompt_only = "".join([g.generations[0].text for g in generations_first_prompt_only])
text_generated_second_prompt_only = "".join([g.generations[0].text for g in generations_second_prompt_only])
print(f"Text one: {text_generated_first_prompt_only}")
print(f"Text two: {text_generated_second_prompt_only}")

bag_of_words_first_prompt_only = [g.generations[0].text for g in generations_first_prompt_only]
bag_of_words_second_prompt_only = [g.generations[0].text for g in generations_second_prompt_only]

generations = pipeline(prompt=prompts, streaming=True)
bag_of_words = []
for r in generations:
    for gen in r.generations:
        text = gen.text
        bag_of_words.append(text)

assert sorted(bag_of_words_first_prompt_only+bag_of_words_second_prompt_only) == sorted(bag_of_words)