neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[TextGeneration] Fix llama tokenizer #1635

Closed: dsikka closed this pull request 3 months ago

dsikka commented 3 months ago

Tested code:


import deepsparse

# Sparse, quantized Llama-2 7B chat model hosted on the Hugging Face Hub
MODEL_ID = "hf:nm-testing/llama2-7B-sparse70-retrained-ultrachat200k-pruned70-smoothquant-ds"
# MODEL_ID = "zoo:mistral-7b-ultrachat200k_mistral_pretrain-pruned40_quantized"

pipe = deepsparse.Pipeline.create(
    task="text-generation",
    model_path=MODEL_ID,
    sequence_length=512,
    prompt_sequence_length=16,
)

message = "Once upon a time"

# Build a single-turn conversation and render it with the model's chat template
conversation = [{"role": "user", "content": message}]
formatted_conversation = pipe.tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)

generation_config = {
    "max_new_tokens": 100,
}

# Stream tokens back as they are generated
inference = pipe(
    sequences=formatted_conversation,
    generation_config=generation_config,
    streaming=True,
)

for token in inference:
    print(token.generations[0].text, end="")

Output:


There was a time when the world was a different place. A time when people were more accepting of each other and didn't judge based on race, religion, or gender. A time when kindness and compassion were the norm, and hate and prejudice were unheard of.

But then something changed. The world became more divided, and people started to see each other through a
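
For context on what `apply_chat_template` produces before the prompt reaches the pipeline: the actual template is defined in the model's tokenizer config, so the sketch below is only an illustration using the standard Llama-2 `[INST]` markers. The function name and formatting here are hypothetical, not part of the DeepSparse API.

```python
# Illustrative only: a minimal approximation of a Llama-2-style chat
# template, mimicking what tokenizer.apply_chat_template(..., tokenize=False)
# returns for a list of {"role", "content"} messages. The real template
# is read from the model's tokenizer config and may differ.

def apply_llama2_chat_template(conversation):
    """Render a conversation as a Llama-2-style prompt string."""
    parts = []
    for turn in conversation:
        if turn["role"] == "user":
            # User turns are wrapped in instruction markers
            parts.append(f"<s>[INST] {turn['content']} [/INST]")
        elif turn["role"] == "assistant":
            # Assistant turns are appended and closed with an EOS token
            parts.append(f" {turn['content']} </s>")
    return "".join(parts)

print(apply_llama2_chat_template([{"role": "user", "content": "Once upon a time"}]))
# <s>[INST] Once upon a time [/INST]
```

Because `add_generation_prompt=True` is used in the snippet above, the rendered string ends right where the assistant's reply should begin, which is why the model continues the story directly.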