neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

[Text Generation][V2][Fix] Properly digest `max_new_tokens` argument #1403

Closed dbogunowicz closed 11 months ago

dbogunowicz commented 11 months ago

Feature Description

Two arguments specify the number of generated tokens: max_length (which includes the prompt length) and max_new_tokens (which does not), as specified in GenerationDefaults in src.deepsparse.v2.text_generation.process_inputs.py:

class GenerationDefaults:
    num_return_sequences = 1
    max_length = 100
    max_new_tokens = None
    output_scores = False
    top_k = 0
    top_p = 0.0
    repetition_penalty = 0.0
    do_sample = False
    temperature = 1.0

However, the pipeline logic always used the max_length argument and ignored max_new_tokens. With this diff, if max_new_tokens is not None, it takes precedence over max_length.
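A minimal sketch of that precedence, reusing the GenerationDefaults class shown above (the resolve_num_generated_tokens helper below is illustrative only, not the pipeline's actual code):

# Illustrative sketch: max_new_tokens, when set, wins over max_length.
def resolve_num_generated_tokens(generation_config, prompt_length: int) -> int:
    if generation_config.max_new_tokens is not None:
        # max_new_tokens counts only generated tokens, so it is used as-is
        return generation_config.max_new_tokens
    # max_length also counts the prompt tokens, so only the remainder is generated
    return max(0, generation_config.max_length - prompt_length)

config = GenerationDefaults()
print(resolve_num_generated_tokens(config, prompt_length=17))  # 100 - 17 = 83
config.max_new_tokens = 64
print(resolve_num_generated_tokens(config, prompt_length=17))  # 64; max_length ignored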

This is demonstrated by the following snippet:

from deepsparse.v2.text_generation.pipeline import TextGenerationPipeline

model_path = "hf:mgoin/TinyStories-1M-deepsparse"
max_new_tokens = 64
pipeline = TextGenerationPipeline(
    model_path,
    generation_config=dict(max_new_tokens=max_new_tokens),
    force_max_tokens=True,
)
out = pipeline(prompt=["Get more cheese than doritos, cheetos, or fritos"])
print(f"Number of prompt tokens: {len(pipeline.tokenizer.tokenize(out.prompts[0]))}")
print(f"Number of generated tokens: {len(pipeline.tokenizer.tokenize(out.generations[0].text))} versus {max_new_tokens}")

Before:

Number of prompt tokens: 17
Number of generated tokens: 19 versus 64

After:

Number of prompt tokens: 17
Number of generated tokens: 64 versus 64

Testing

All tests in tests.deepsparse.v2.unit and tests.deepsparse.transformers pass.
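For reference, the tests can be run locally with pytest, assuming the dotted module paths above map to directories under tests/:

pytest tests/deepsparse/v2/unit tests/deepsparse/transformers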