neuralmagic / deepsparse

Sparsity-aware deep learning inference runtime for CPUs
https://neuralmagic.com/deepsparse/

Add support for tokenized prompt input for CompletionRequest #1526

Closed mgoin closed 6 months ago

mgoin commented 6 months ago

Found through lm-evaluation-harness, which makes requests to the OpenAI Completions interface using raw token IDs rather than text.

OAI docs reference: https://platform.openai.com/docs/api-reference/completions

> `prompt`: The prompt(s) to generate completions for, encoded as a string, array of strings, array of tokens, or array of token arrays.
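A minimal sketch of how the four prompt forms above could be normalized into one shape per completion. This is a hypothetical helper for illustration, not deepsparse's actual implementation:

```python
from typing import List, Union

PromptType = Union[str, List[str], List[int], List[List[int]]]


def normalize_prompt(prompt: PromptType) -> List[Union[str, List[int]]]:
    # Hypothetical helper: map the four accepted prompt encodings
    # (string, list of strings, list of token ids, list of token-id
    # lists) to a list with one entry per completion to generate.
    if isinstance(prompt, str):
        return [prompt]  # single text prompt
    if not prompt:
        raise ValueError("prompt must not be empty")
    first = prompt[0]
    if isinstance(first, str):
        return list(prompt)  # batch of text prompts
    if isinstance(first, int):
        return [list(prompt)]  # single token-id prompt
    if isinstance(first, list):
        return [list(p) for p in prompt]  # batch of token-id prompts
    raise TypeError(f"unsupported prompt element type: {type(first)}")
```

The ambiguity the spec creates is that a bare `list` can be either a batch of strings or a single token-id prompt, so the sketch dispatches on the type of the first element.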

Example request during gsm8k evaluation:

```python
{'model': 'hf:mgoin/llama-2-7b-gsm8k-pruned60-quant-ds', 'prompt': [[1, 894, 29901, 21828, 18577, 29871, 29929, 29900, 9814, 273, 21425, 322, 29871, 29946, 29900, 28145, 5697, 348, 3173, 393, 9814, 273, 21425, 29889, 1128, 1784, 18281, 947, 540, 8024, 3001, 29973, 13, 22550, 29901]], 'max_tokens': 256, 'stop': ['<|endoftext|>'], 'seed': 1234, 'temperature': 0.0}
```
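One way to support such a request is to decode token-id prompts back to text before handing them to the existing text-generation pipeline. A sketch of that idea, using a stand-in tokenizer (in practice this would be the model's Hugging Face tokenizer and its `decode()` method; the names here are illustrative):

```python
from typing import List, Sequence, Union


class DummyTokenizer:
    # Stand-in for the model tokenizer, used only to exercise the flow;
    # it maps each id to a placeholder token string.
    def decode(self, ids: Sequence[int]) -> str:
        return " ".join(f"<{i}>" for i in ids)


def prompt_to_text(prompt: Union[str, List[int]], tokenizer) -> str:
    # If the prompt is already text, pass it through unchanged; if it is
    # a list of token ids (as in the gsm8k request above), decode it so
    # the text-generation pipeline can consume it.
    if isinstance(prompt, str):
        return prompt
    return tokenizer.decode(prompt)
```

Decoding and re-tokenizing is not always lossless (special tokens and whitespace can round-trip differently), so an alternative is to feed the ids to the engine directly and skip tokenization entirely.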

Server command:

```shell
deepsparse.server --integration openai --task text-generation --model_path hf:mgoin/llama-2-7b-gsm8k-pruned60-quant-ds
```

Eval command (using this branch https://github.com/EleutherAI/lm-evaluation-harness/pull/1277):

```shell
lm_eval --model local-completions --model_args base_url=http://localhost:5543/v1,model=hf:mgoin/llama-2-7b-gsm8k-pruned60-quant-ds,tokenizer_backend=huggingface,tokenizer=mgoin/llama-2-7b-gsm8k-pruned60-quant-ds --tasks gsm8k --num_fewshot 0
```
mgoin commented 6 months ago

> please add a test for this input case

very good call, missed that file
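The requested test might look like the following sketch: detect whether an incoming `prompt` is token ids rather than text, checked against a payload shaped like the gsm8k request above. The helper and test names are hypothetical, not deepsparse's actual test suite:

```python
def is_token_prompt(prompt) -> bool:
    # Hypothetical check: True if the prompt is a list of token ids
    # or a list of token-id lists, False for text prompts.
    if isinstance(prompt, list) and prompt:
        if all(isinstance(i, int) for i in prompt):
            return True
        if all(
            isinstance(p, list) and all(isinstance(i, int) for i in p)
            for p in prompt
        ):
            return True
    return False


def test_tokenized_prompt_detection():
    request = {
        "model": "hf:mgoin/llama-2-7b-gsm8k-pruned60-quant-ds",
        "prompt": [[1, 894, 29901]],  # array of token arrays
        "max_tokens": 256,
    }
    assert is_token_prompt(request["prompt"])
    assert not is_token_prompt("plain text prompt")
    assert not is_token_prompt(["a", "b"])
```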