runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.
MIT License

Only generates 16 tokens #74

Closed lawrenceztang closed 2 months ago

lawrenceztang commented 2 months ago

I tried phi-2 and llama-3-instruct and they both only generate 16 tokens. How can I change this?

lawrenceztang commented 2 months ago

This was confusing to me, but to increase the number of generated tokens (e.g. to 100), you have to pass `max_tokens` inside `sampling_params` and format the request like this:

{ "input": { "prompt": "Hello World", "sampling_params": { "max_tokens": 100 } } }