Closed vamsikrishnav closed 1 year ago
`ALLOWED_MAX_NEW_TOKENS` needs to be passed in the Makefile.
The default value is set to 100.
https://github.com/huggingface/transformers-bloom-inference/blob/4fe1cb9b92bc210538d8b19ee8ff7b9a57dc7382/inference_server/server.py#L36
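For example, assuming the Makefile forwards environment variables to the server process (the target name below is a placeholder, not taken from the repo), the cap could be raised at launch with something like:

```shell
# Hypothetical invocation: override the default cap of 100 for this run.
# Replace <your-target> with the actual Makefile target you use.
ALLOWED_MAX_NEW_TOKENS=512 make <your-target>
```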
I had introduced this parameter because some users were requesting a lot of generated tokens, which was slowing down the service: in the codebase, user queries are not batched but handled sequentially. 🤗
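The effect of such a server-side cap can be sketched as follows (a minimal illustration, not the actual code from `server.py`; the function name is hypothetical):

```python
import os

# Mirrors the ALLOWED_MAX_NEW_TOKENS setting: defaults to 100 when the
# environment variable is not passed in (e.g. via the Makefile).
ALLOWED_MAX_NEW_TOKENS = int(os.getenv("ALLOWED_MAX_NEW_TOKENS", 100))


def clamp_max_new_tokens(requested: int) -> int:
    """Clamp the user's requested generation length to the server cap."""
    return min(requested, ALLOWED_MAX_NEW_TOKENS)


# A request for 500 new tokens is silently capped to 100 unless the
# environment variable was raised, which explains outputs stuck at 100.
print(clamp_max_new_tokens(500))
```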
Whatever the max new tokens size is, the number of output tokens generated is always 100. Please take a look at the few calls I made using curl.