huggingface/transformers-bloom-inference

Fast Inference Solutions for BLOOM
Apache License 2.0

Max tokens generated remains constant whatever the input token size #55

Closed vamsikrishnav closed 1 year ago

vamsikrishnav commented 1 year ago

Whatever max_new_tokens is set to, the number of output tokens generated is always 100. Please take a look at the few calls I made using curl:

❯ curl 'http://myserver.com:8000/generate/' \
  -H 'Content-Type: application/json; charset=UTF-8' \
  --data-raw '{"text":["what is text generation?"],"temperature":1,"top_k":50,"top_p":1,"max_new_tokens":256,"repetition_penalty":1,"do_sample":true,"remove_input_from_output":true}' \
  --compressed \
  --insecure
{"method":"generate","num_generated_tokens":[100],"query_id":54,"text":["\u201d. And after I get into the nitty gritty of how it works.\nThis is a simple way to get into language generation. There are other ways to get into it, but this gets right to the heart of it.\nThere is a book called How to Learn Any Language, that takes a very interesting approach to the whole situation. But it is expensive, so you have to be sure that it is on par with your goals. I would suggest taking a look at it.\nThe best"],"total_time_taken":"10.58 secs"}
❯ curl 'http://myserver.com:8000/generate/' \
  -H 'Content-Type: application/json; charset=UTF-8' \
  --data-raw '{"text":["write hello world program in C++"],"temperature":1,"top_k":50,"top_p":1,"max_new_tokens":256,"repetition_penalty":1,"do_sample":true,"remove_input_from_output":true}' \
  --compressed \
  --insecure
{"method":"generate","num_generated_tokens":[100],"query_id":56,"text":[". In that program I also used a class. I have made some change in it. After that when I am using some commands on that file, it shows me some error. I think error is coming when i am compiling that file. But there is not any error in the program. This program is written with #include iostream class with using namespace std;. So help me. Sorry for my poor English. Please provide me a good solution. Or any new way to make this.\n#include <iostream>\n\n"],"total_time_taken":"10.59 secs"}
❯ curl 'http://myserver.com:8000/generate/' \
  -H 'Content-Type: application/json; charset=UTF-8' \
  --data-raw '{"text":["write hello world program in python"],"temperature":1,"top_k":50,"top_p":1,"max_new_tokens":200,"repetition_penalty":1,"do_sample":true,"remove_input_from_output":true}' \
  --compressed \
  --insecure
{"method":"generate","num_generated_tokens":[100],"query_id":57,"text":["\nimport os, sys\nimport time as time\nfrom random import *\nn=10\n\ndef random(n):\n    x = int(random() * n)\n    return x\n\ndef main():\n    global n\n    n = random(n)\n    print n\n    print random(n)\n    print n\n\nmain()\n\nThe script is about creating a random number, using rand() and the time() object.\nCan someone help me with this problem? Thanks in advance.\nIt needs to be done"],"total_time_taken":"10.58 secs"}
mayank31398 commented 1 year ago

ALLOWED_MAX_NEW_TOKENS needs to be passed in the Makefile. The default value is set to 100: https://github.com/huggingface/transformers-bloom-inference/blob/4fe1cb9b92bc210538d8b19ee8ff7b9a57dc7382/inference_server/server.py#L36
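
For reference, the behaviour in the responses above (num_generated_tokens always 100 even with max_new_tokens=256) is exactly what a server-side cap of this kind produces. The following is only a rough sketch of the idea, not the actual code from inference_server/server.py; apart from the ALLOWED_MAX_NEW_TOKENS environment variable, the names are illustrative:

import os

# The cap is read from an environment variable and falls back to 100
# generated tokens when it is not set (illustrative sketch only).
ALLOWED_MAX_NEW_TOKENS = int(os.getenv("ALLOWED_MAX_NEW_TOKENS", "100"))

def clamp_max_new_tokens(requested: int) -> int:
    # Whatever max_new_tokens the client sends in the request body, the
    # server never generates more than ALLOWED_MAX_NEW_TOKENS tokens, which
    # is why max_new_tokens=256 still yields num_generated_tokens == 100
    # with the default setting.
    return min(requested, ALLOWED_MAX_NEW_TOKENS)

Setting ALLOWED_MAX_NEW_TOKENS to a larger value when launching the server (for example through the Makefile target you use) should raise the cap accordingly.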

mayank31398 commented 1 year ago

I had introduced this parameter because some users were requesting a lot of generated tokens, which was slowing down the service: in this codebase, user queries are not batched together and are processed sequentially. 🤗