Open pratikkotian04 opened 6 months ago
@pratikkotian04 https://github.com/huggingface/transformers/issues/17117 maybe can help?
@pratikkotian04 In this code
outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)
You can split the generated string at ' ### Response:'. This won't save any time, though, since the model still generates and decodes the full sequence.
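For reference, a minimal sketch of that split approach (assuming an Alpaca-style prompt with a literal "### Response:" marker, and outputs / tokenizer as in the snippet above):
full_text = tokenizer.batch_decode(outputs)[0]
# Keep only the text after the response marker; generation itself is unchanged
response_only = full_text.split("### Response:")[-1].strip()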
@danielhanchen's linked GitHub issue is indeed helpful. Sharing example code here for posterity:
inputs = tokenizer(
    [
        template.format(<your inputs here>)
    ], return_tensors = "pt").to("cuda")
gen_tokens = model.generate(**inputs, max_new_tokens = 800, use_cache = False)
# Slice off the prompt tokens so only the newly generated tokens are decoded
outs = tokenizer.batch_decode(gen_tokens[:, inputs.input_ids.shape[1]:])[0]
# Strip the EOS token from the decoded text
response_only = outs.replace(tokenizer.eos_token, "")
Oh wait, actually if you're using TextStreamer, you can use skip_prompt as in our Ollama notebook: https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing
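For reference, a minimal sketch of what that looks like (assuming inputs is prepared as in the snippets above):
from transformers import TextStreamer

# skip_prompt = True streams only the newly generated tokens,
# so the instruction/input part of the prompt is not echoed back
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)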
Thanks, it helped!
Below is the code I am using, but it generates output that includes the instruction and input as well.
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
    [
        alpaca_prompt.format(
            "Continue the fibonnaci sequence.", # instruction
            "1, 1, 2, 3, 5, 8", # input
            "", # output - leave this blank for generation!
        )
    ], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)
How do I modify this to output only the response?