unslothai / unsloth

Finetune Llama 3, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
12.19k stars 792 forks

How do I output only the response and not the Instructions and Input #335

Open pratikkotian04 opened 2 months ago

pratikkotian04 commented 2 months ago

Below is the code I am using, but it generates output that includes the instructions and input.

alpaca_prompt = ... # copied from above

FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Continue the fibonnaci sequence.", # instruction
        "1, 1, 2, 3, 5, 8", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

How do I modify this to output only the response?

danielhanchen commented 2 months ago

@pratikkotian04 https://github.com/huggingface/transformers/issues/17117 maybe this can help?

erwe324 commented 2 months ago

@pratikkotian04 In this code

outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)

You can split the generated string at '### Response:'. This will not, however, save any generation time.
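For illustration, a minimal sketch of that post-processing, assuming the Alpaca-style prompt and the `inputs` from above (so the '### Response:' marker appears once in the decoded text):

outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
decoded = tokenizer.batch_decode(outputs)[0]  # full text: instruction + input + response
response_only = decoded.split("### Response:")[-1].strip()  # keep only the part after the marker
print(response_only)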

kmahorker commented 1 day ago

@danielhanchen's linked GH issue is indeed helpful. Sharing example code here for posterity:

inputs = tokenizer(
[
    template.format(<your inputs here>)
], return_tensors = "pt").to("cuda")

gen_tokens = model.generate(**inputs, max_new_tokens = 800, use_cache = False)
# Slice off the prompt: keep only the tokens generated after the input sequence.
outs = tokenizer.batch_decode(gen_tokens[:, inputs.input_ids.shape[1]:])[0]
# Strip the trailing end-of-sequence token from the decoded text.
response_only = outs.replace(tokenizer.eos_token, "")

danielhanchen commented 1 day ago

Oh wait actually if you're using TextStreamer, you can use skip_prompt as in our Ollama notebook: https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing
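For reference, a minimal sketch of that streaming approach, assuming the same model and `inputs` as above (skip_prompt is a standard TextStreamer argument in transformers):

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer, skip_prompt = True)  # skip_prompt = True suppresses the prompt tokens in the stream
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)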