unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Multiple Generation Similar to Huggingface `num_return_sequences` #959

Open ankitprezent opened 2 months ago

ankitprezent commented 2 months ago

I want to generate multiple outputs from a single prompt. Is there any way I can get multiple generations from a fine-tuned Llama 3.1 model, similar to what `num_return_sequences` does in Hugging Face?

Hugging Face code:

beam_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True,
)

print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))

danielhanchen commented 2 months ago

I'll need to get back to you - I think multi-sequence generation might still be bugged, but I'm not sure yet - let me check and get back to you.

scuty2000 commented 2 months ago

Hi, I have something on this.

When using Llama-based models (e.g. llama-3.1, codellama), it works for both chat and completion, as in the example below:

match model_type:
    case "chat":
        messages_batch = []
        for input_text in input_texts:
            messages = [
                {"role": "user", "content": CHAT_PROMPT.format(input_text)},
            ]
            messages_batch.append(messages)
        inputs = tokenizer.apply_chat_template(
            messages_batch,
            tokenize=True,
            padding=True,
            add_generation_prompt=True,
            return_tensors="pt",
        ).to("cuda")
        outputs = model.generate(
            inputs,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            temperature=temperature,
            return_dict_in_generate=True,
            output_scores=True,
            top_k=50,
            do_sample=True,
            num_return_sequences=num_return_sequences,
            eos_token_id=tokenizer.encode(stop_token)[-1],
        )
        input_length = inputs.size(1)
        generated_tokens = outputs.sequences[:, input_length:]
    case "completion":
        inputs = tokenizer(
            [COMMAND_PROMPT.format(text) for text in input_texts],
            return_tensors="pt",
            padding=True,
        ).to("cuda")
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            temperature=temperature,
            return_dict_in_generate=True,
            output_scores=True,
            top_k=50,
            do_sample=True,
            num_return_sequences=num_return_sequences,
            eos_token_id=tokenizer.encode(stop_token)[-1],
        )
        generated_tokens = outputs.sequences[:, inputs['input_ids'].size(-1):]
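As a side note, `generate` with `num_return_sequences > 1` returns `batch_size * num_return_sequences` rows grouped per input, so the candidates can be regrouped per prompt with something like this (just a sketch, reusing the same variable names as above):

```python
# Row order is [prompt0_seq0, ..., prompt0_seqN-1, prompt1_seq0, ...]
decoded = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
grouped = [
    decoded[i * num_return_sequences : (i + 1) * num_return_sequences]
    for i in range(len(input_texts))
]
for prompt, candidates in zip(input_texts, grouped):
    print(prompt)
    for j, candidate in enumerate(candidates):
        print(f"  {j}: {candidate}")
```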

With Gemma, however, it doesn't work at all (same code as above):

E0828 21:39:05.623000 126272167671616 torch/_subclasses/fake_tensor.py:1761] [1/102] RuntimeError: Attempting to broadcast a dimension of length s10 at -4! Mismatching argument at index 1 had torch.Size([s10, 1, s12, s10]); but expected shape should be broadcastable to [s4, s2*s3, s12, s10]

It seems this is due to a bug in the transformers library that SHOULD have been fixed 15 days ago. I don't know whether you need to pick up that change on your side; here is the thread I found: https://huggingface.co/google/gemma-2-27b-it/discussions/28
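If that fix has actually landed in a release, the first thing I would check before retrying is the installed transformers version (I don't know which exact version contains the fix):

```python
import transformers

# If this predates the Gemma-2 broadcast fix mentioned in the linked thread,
# upgrading transformers may resolve the error.
print(transformers.__version__)
```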

Hope this helps!