Open ankitprezent opened 2 months ago
I'll need to get back to you on this - I think multi-sequence generations might still be bugged, but I'm not sure yet.
Hi, I have some information about this.
When using Llama-based models (e.g. llama-3.1, codellama), it works for both chat and completion, as in the example below:
```python
match model_type:
    case "chat":
        # Build one chat-formatted message list per input text
        messages_batch = []
        for input_text in input_texts:
            messages = [
                {"role": "user", "content": CHAT_PROMPT.format(input_text)},
            ]
            messages_batch.append(messages)
        inputs = tokenizer.apply_chat_template(
            messages_batch,
            tokenize=True,
            padding=True,
            add_generation_prompt=True,
            return_tensors="pt",
        ).to("cuda")
        outputs = model.generate(
            inputs,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            temperature=temperature,
            return_dict_in_generate=True,
            output_scores=True,
            top_k=50,
            do_sample=True,
            num_return_sequences=num_return_sequences,
            eos_token_id=tokenizer.encode(stop_token)[-1],
        )
        # Strip the prompt tokens so only the generated continuation remains
        input_length = inputs.size(1)
        generated_tokens = outputs.sequences[:, input_length:]
    case "completion":
        inputs = tokenizer(
            [COMMAND_PROMPT.format(text) for text in input_texts],
            return_tensors="pt",
            padding=True,
        ).to("cuda")
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            use_cache=True,
            temperature=temperature,
            return_dict_in_generate=True,
            output_scores=True,
            top_k=50,
            do_sample=True,
            num_return_sequences=num_return_sequences,
            eos_token_id=tokenizer.encode(stop_token)[-1],
        )
        generated_tokens = outputs.sequences[:, inputs['input_ids'].size(-1):]
```
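One more note in case it's useful: as far as I can tell, with `num_return_sequences > 1` the extra samples are stacked along the batch dimension, with each prompt's samples in consecutive rows, so I regroup the decoded outputs per input. A minimal sketch using the same variables as above (`tokenizer`, `generated_tokens`, `input_texts`, `num_return_sequences`):

```python
# Decode everything, then regroup: row i*num_return_sequences + j should be the
# j-th sample for the i-th prompt (sketch, same variables as in the code above).
decoded = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
grouped = [
    decoded[i * num_return_sequences : (i + 1) * num_return_sequences]
    for i in range(len(input_texts))
]
for text, samples in zip(input_texts, grouped):
    print(f"Prompt: {text!r}")
    for j, sample in enumerate(samples):
        print(f"  [{j}] {sample}")
```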
With Gemma, however, it doesn't work at all (same code as above):

```
E0828 21:39:05.623000 126272167671616 torch/_subclasses/fake_tensor.py:1761] [1/102] RuntimeError: Attempting to broadcast a dimension of length s10 at -4! Mismatching argument at index 1 had torch.Size([s10, 1, s12, s10]); but expected shape should be broadcastable to [s4, s2*s3, s12, s10]
```
This seems to be due to a bug in the transformers library that should have been fixed about 15 days ago. I don't know whether you need to pick up that change on your side or not; here is the thread I found: https://huggingface.co/google/gemma-2-27b-it/discussions/28
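For reference, a quick way to check which version is installed before retrying (just a sketch; I'm assuming the fix from that thread ships in a newer transformers release):

```python
# Print the installed transformers version; if it predates the fix from the
# linked thread, upgrade (e.g. `pip install -U transformers`) and retry Gemma.
import transformers

print("transformers version:", transformers.__version__)
```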
Hope this helps!
I want to generate multiple outputs from a single prompt. Is there any way I can get multiple generations from a fine-tuned Llama-3.1 model, similar to what `num_return_sequences` does in Hugging Face?

Hugging Face code:
```python
beam_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    num_beams=5,
    no_repeat_ngram_size=2,
    num_return_sequences=5,
    early_stopping=True,
)

print("Output:\n" + 100 * '-')
for i, beam_output in enumerate(beam_outputs):
    print("{}: {}".format(i, tokenizer.decode(beam_output, skip_special_tokens=True)))
```