tloen / alpaca-lora

Instruct-tune LLaMA on consumer hardware
Apache License 2.0

Possible bugs when using generate_response for batched inference #589

Open binhmed2lab opened 1 year ago

binhmed2lab commented 1 year ago

It took me a few days to figure out what was wrong when evaluating the trained LoRA.

```python
import torch
from transformers import GenerationConfig

def generate_response(prompt, model, temperature=0.1, num_beams=1, top_k=50, repetition_penalty=1):
    encoding = tokenizer(prompt, padding=True, truncation=True, return_tensors="pt", max_length=1024)
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)  # I just added this

    generation_config = GenerationConfig(
        temperature=temperature,
        top_p=1,
        do_sample=True,
        num_beams=num_beams,
        top_k=top_k,
        repetition_penalty=repetition_penalty,
    )
    with torch.inference_mode():
        return model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            generation_config=generation_config,
            return_dict_in_generate=True,
            output_scores=True,
            max_new_tokens=512,
        )
```
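Roughly how I call it for a batch of prompts (a minimal sketch; the prompt strings are placeholders and `model`, `tokenizer`, and `device` are whatever you already load in your own script):

```python
# Minimal usage sketch: pass a list of prompts so the tokenizer pads them into one batch.
prompts = [
    "### Instruction:\nList three primary colors.\n\n### Response:\n",
    "### Instruction:\nWhat is the capital of France?\n\n### Response:\n",
]

outputs = generate_response(prompts, model)
for seq in outputs.sequences:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```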

The reason is that, if attention_mask is not provided, the padding tokens are still attended to and contribute to the generated output.
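For the attention mask to behave well with a decoder-only model, the batch should also be left-padded so no padding tokens sit between the prompt and the newly generated tokens. A sketch of the tokenizer setup I mean (the pad token id is an assumption; match it to whatever your training script used):

```python
# Assumed tokenizer setup for batched generation with a decoder-only model.
# Left padding keeps the real prompt tokens adjacent to the generated tokens,
# and LLaMA's tokenizer defines no pad token by default, so one must be set.
tokenizer.padding_side = "left"
tokenizer.pad_token_id = 0  # assumption: same pad id as in the alpaca-lora finetune script
```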