huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

llama3 position_ids error with left padding #33095

Open ummagumm-a opened 3 weeks ago

ummagumm-a commented 3 weeks ago

Feature request

The LLaMA 3 implementation should generate default position_ids that take the attention_mask into account.

@ArthurZucker @younesbelkada

Motivation

Is there a specific reason why the default position_ids generation doesn’t consider the attention_mask? A friend mentioned that this issue has persisted for almost half a year now.

https://github.com/huggingface/transformers/blob/adb91179b9e867b7278e0130c87558974056c7b4/src/transformers/models/llama/modeling_llama.py#L962

The problem arises when using left padding: the default position_ids count from the beginning of the sequence rather than from the first non-padding token indicated by the attention_mask.

As far as I know, this is handled correctly in the generate function, but I would expect consistency during training as well. https://discuss.huggingface.co/t/llama-position-ids/75870
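To make the discrepancy concrete, here is a minimal sketch of my own (a toy left-padded batch; the "mask-aware" variant reflects what I understand generate to compute, not code copied from the library):

import torch

# Toy left-padded batch: two pad tokens followed by three real tokens.
attention_mask = torch.tensor([[0, 0, 1, 1, 1]])

# Default behaviour today (roughly torch.arange over the sequence length):
default_position_ids = torch.arange(attention_mask.shape[1]).unsqueeze(0)
print(default_position_ids)  # tensor([[0, 1, 2, 3, 4]])

# Mask-aware behaviour (what I understand generate to do for left padding):
mask_aware = attention_mask.cumsum(-1) - 1
mask_aware = mask_aware.masked_fill(attention_mask == 0, 1)
print(mask_aware)  # tensor([[1, 1, 0, 1, 2]])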

Your contribution

I can submit a PR changing

https://github.com/huggingface/transformers/blob/adb91179b9e867b7278e0130c87558974056c7b4/src/transformers/models/llama/modeling_llama.py#L963

to

# Derive padding-aware positions: count from the first attended token,
# and fill padded (attention_mask == 0) positions with 0.
position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)
position_ids.masked_fill_(attention_mask == 0, 0)

if no one has any objections to the correctness of that change. Otherwise, let's discuss.
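For reference, here is a quick standalone check I wrote (my own sketch, not part of the linked code) that the proposed expression assigns the same positions to attended tokens as the mask-aware computation I believe generate uses; the two only differ in the fill value for padded positions, which should not matter since those positions are masked out anyway:

import torch

attention_mask = torch.tensor([[0, 0, 1, 1, 1],
                               [0, 1, 1, 1, 1]])

# Proposal above: padded positions are filled with 0.
proposed = (attention_mask.cumsum(-1) - 1).clamp(min=0)
proposed = proposed.masked_fill(attention_mask == 0, 0)

# Mask-aware variant (as I understand generate's handling): pads filled with 1.
generate_style = attention_mask.cumsum(-1) - 1
generate_style = generate_style.masked_fill(attention_mask == 0, 1)

# Positions of attended tokens agree; only the padded positions differ.
keep = attention_mask.bool()
assert torch.equal(proposed[keep], generate_style[keep])
print(proposed)        # tensor([[0, 0, 0, 1, 2], [0, 0, 1, 2, 3]])
print(generate_style)  # tensor([[1, 1, 0, 1, 2], [1, 0, 1, 2, 3]])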

AlekseyKorshuk commented 3 weeks ago

+1, I have the same problem and need a solution.

ArthurZucker commented 2 weeks ago

Hey! Feel free to open a PR; this should only be activated when training. I am not certain, but we try to avoid handling too many cases, as it bloats the codebase. Also, in general, when you train you have a lot more control over what you want to achieve with the position_ids.

It's the first time I have seen a report about this, TBH 🤗