huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RoBERTa is not well implemented for tokenizers with pad_token_id != 1 #34528

Open dtamayo-nlp opened 4 weeks ago

dtamayo-nlp commented 4 weeks ago

Feature request

I would like to request that RoBERTa models correctly support tokenizers with pad_token_id != 1. This limitation is inherited from the fairseq code.

Motivation

Problem Definition

The current implementation of RoBERTa in transformers assumes that the tokenizer has pad_token_id = 1:
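For reference, the relevant helper in modeling_roberta.py looks roughly like this (paraphrased; the exact code may vary between versions):

```python
import torch

def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    # Non-padding tokens get positions padding_idx + 1, padding_idx + 2, ...;
    # padding tokens are pinned to position padding_idx (adapted from fairseq's make_positions).
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    return incremental_indices.long() + padding_idx
```

With the original RoBERTa vocabulary (pad_token_id = 1) this is harmless, but with any other pad id every position is shifted by that id, which no longer lines up with a position-embedding table sized around pad_token_id = 1.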

Motivation

We have pre-trained a RoBERTa model from scratch with a different tokenizer and need a small change to the current implementation for it to work correctly. The changes are minimal and confined to create_position_ids_from_input_ids: we only need to replace the + padding_idx offset with a fixed + 1 when computing the incremental positions. Because the original RoBERTa uses pad_token_id = 1, this change does not alter the behavior of existing RoBERTa models.
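Concretely, the version I have in mind looks roughly like this (a minimal sketch for the helper shown above; only the final offset changes, so with pad_token_id = 1 the output is identical to today's code):

```python
import torch

def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    # Proposed: offset by a fixed 1 rather than by padding_idx, so the position ids
    # no longer depend on the tokenizer's pad_token_id.
    return incremental_indices.long() + 1
```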

Your contribution

I can submit a PR with the modifications if you agree with the incorporation of this change.

Rocketknight1 commented 4 weeks ago

Yes, the RoBERTa code is quite idiosyncratic, and it inherits some strange shortcuts from fairseq. The position_ids interact with the padding token idx in a very strange way that only really makes sense when it's a fixed value.
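A quick illustration of that coupling (assuming a recent version where the helper is importable from modeling_roberta), using a toy batch with no padding at all:

```python
import torch
from transformers.models.roberta.modeling_roberta import create_position_ids_from_input_ids

ids = torch.tensor([[0, 31414, 232, 2]])  # short sequence, no padding tokens

print(create_position_ids_from_input_ids(ids, padding_idx=1))
# tensor([[2, 3, 4, 5]])
print(create_position_ids_from_input_ids(ids, padding_idx=40))
# tensor([[41, 42, 43, 44]])
```

Every position is offset by the pad id, so the learned position embeddings (and max_position_embeddings) only line up when that id is the fixed value the model was written around.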

I think we could accept this PR, but we'd need:

1) Equivalent modifications for the TF, FLAX and PT files
2) Regression tests to make sure nothing breaks (e.g. along the lines of the sketch below)!
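For (2), something along these lines would probably be enough to pin down the legacy pad_token_id = 1 behaviour (a rough sketch; the test name and placement are placeholders):

```python
import torch
from transformers.models.roberta.modeling_roberta import create_position_ids_from_input_ids

def test_create_position_ids_matches_legacy_behaviour():
    # Two trailing padding tokens (id 1); with the historical padding_idx = 1 the
    # refactored code must reproduce today's position ids exactly.
    input_ids = torch.tensor([[0, 31414, 232, 2, 1, 1]])
    expected = torch.tensor([[2, 3, 4, 5, 1, 1]])
    assert torch.equal(create_position_ids_from_input_ids(input_ids, padding_idx=1), expected)
```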

cc @ArthurZucker @LysandreJik in case they have objections to changing older code - we might have to reject this PR because of the risk of damaging backward compatibility

ArthurZucker commented 4 days ago

TBH I don't even mind only fixing the PT path! 🤗