huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RoBERTa is not well implemented for tokenizers with pad_token_id != 1 #34528

Open dtamayo-nlp opened 4 weeks ago

dtamayo-nlp commented 4 weeks ago

Feature request

I would like to request that RoBERTa models correctly support tokenizers with pad_token_id != 1. This limitation is inherited from the fairseq code.

Motivation

Problem Definition

The current implementation of RoBERTa in transformers assumes that the tokenizer has pad_token_id = 1:
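For reference, the relevant helper in modeling_roberta.py looks roughly like this (paraphrased; the exact code may vary between versions):

```python
import torch

def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    # Non-padding tokens get positions padding_idx + 1, padding_idx + 2, ...;
    # padding tokens are pinned to position padding_idx (adapted from fairseq's make_positions).
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    return incremental_indices.long() + padding_idx
```

With the original RoBERTa vocabulary (pad_token_id = 1) this is harmless, but with any other pad id every position is shifted by that id, which no longer lines up with a position-embedding table sized around pad_token_id = 1.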

Motivation

We have pre-trained a RoBERTa model from scratch with a different tokenizer and need a small change to the current implementation for it to work correctly. The changes are minimal and confined to create_position_ids_from_input_ids: we only need to replace the + padding_idx offset with a fixed + 1 when computing the incremental positions. Because the original RoBERTa uses pad_token_id = 1, this change does not alter the behavior of existing RoBERTa models.
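Concretely, the version I have in mind looks roughly like this (a minimal sketch for the helper shown above; only the final offset changes, so with pad_token_id = 1 the output is identical to today's code):

```python
import torch

def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    # Proposed: offset by a fixed 1 rather than by padding_idx, so the position ids
    # no longer depend on the tokenizer's pad_token_id.
    return incremental_indices.long() + 1
```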

Your contribution

I can submit a PR with the modifications if you agree with the incorporation of this change.

Rocketknight1 commented 4 weeks ago

Yes, the RoBERTa code is quite idiosyncratic, and it inherits some strange shortcuts from fairseq. The position_ids interact with the padding token idx in a very strange way that only really makes sense when it's a fixed value.
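A quick illustration of that coupling (assuming a recent version where the helper is importable from modeling_roberta), using a toy batch with no padding at all:

```python
import torch
from transformers.models.roberta.modeling_roberta import create_position_ids_from_input_ids

ids = torch.tensor([[0, 31414, 232, 2]])  # short sequence, no padding tokens

print(create_position_ids_from_input_ids(ids, padding_idx=1))
# tensor([[2, 3, 4, 5]])
print(create_position_ids_from_input_ids(ids, padding_idx=40))
# tensor([[41, 42, 43, 44]])
```

Every position is offset by the pad id, so the learned position embeddings (and max_position_embeddings) only line up when that id is the fixed value the model was written around.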

I think we could accept this PR, but we'd need:

1) Equivalent modifications for the TF, FLAX and PT files
2) Regression tests to make sure nothing breaks (e.g. along the lines of the sketch below)!
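For (2), something along these lines would probably be enough to pin down the legacy pad_token_id = 1 behaviour (a rough sketch; the test name and placement are placeholders):

```python
import torch
from transformers.models.roberta.modeling_roberta import create_position_ids_from_input_ids

def test_create_position_ids_matches_legacy_behaviour():
    # Two trailing padding tokens (id 1); with the historical padding_idx = 1 the
    # refactored code must reproduce today's position ids exactly.
    input_ids = torch.tensor([[0, 31414, 232, 2, 1, 1]])
    expected = torch.tensor([[2, 3, 4, 5, 1, 1]])
    assert torch.equal(create_position_ids_from_input_ids(input_ids, padding_idx=1), expected)
```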

cc @ArthurZucker @LysandreJik in case they have objections to changing older code - we might have to reject this PR because of the risk of damaging backward compatibility

ArthurZucker commented 4 days ago

TBH I don't even mind only fixing the PT path! 🤗