Open · dtamayo-nlp opened this issue 4 weeks ago
Yes, the RoBERTa code is quite idiosyncratic, and it inherits some strange shortcuts from fairseq. The position_ids
interact with the padding token idx in a very strange way that only really makes sense when it's a fixed value.
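For context, the PT helper currently looks roughly like this (paraphrased from modeling_roberta.py, so the exact code may differ between versions):

```python
import torch

def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    # Real tokens get positions counted from padding_idx + 1; padding tokens keep padding_idx.
    # The "+ padding_idx" offset is the fairseq shortcut that only makes sense when padding_idx == 1.
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    return incremental_indices.long() + padding_idx

print(create_position_ids_from_input_ids(torch.tensor([[0, 5, 6, 7, 1, 1]]), padding_idx=1))
# tensor([[2, 3, 4, 5, 1, 1]])
print(create_position_ids_from_input_ids(torch.tensor([[0, 5, 6, 7, 3, 3]]), padding_idx=3))
# tensor([[4, 5, 6, 7, 3, 3]])
```

With padding_idx = 1 the real tokens get positions 2, 3, ..., which is what the pretrained position embeddings expect; with any other pad value the whole position sequence shifts.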
I think we could accept this PR, but we'd need:
1) Equivalent modifications for the TF, FLAX and PT files
2) Regression tests to make sure nothing breaks (something along the lines of the sketch below)!
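A rough idea of what such a regression test could look like (illustrative values, and assuming the module-level helper in modeling_roberta.py; not the actual test-suite layout):

```python
import torch
from transformers.models.roberta.modeling_roberta import create_position_ids_from_input_ids


def test_position_ids_default_padding_idx():
    # Regression check: behaviour with the default pad_token_id == 1 must stay identical.
    input_ids = torch.tensor([[0, 5, 6, 7, 1, 1]])
    expected = torch.tensor([[2, 3, 4, 5, 1, 1]])
    assert torch.equal(create_position_ids_from_input_ids(input_ids, padding_idx=1), expected)


def test_position_ids_custom_padding_idx():
    # Hypothetical tokenizer whose pad_token_id is 3 instead of 1.
    input_ids = torch.tensor([[0, 5, 6, 7, 3, 3]])
    position_ids = create_position_ids_from_input_ids(input_ids, padding_idx=3)
    non_pad = position_ids[0, :4]
    # Real tokens should get consecutive positions no matter what the pad value is.
    assert torch.equal(non_pad - non_pad[0], torch.arange(4))
```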
cc @ArthurZucker @LysandreJik in case they have objections to changing older code - we might have to reject this PR because of the risk of damaging backward compatibility
TBH I don't even mind only fixing the PT path! 🤗
Feature request
I would like to request that RoBERTa models correctly accept tokenizers with pad_token_id != 1. This problem is inherited from the fairseq code.

Motivation
Problem Definition
The current implementation of RoBERTa in transformers assumes that the tokenizer has pad_token_id = 1: create_position_ids_from_input_ids indirectly relies on pad_token_id = 1 in all the classes (modeling_roberta.py, modeling_tf_roberta.py and modeling_flax_roberta.py).

Motivation
We have pre-trained a RoBERTa from scratch with another tokenizer and need to slightly change the current implementation for it to work correctly. The changes are minimal in create_position_ids_from_input_ids: we just need to change the + padding_idx terms to + 1 in the incremental positions, as in the sketch below. This change will not affect the original implementations of RoBERTa.
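A minimal sketch of what I have in mind for the PT path, assuming the module-level helper in modeling_roberta.py (the TF and FLAX helpers would need the analogous edit); this is a sketch, not a final implementation:

```python
import torch

def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    # Proposed: offset by a constant 1 instead of padding_idx, so the position ids
    # (pad -> 1, first real token -> 2, ...) no longer depend on the tokenizer's pad_token_id.
    return incremental_indices.long() + 1  # was: incremental_indices.long() + padding_idx
```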
Your contribution
I can submit a PR with the modifications if you agree to incorporate this change.