Closed · abuelnasr0 closed this issue 1 year ago
@abuelnasr0 Thanks for opening the issue!
@mattdangerw Matt, I think the code is intentional, but shall we add more details to the docstring? It seems we are doing some special handling of special tokens; is that a quirk of the SentencePiece proto?
What's the issue? I suppose you are talking about the re-mapping of tokens.
Please go through these comments:
You can go through this too, to understand what's going on: https://github.com/huggingface/transformers/blob/main/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py#L171
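For anyone reading along, here is a minimal sketch of the re-mapping described in that HuggingFace tokenizer (this mirrors the logic in the linked file rather than KerasNLP's exact code, and the proto path is just an illustrative placeholder): fairseq reserves ids 0-3 for `<s>`, `<pad>`, `</s>`, `<unk>`, every SentencePiece id is shifted up by one, and `<mask>` is appended after the vocabulary.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("sentencepiece.bpe.model")  # hypothetical path to the XLM-R proto

# fairseq reserves the first four ids for its own special tokens.
fairseq_special = {"<s>": 0, "<pad>": 1, "</s>": 2, "<unk>": 3}
fairseq_offset = 1  # every SentencePiece id is shifted up by one

def token_to_id(token):
    if token in fairseq_special:
        return fairseq_special[token]
    spm_id = sp.PieceToId(token)
    # PieceToId returns 0 (SentencePiece's <unk>) for unknown pieces.
    return spm_id + fairseq_offset if spm_id else fairseq_special["<unk>"]

# "<mask>" is not in the proto at all; it gets the id right after the
# shifted vocabulary.
mask_id = sp.GetPieceSize() + fairseq_offset
```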
@abheesht17 Thanks! For people reading the code, it is arguably worth adding a comment like "This tokenizer overrides more methods than usual because it needs to remap special tokens."
@abheesht17 the comments about re-mapping were vague to me, but the huggingface link clarified it. Thanks!
@chenmoneygithub I will add the comment to the code.
I think we can go ahead and close this one, as the tokenizer is working as intended.
@abuelnasr0 you are welcome to go straight to a PR if you have some comment clarifications you would like to add to the implementation!
But in terms of requirements for XLMRobertaTokenizer, we should support loading the SentencePiece proto models unaltered from the original project, with all the "hacks" on top to support their tokenization flow.
99% of users will just be using the pre-trained SentencePiece proto files here. If a user really did want to train a model from scratch with a vanilla SentencePiece model, that should already be well supported.
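To make that concrete, here is a hedged sketch of both paths (the preset name and the `proto` argument reflect my reading of the KerasNLP API; the vanilla model path is hypothetical):

```python
import keras_nlp

# Common case: load the original XLM-R SentencePiece proto unaltered,
# with the fairseq-style special token handling applied on top.
tokenizer = keras_nlp.models.XLMRobertaTokenizer.from_preset(
    "xlm_roberta_base_multi"
)

# From-scratch case: pass a vanilla SentencePiece model trained by the
# user directly via the `proto` argument.
tokenizer = keras_nlp.models.XLMRobertaTokenizer(
    proto="my_vanilla_spm.model"  # hypothetical path
)
```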
This is the "fun" of pre-trained model support: even if a preprocessing behavior is buggy or incorrect, you have to recreate the preprocessing used for pre-training exactly to get good results for fine-tuning and inference.
All model tokenizers follow the same pattern except for XLMRoberta, and I think this implementation may cause problems. I don't know if it was implemented like that on purpose, but if not, I would like to re-implement it.