keras-team / keras-hub

Pretrained model hub for Keras 3
Apache License 2.0

Re-implement XlmRoberta tokenizer #930

Closed · abuelnasr0 closed this issue 1 year ago

abuelnasr0 commented 1 year ago

All of the model tokenizers follow the same pattern except for XlmRoberta, and I think this implementation may cause problems. I don't know if it was implemented like that on purpose, but if not, I would like to re-implement it.

chenmoneygithub commented 1 year ago

@abuelnasr0 Thanks for opening the issue!

@mattdangerw Matt, I think the code is intentional, but shall we add more detail to the docstring? It seems we are doing some special handling of special tokens; is that working around a quirk of the SPE proto?

abheesht17 commented 1 year ago

What's the issue? I suppose you are talking about the re-mapping of tokens.

Please go through these comments:

https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/xlm_roberta/xlm_roberta_tokenizer.py#L38

https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/xlm_roberta/xlm_roberta_tokenizer.py#L134

https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/xlm_roberta/xlm_roberta_tokenizer.py#L149

https://github.com/keras-team/keras-nlp/blob/master/keras_nlp/models/xlm_roberta/xlm_roberta_tokenizer.py#L157

abheesht17 commented 1 year ago

You can go through this too, to understand what's going on: https://github.com/huggingface/transformers/blob/main/src/transformers/models/xlm_roberta/tokenization_xlm_roberta.py#L171
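For what it's worth, the id remapping in that Hugging Face reference boils down to roughly the following. This is a minimal sketch for illustration only, not the actual keras_nlp or HF code; `spm_to_xlmr_id` and `FAIRSEQ_OFFSET` are hypothetical names:

```python
# Sketch of the fairseq-style remapping XLM-RoBERTa layers on top of its
# SentencePiece model (simplified from the HF tokenizer linked above).
# SentencePiece reserves:   <unk>=0, <s>=1, </s>=2
# XLM-R / fairseq expects:  <s>=0, <pad>=1, </s>=2, <unk>=3, ..., <mask>=last
FAIRSEQ_OFFSET = 1  # regular SentencePiece ids are shifted up by one


def spm_to_xlmr_id(spm_id):
    special = {1: 0, 2: 2, 0: 3}  # <s> -> 0, </s> -> 2, <unk> -> 3
    if spm_id in special:
        return special[spm_id]
    # <pad> (id 1) and <mask> (id spm_vocab_size + 1) never come out of
    # SentencePiece at all; every other piece is shifted up by one.
    return spm_id + FAIRSEQ_OFFSET
```

That remapping is, as I understand it, the reason this tokenizer overrides more methods than the other models' tokenizers.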

chenmoneygithub commented 1 year ago

@abheesht17 Thanks! For people reading the code, it is arguably worth adding a comment along the lines of "This tokenizer overrides more methods than usual because it needs to remap special tokens."

abuelnasr0 commented 1 year ago

@abheesht17 The comments about the re-mapping were vague to me, but the Hugging Face link clarified it. Thanks!

soma2000-lang commented 1 year ago

@chenmoneygithub I will add the comment to the code.

mattdangerw commented 1 year ago

I think we can probably go ahead and close this one, as I think the tokenizer is working as intended.

@abuelnasr0 you are welcome to go straight to a PR if you have some comment clarifications you would like to add to the implementation!

But in terms of requirements for XLMRobertaTokenizer, we should support loading the sentencepiece proto models unaltered from the original project, with all the "hacks" on top to support their tokenization flow.
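For example (a hedged sketch: the preset name and the `proto` argument reflect my reading of the current keras_nlp API, and the file path is a placeholder):

```python
import keras_nlp

# Typical flow: load the tokenizer from a preset, which bundles the
# original sentencepiece proto plus the special-token remapping.
tokenizer = keras_nlp.models.XLMRobertaTokenizer.from_preset(
    "xlm_roberta_base_multi"
)

# Equivalent flow: point the tokenizer at the unaltered
# sentencepiece.bpe.model file released by the original project.
tokenizer = keras_nlp.models.XLMRobertaTokenizer(
    proto="path/to/sentencepiece.bpe.model"
)
```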

99% of users will just be using the pre-trained sentencepiece proto files here. If a user really did want to train a model from scratch with a vanilla sentencepiece model, that should already be well supported.
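For the from-scratch case, something like this should already work (a minimal sketch assuming the standard `sentencepiece` training API; the toy corpus and training options are illustrative only):

```python
import io

import keras_nlp
import sentencepiece as spm

# Train a tiny vanilla SentencePiece model in memory (word-level toy
# corpus, just to keep the sketch small).
corpus = ["the quick brown fox", "jumped over the lazy dog"]
model = io.BytesIO()
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(corpus),
    model_writer=model,
    vocab_size=10,
    model_type="WORD",
    unk_id=0,
    bos_id=1,
    eos_id=2,
)

# The serialized proto bytes (or a file path) can be handed straight to
# the tokenizer; the XLM-R special-token remapping is layered on top.
tokenizer = keras_nlp.models.XLMRobertaTokenizer(proto=model.getvalue())
print(tokenizer("the quick brown fox"))
```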

This is the "fun" of pre-trained model support: even if a preprocessing behavior is buggy or incorrect, you have to recreate the preprocessing used for pre-training exactly to get good results for fine-tuning and inference.