keras-team / keras-nlp

Modular Natural Language Processing workflows with Keras
Apache License 2.0
740 stars 218 forks source link

Fix lowercase bug in wordpiece tokenizer #1543

Closed abuelnasr0 closed 3 months ago

abuelnasr0 commented 3 months ago

This PR fixes a bug in wordpiece tokenizer and it addresses this comment https://github.com/keras-team/keras-nlp/issues/1395#issuecomment-2021543142

abuelnasr0 commented 3 months ago

@mattdangerw last two commits are not necessary. First, I changed it to two stars "**" instead of "६", because "**" is impossible to be in the string so there is no possibillity to build a wrong mask. but I forgot that even if the mask was build wrong, There will be no effect, because tf_text.case_fold_utf8(text) have no effect on "६".

anyways I have a question about this two lines: https://github.com/keras-team/keras-nlp/blob/9ac333581fa55216ecd11e0fc566e12daef1cf82/keras_nlp/tokenizers/word_piece_tokenizer.py#L152-L153

what is the purpose of them? at the first glance they were not necessary for me. and I removed them, and all tests passed. We are splitting in cjx_regex and whitespace down, so no need to add white space before and after every cjx char? WDYT?

mattdangerw commented 3 months ago

what is the purpose of them? at the first glance they were not necessary for me. and I removed them, and all tests passed. We are splitting in cjx_regex and whitespace down, so no need to add white space before and after every cjx char? WDYT?

We might not! Here's context https://github.com/keras-team/keras-nlp/pull/318 and https://github.com/google-research/bert/blob/master/tokenization.py#L251-L262

I'll pull in this fix for now, as we will probably be cutting a release tomorrow, but we can follow up with any additional changes here.