Closed abuelnasr0 closed 3 months ago
@mattdangerw last two commits are not necessary. First, I changed it to two stars "**"
instead of "६"
, because "**"
is impossible to be in the string so there is no possibillity to build a wrong mask. but I forgot that even if the mask was build wrong, There will be no effect, because tf_text.case_fold_utf8(text)
have no effect on "६".
anyways I have a question about this two lines: https://github.com/keras-team/keras-nlp/blob/9ac333581fa55216ecd11e0fc566e12daef1cf82/keras_nlp/tokenizers/word_piece_tokenizer.py#L152-L153
what is the purpose of them? at the first glance they were not necessary for me. and I removed them, and all tests passed. We are splitting in cjx_regex and whitespace down, so no need to add white space before and after every cjx char? WDYT?
what is the purpose of them? at the first glance they were not necessary for me. and I removed them, and all tests passed. We are splitting in cjx_regex and whitespace down, so no need to add white space before and after every cjx char? WDYT?
We might not! Here's context https://github.com/keras-team/keras-nlp/pull/318 and https://github.com/google-research/bert/blob/master/tokenization.py#L251-L262
I'll pull in this fix for now, as we will probably be cutting a release tomorrow, but we can follow up with any additional changes here.
This PR fixes a bug in wordpiece tokenizer and it addresses this comment https://github.com/keras-team/keras-nlp/issues/1395#issuecomment-2021543142