keras-team / keras-nlp

Modular Natural Language Processing workflows with Keras
Apache License 2.0

Add `special_tokens_in_strings` Arg to byte_pair_tokenizer. #1546

Open abuelnasr0 opened 3 months ago

abuelnasr0 commented 3 months ago

I opened this PR instead of keras-team/keras-nlp#1447. This PR:

  1. Adds a `special_tokens_in_strings` argument to byte_pair_tokenizer.
  2. Fixes the bug where `<s>` and `</s>` are tokenized to the same id.
  3. Moves the special-token check into the base class.

I also renamed `unsplittable_tokens` to `special_tokens` for consistency with the other tokenizers. I'm not sure whether the rename is necessary.
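
To illustrate the intended behavior, here is a minimal sketch built on a toy character-level tokenizer. It is not the keras-nlp implementation: `toy_tokenize`, `check_special_tokens`, `VOCAB`, and the ids are hypothetical names and values. It assumes the flag controls whether special tokens found in the input string are mapped to their own ids or split like ordinary text.

```python
import re

# Hypothetical toy vocabulary: two special tokens plus single characters.
VOCAB = {"<s>": 0, "</s>": 1, "<": 2, "s": 3, ">": 4, "/": 5, "h": 6, "i": 7}
SPECIAL_TOKENS = ["<s>", "</s>"]


def check_special_tokens(vocab, special_tokens):
    """Raise if a special token is missing from the vocabulary."""
    for token in special_tokens:
        if token not in vocab:
            raise ValueError(f"Special token {token!r} is not in the vocabulary.")


def toy_tokenize(text, vocab, special_tokens, special_tokens_in_strings=False):
    """Character-level stand-in for a BPE tokenizer (illustration only)."""
    check_special_tokens(vocab, special_tokens)
    if special_tokens_in_strings and special_tokens:
        # Split on the special tokens and keep each one as a whole piece.
        pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
        pieces = [p for p in re.split(pattern, text) if p]
    else:
        pieces = [text]

    ids = []
    for piece in pieces:
        if special_tokens_in_strings and piece in special_tokens:
            ids.append(vocab[piece])  # one id per special token
        else:
            ids.extend(vocab[ch] for ch in piece)  # ordinary text, char by char
    return ids


with_flag = toy_tokenize("<s>hi</s>", VOCAB, SPECIAL_TOKENS, special_tokens_in_strings=True)
without_flag = toy_tokenize("<s>hi</s>", VOCAB, SPECIAL_TOKENS)
print(with_flag)     # [0, 6, 7, 1]
print(without_flag)  # [2, 3, 4, 6, 7, 2, 5, 3, 4]
```

With the flag set, `<s>` and `</s>` each map to their own distinct id. Without it, they are broken up like any other text.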