keras-team / keras-nlp

Modular Natural Language Processing workflows with Keras
Apache License 2.0

keras_nlp.tokenizers.WordPieceTokenizer not reading txt vocabulary #663

Closed milmor closed 3 weeks ago

milmor commented 1 year ago

When passing the 'test_wiki' filename to WordPieceTokenizer:

tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary='test_wiki',
    sequence_length=seq_len + 1,
    lowercase=False,
)

I get the following error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [64], in <cell line: 1>()
----> 1 tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
      2     vocabulary='test_wiki',
      3     sequence_length=seq_len + 1,
      4     lowercase=False,
      5 )

File ~/anaconda3/envs/tf-28/lib/python3.8/site-packages/keras_nlp/tokenizers/word_piece_tokenizer.py:339, in WordPieceTokenizer.__init__(self, vocabulary, sequence_length, lowercase, strip_accents, split, split_on_cjk, suffix_indicator, oov_token, **kwargs)
    330 if oov_token not in self.vocabulary:
    331     raise RuntimeError(
    332         f'Cannot find `oov_token="{self.oov_token}"` in the '
    333         "vocabulary.\n"
   (...)
    336         "the `oov_token` argument when creating the tokenizer."
    337     )
--> 339 self._fast_word_piece = tf_text.FastWordpieceTokenizer(
    340     vocab=self.vocabulary,
    341     token_out_type=self.compute_dtype,
    342     suffix_indicator=suffix_indicator,
    343     unknown_token=oov_token,
    344     no_pretokenization=True,
    345     support_detokenization=True,
    346 )

File ~/anaconda3/envs/tf-28/lib/python3.8/site-packages/tensorflow_text/python/ops/fast_wordpiece_tokenizer.py:106, in FastWordpieceTokenizer.__init__(self, vocab, suffix_indicator, max_bytes_per_word, token_out_type, unknown_token, no_pretokenization, support_detokenization, model_buffer)
    102 _tf_text_fast_wordpiece_tokenizer_op_create_counter.get_cell().increase_by(
    103     1)
    105 if model_buffer is None:
--> 106   model_buffer = (pywrap_fast_wordpiece_tokenizer_model_builder
    107                   .build_fast_wordpiece_model(
    108                       vocab, max_bytes_per_word, suffix_indicator,
    109                       unknown_token, no_pretokenization,
    110                       support_detokenization))
    111 # Use uint8 tensor as a buffer for the model to avoid any possible changes,
    112 # for example truncation by '\0'.
    113 if isinstance(model_buffer, ops.Tensor):

RuntimeError: Tokens in the vocabulary must be unique.

But when passing exactly the same vocabulary in list format, there is no error:

tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(
    vocabulary=vocab,
    sequence_length=seq_len + 1,
    lowercase=False,
)

I'm saving the vocab created by:

vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    raw_train_ds,
    vocabulary_size=vocab_size,
    lowercase=False,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)

with the following:

def write_vocab_file(vocab_file, vocab):
    with open(vocab_file, 'w') as f:
        for token in vocab:
            print(token, file=f)

milmor commented 1 year ago

Sorry guys, I've found the bug: it's due to '\xa0'. How can I fix this? In the text file it is saved as a blank space.
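
For illustration, here is a minimal sketch of what I think happens (an assumption about the mechanism, not the actual keras_nlp file-loading code): Python treats '\xa0' as whitespace, so any whitespace stripping applied to the lines read back from the saved vocab file can collapse two tokens that differ only by a trailing '\xa0' into the same string:

# Minimal sketch, not the actual keras_nlp loading code.
# Python treats the non-breaking space '\xa0' as whitespace, so stripping
# lines read back from a vocab file can merge otherwise distinct tokens.
vocab = ["foo", "foo\xa0"]           # distinct tokens in the in-memory list

print("\xa0".isspace())              # True
stripped = [token.rstrip() for token in vocab]
print(stripped)                      # ['foo', 'foo'] -> duplicates after stripping
print(len(set(stripped)) == len(stripped))
# False, which matches "Tokens in the vocabulary must be unique."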

milmor commented 1 year ago

I have fixed this, haha. Just use tf_text.normalize_utf8(text, 'NFKD') to normalize the text before creating your vocab. Sorry guys.
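
For reference, this is roughly what the fix looks like on my side (a sketch, assuming raw_train_ds is the same tf.data dataset of raw text used above):

import tensorflow_text as tf_text

# Map '\xa0' (and similar compatibility characters) to plain spaces with
# NFKD normalization before the vocabulary is learned.
normalized_train_ds = raw_train_ds.map(
    lambda text: tf_text.normalize_utf8(text, 'NFKD')
)

vocab = keras_nlp.tokenizers.compute_word_piece_vocabulary(
    normalized_train_ds,
    vocabulary_size=vocab_size,
    lowercase=False,
    reserved_tokens=["[PAD]", "[UNK]", "[BOS]"],
)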

jbischof commented 1 year ago

Thanks @milmor, this is actually quite interesting!

@mattdangerw should we include a note in the docstring? It seems difficult to catch the exception without a lot of overhead.

mattdangerw commented 1 year ago

Yeah, this is interesting, thanks! @milmor, is it possible to repro the issue in a Colab? That way we could take a look at the issue without the fix and at the fix you applied.