Closed fecundf closed 3 weeks ago
Hi @fecundf -
I am getting the error "NameError: name 'sample_words' is not defined" while running your code snippet. Could you please provide the full code snippet so I can replicate the issue? I have attached a gist file for reference.
A much simpler example showing the problem (I edited the gist):
import keras_nlp

# This works
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=["Hi", "there", "people", "[UNK]"])
print(tokenizer.id_to_token(2))

# This shows the missing id_to_token method
tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer()
print(tokenizer.id_to_token(2))
An even better example:
import keras_nlp

# This works
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=["Hi", "there", "people", "[UNK]"])
print(tokenizer.id_to_token(tokenizer.token_to_id("people")))  # shows "people"
print(tokenizer.id_to_token(2))  # also shows "people"

# This shows the missing id_to_token and token_to_id methods
tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer()
print(tokenizer.id_to_token(tokenizer.token_to_id("A")))  # should show "A"
print(tokenizer.id_to_token(65))  # should show "A"
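Until the methods exist in the library, a stand-in is easy to write (a sketch, not part of keras_nlp): for a code-point tokenizer, the id-to-token mapping is just the Unicode code point, so chr() and ord() are all that's needed.

```python
# Stand-in lookups for a code-point tokenizer: the token id *is* the
# Unicode code point, so the mapping is chr()/ord().
def id_to_token(token_id: int) -> str:
    """Map a code-point id back to its character."""
    return chr(token_id)

def token_to_id(token: str) -> int:
    """Map a single character to its code-point id."""
    return ord(token)

print(id_to_token(65))                # A
print(id_to_token(token_to_id("A")))  # A (round-trips)
```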
Hi @fecundf
I have reproduced the issue with the latest code snippet, and it appears the id_to_token method is not implemented in the UnicodeCodepointTokenizer class. We need to check with the keras-nlp developer team. Thanks!
Describe the bug
The keras.io docs for UnicodeCodepointTokenizer at https://keras.io/api/keras_nlp/tokenizers/unicode_codepoint_tokenizer/ show methods for that tokenizer that don't exist, and this is impeding my trials of word-based vs. character-based tokenizing. In particular, the missing id_to_token means I can't drop UnicodeCodepointTokenizer in where I used to have a BERT-based tokenizer and see the debug output. I know it's trivial to write id_to_token when the tokens are code points, but it's one more thing I have to worry about when comparing these tokenizers, especially considering that it is documented as usable.
To Reproduce
Expected behavior
Printing the letter "H" in this example.
Additional context
Documentation at https://keras.io/api/keras_nlp/tokenizers/unicode_codepoint_tokenizer/ says id_to_token exists for this class, and it makes sense for all tokenizers to implement these methods.
Would you like to help us fix it? Not sure I can.