keras-team / keras-nlp

Modular Natural Language Processing workflows with Keras
Apache License 2.0
730 stars 215 forks

Documented `id_to_token` doesn't exist for UnicodeCodepointTokenizer #1631

Closed fecundf closed 3 weeks ago

fecundf commented 1 month ago

Describe the bug

The keras.io docs for UnicodeCodepointTokenizer at https://keras.io/api/keras_nlp/tokenizers/unicode_codepoint_tokenizer/ list methods for that tokenizer that don't actually exist, and this is impeding my comparison of word-based vs. character-based tokenizing. In particular, the missing id_to_token means I can't drop UnicodeCodepointTokenizer in where I previously used a BERT-based tokenizer and see the debug output.

I know it's trivial to write id_to_token when the tokens are code points, but it's one more thing I have to worry about when comparing these tokenizers, especially since the method is documented as available.
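For reference, here is a minimal sketch of that trivial mapping (the helper names are my own, not part of the KerasNLP API): since UnicodeCodepointTokenizer's token ids are raw Unicode code points, the two directions reduce to Python's built-in chr() and ord():

```python
# Hypothetical stand-ins for the missing methods; names are mine, not KerasNLP's.
# A codepoint tokenizer's id -> token mapping is just chr(), and token -> id is ord().
def codepoint_id_to_token(token_id: int) -> str:
    """Map a Unicode code point (the token id) back to its character."""
    return chr(token_id)

def codepoint_token_to_id(token: str) -> int:
    """Map a single character to its Unicode code point (the token id)."""
    return ord(token)

print(codepoint_id_to_token(72))    # "H"
print(codepoint_token_to_id("A"))   # 65
```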

To Reproduce

tokenizer=keras_nlp.tokenizers.UnicodeCodepointTokenizer()
sample_tokens_int = [
    token.tolist() for word in sample_words for token in tokenizer("Hi")
]
# below gives error
# NotImplementedError: No implementation of `id_to_token()` was found for UnicodeCodepointTokenizer.
print(tokenizer.id_to_token(sample_tokens_int[0]))

Expected behavior

Printing the letter "H" in this example.

Additional context

Documentation at https://keras.io/api/keras_nlp/tokenizers/unicode_codepoint_tokenizer/ says id_to_token exists for this class, and it makes sense for all tokenizers to implement it.

Would you like to help us fix it? I'm not sure I can.

mehtamansi29 commented 1 month ago

Hi @fecundf -

I am getting the error "NameError: name 'sample_words' is not defined" while running your code snippet. Could you please provide the full code snippet for replicating the issue? I've attached a gist file for reference.

fecundf commented 1 month ago

Here's a much simpler example showing the problem (I edited the gist):

# This works
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=['Hi', 'there', 'people', '[UNK]'])
print(tokenizer.id_to_token(2))

# This shows the missing id_to_token method
tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer()
print(tokenizer.id_to_token(2))
fecundf commented 1 month ago

An even better example:

# This works
tokenizer = keras_nlp.tokenizers.WordPieceTokenizer(vocabulary=['Hi', 'there', 'people', '[UNK]'])
print(tokenizer.id_to_token(tokenizer.token_to_id("people")))  # shows "people"
print(tokenizer.id_to_token(2))  # also shows "people"

# This shows the missing id_to_token, token_to_id methods
tokenizer = keras_nlp.tokenizers.UnicodeCodepointTokenizer()
print(tokenizer.id_to_token(tokenizer.token_to_id("A")))  # should show "A"
print(tokenizer.id_to_token(65))  # should show "A"
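As a stopgap while the methods are missing, the expected round-trip behavior above can be sketched with a small shim class (my own naming, not part of KerasNLP) that supplies the documented id_to_token/token_to_id contract for code-point tokens:

```python
# Hypothetical shim (names are mine): supplies the documented id_to_token /
# token_to_id behavior when token ids are raw Unicode code points.
class CodepointVocabShim:
    """Stand-in for the missing UnicodeCodepointTokenizer methods."""

    def id_to_token(self, token_id: int) -> str:
        # Token ids are Unicode code points, so decoding is just chr().
        return chr(token_id)

    def token_to_id(self, token: str) -> int:
        # Encoding a single character is just ord().
        return ord(token)

shim = CodepointVocabShim()
print(shim.id_to_token(shim.token_to_id("A")))  # "A"
print(shim.id_to_token(65))                     # "A"
```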
mehtamansi29 commented 1 month ago

Hi @fecundf

I have reproduced the issue with the latest code snippet, and it seems the id_to_token method is not implemented in the UnicodeCodepointTokenizer class. We need to check with the keras-nlp developer team. Thanks!