studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0
705 stars 102 forks source link

Untokenization #142

Closed taghreed34 closed 2 years ago

taghreed34 commented 2 years ago

Is there a way (a built-in method on the model, or a set of rules to follow) to detokenize the tokens produced by the LUKE tokenizer? I tokenized some text and dropped a number of random tokens, and now I want to merge the remaining tokens to construct a new text.

I tried to do this with a reduce function, inserting spaces between tokens depending on whether a token is punctuation, and stripping the "Ġ" character from the beginning of tokens. However, those rules turned out to be insufficient: when I re-tokenize the reconstructed text, I get more tokens than expected.

ryokan0123 commented 2 years ago

Yes, Hugging Face tokenizers provide a method for exactly that purpose: convert_tokens_to_string.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("studio-ousia/luke-base")
tokens = tokenizer.tokenize("Hello world!")
print(tokens) # >> ['Hello', 'Ġworld', '!']
print(tokenizer.convert_tokens_to_string(tokens)) # >> Hello world!
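For intuition about what the method does under the hood: LUKE uses a RoBERTa-style byte-level BPE vocabulary, where "Ġ" encodes the space byte that precedes a word. A rough approximation of the merging rule (hypothetical helper; the real convert_tokens_to_string additionally performs full byte-level decoding, so this sketch only covers plain ASCII text) looks like:

```python
def merge_bpe_tokens(tokens):
    """Approximate detokenization for RoBERTa-style byte-level BPE tokens.

    In this vocabulary, a leading "Ġ" marks a token that starts a new
    whitespace-separated word, so concatenating the tokens and replacing
    "Ġ" with a space recovers the original (ASCII) text.
    """
    return "".join(tokens).replace("Ġ", " ").strip()

# Tokens as produced by the LUKE/RoBERTa tokenizer for "Hello world!"
tokens = ["Hello", "Ġworld", "!"]
print(merge_bpe_tokens(tokens))  # >> Hello world!

# Dropping a token and merging the rest still yields well-formed text
dropped = [t for t in tokens if t != "Ġworld"]
print(merge_bpe_tokens(dropped))  # >> Hello!
```

This also shows why space-vs-punctuation heuristics fall short: the spacing information lives in the "Ġ" markers themselves, not in the token content, so the tokenizer's own method (or the markers) should be used rather than hand-written rules.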
taghreed34 commented 2 years ago

@Ryou0634 Great, thanks