Closed taghreed34 closed 2 years ago
Yes, for that purpose, Hugging Face tokenizers have a useful method: `convert_tokens_to_string`.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("studio-ousia/luke-base")
tokens = tokenizer.tokenize("Hello world!")
print(tokens)  # >> ['Hello', 'Ġworld', '!']
print(tokenizer.convert_tokens_to_string(tokens))  # >> Hello world!
```
@Ryou0634 Great, thanks
Is there a way (a built-in method on the model, or a set of rules to follow) to untokenize the text tokens produced by the LUKE tokenizer? I tokenized some text, dropped a number of random tokens, and now I want to merge the remaining tokens back into a new text.
I tried doing this with `reduce`, inserting spaces between words depending on whether a token is punctuation, and also stripping the "Ġ" character from the beginning of tokens. But those rules turned out to be insufficient: after joining, the reconstructed text produces more tokens than expected when re-tokenized.
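For reference, the reason punctuation-based spacing rules fall short is that in LUKE's RoBERTa-style byte-level BPE, the "Ġ" prefix itself encodes where the spaces go: a token starting with "Ġ" begins a new whitespace-separated word, and any token without it attaches directly to the previous one. A minimal manual join based on that rule can be sketched as below (a simplification: it ignores the byte-level escapes used for non-ASCII text, which `convert_tokens_to_string` handles properly, so prefer the built-in method when available):

```python
def detokenize_bpe(tokens):
    """Join byte-level BPE tokens into text.

    The 'G-with-breve' prefix (Ġ) marks a preceding space, so replacing
    it with ' ' and concatenating reconstructs the word boundaries.
    Simplified sketch: does not decode byte-level escapes for non-ASCII
    characters the way tokenizer.convert_tokens_to_string does.
    """
    return "".join(t.replace("Ġ", " ") for t in tokens).strip()

print(detokenize_bpe(['Hello', 'Ġworld', '!']))          # Hello world!
# Still yields valid text after dropping a token:
print(detokenize_bpe(['Hello', 'Ġworld']))               # Hello world
```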