Maybe it's as simple as
tokenizer.add_tokens(AddedToken("<AGE>", single_word=True))
to match all <AGE> tokens?
That wouldn't really handle the regex case, but I could enumerate a practical number of potential regex matches.
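Concretely, enumerating the tags would look something like this (a sketch; the file paths and tag names are placeholders, and the tokenizer class is just an example):

from tokenizers import CharBPETokenizer, AddedToken

# stand-in paths for whichever trained tokenizer is in use
tokenizer = CharBPETokenizer(vocab_file="vocab.json", merges_file="merges.txt")

# enumerate the tags the regex would have matched
tags = ["<AGE>", "<DATE>", "<NAME>"]
tokenizer.add_tokens([AddedToken(t, single_word=True) for t in tags])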
Indeed, you can add tokens this way (add_tokens expects a list, but that's the idea). These won't be tokenized and will be returned as given. We don't provide any way to add regexes at the moment, though.
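For example (a minimal sketch; the file paths are placeholders for an already-trained tokenizer, and argument names can differ between tokenizers versions):

from tokenizers import CharBPETokenizer, AddedToken

tokenizer = CharBPETokenizer(vocab_file="vocab.json", merges_file="merges.txt")

# add_tokens takes a list of strings and/or AddedToken objects
tokenizer.add_tokens([AddedToken("<AGE>", single_word=True)])

# the added token is matched on the raw input and comes back unsplit
print(tokenizer.encode("Patient is <AGE> years old").tokens)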
Thanks, I'm able to match the processed representation of the added tokens.
Hi, oh, that's exactly my question; I didn't see this issue before posting mine. I'm still at a loss as to how to handle these special tokens. When I try adding them, the tokenization does not change:
from tokenizers import ByteLevelBPETokenizer, AddedToken
tokenizer = ByteLevelBPETokenizer(
vocab_file="encoder.json",
merges_file="vocab.bpe",
)
tokenizer.add_tokens([AddedToken("<|endoftext|>")]) # returns 1; same result with: tokenizer.add_tokens(["<|endoftext|>"])
tokenizer.encode("<|endoftext|>").ids # [30, 94, 418, 8316, 20340, 94, 32] -- the token gets split instead of matched
# the token is present in the dictionary:
tokenizer.get_vocab()["<|endoftext|>"] # 2
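For completeness, registering it as a special token instead may behave differently (a guess on my part, not a confirmed fix: add_special_tokens does exist on these tokenizer classes, but I haven't verified it changes the result here):

# continuing the snippet above
tokenizer.add_special_tokens(["<|endoftext|>"])
tokenizer.encode("<|endoftext|>").ids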
We are currently using the subword-nmt BPE tokenizer for a job, and we use its "glossary" parameter to be able to ignore certain symbols using regular expressions.
I understand that Tokenizers has the ability to specify special tokens, but these are removed during decoding.
Is there any good way to add a glossary à la subword-nmt BPE, using regexes that will be left alone on both encode and decode?
https://github.com/rsennrich/subword-nmt
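On the decoding half specifically, the removal mentioned above looks controllable: decode() takes a skip_special_tokens flag. A sketch, under the assumption that the glossary symbols are registered as special tokens (the paths and the tag are placeholders, and this still leaves the regex part unanswered):

from tokenizers import CharBPETokenizer

tokenizer = CharBPETokenizer(vocab_file="vocab.json", merges_file="merges.txt")
tokenizer.add_special_tokens(["<AGE>"])

ids = tokenizer.encode("Patient is <AGE> years old").ids
# decode() strips special tokens by default; skip_special_tokens=False keeps them
print(tokenizer.decode(ids, skip_special_tokens=False))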