huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

support for subword-nmt-style glossaries? #317

Closed timothyjlaurent closed 4 years ago

timothyjlaurent commented 4 years ago

We are currently using the subword-nmt BPE tokenizer for a job, and we rely on its "glossaries" parameter to ignore certain symbols using regular expressions.

I understand that Tokenizers has the ability to specify special tokens, but these are removed during decoding.

Is there a good way to add a glossary à la subword-nmt BPE, using regexes whose matches are left alone on both encode and decode?

https://github.com/rsennrich/subword-nmt

support for glossaries: use the argument --glossaries for subword-nmt apply-bpe to provide a list of words and/or regular expressions that should always be passed to the output without subword segmentation
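For context, this is roughly how we call it today through the Python API (a minimal sketch; the codes path and glossary entries below are placeholders, not our real configuration):

import codecs
from subword_nmt.apply_bpe import BPE

# load trained BPE codes and declare glossary entries (plain strings or regexes)
# that should never be segmented
with codecs.open("codes.bpe", encoding="utf-8") as codes:
    bpe = BPE(codes, glossaries=["<AGE>", "[0-9]+"])

# glossary matches pass through untouched; everything else gets BPE-segmented
print(bpe.process_line("patient is <AGE> , i.e. 42 years old"))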

timothyjlaurent commented 4 years ago

Maybe it's as simple as

tokenizer.add_tokens(AddedToken("<AGE>", single_word=True)) 

To match all <AGE> tokens?

That wouldn't really handle the regex case, but I could enumerate a practical number of potential regex matches.
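Something like this is what I have in mind, assuming the pattern has a manageable number of matches (the file paths and the range here are made up):

from tokenizers import ByteLevelBPETokenizer, AddedToken

tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")  # placeholder files

# if the glossary regex is bounded, e.g. [0-9]{1,3}, enumerate every possible
# match and register each one as a single-word token so it never gets split
numeric_matches = [str(n) for n in range(1000)]  # "0" ... "999"
tokenizer.add_tokens([AddedToken(m, single_word=True) for m in numeric_matches])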

n1t0 commented 4 years ago

Indeed you can add tokens this way (add_tokens expects a list, but that's the idea). These won't be tokenized and will be returned as given. We don't provide any way to add regex at the moment though.
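For example, something along these lines (a quick sketch; the vocab/merges paths and the placeholder tokens are just for illustration):

from tokenizers import ByteLevelBPETokenizer, AddedToken

# assuming an already-trained tokenizer
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")

# add_tokens takes a list of strings and/or AddedToken objects
tokenizer.add_tokens([
    AddedToken("<AGE>", single_word=True),
    AddedToken("<DATE>", single_word=True),
])

encoding = tokenizer.encode("patient is <AGE> as of <DATE>")
print(encoding.tokens)                 # "<AGE>" and "<DATE>" stay single tokens
print(tokenizer.decode(encoding.ids))  # both are kept in the decoded text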

timothyjlaurent commented 4 years ago

Thanks, I'm able to match the processed representation of the added tokens.

jchwenger commented 4 years ago

Hi, that's exactly my question; I didn't see this issue before posting mine. I'm still at a loss as to how to handle these special tokens. When I try adding them, the tokenization does not change:

from tokenizers import ByteLevelBPETokenizer, AddedToken
tokenizer = ByteLevelBPETokenizer(
    vocab_file="encoder.json",
    merges_file="vocab.bpe",
)
tokenizer.add_tokens([AddedToken("<|endoftext|>")])  # same result with: tokenizer.add_tokens(["<|endoftext|>"]) -> 1
tokenizer.encode("<|endoftext|>").ids  # [30, 94, 418, 8316, 20340, 94, 32]
# the token is present in the dictionary:
tokenizer.get_vocab()["<|endoftext|>"] # 2