huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

BPE trainer ignoring special tokens. #1616

Open henrycharlesworth opened 3 weeks ago

henrycharlesworth commented 3 weeks ago

I am trying to train a custom tokenizer. My use case is assembly code, so I want merges to be possible across full instructions (potentially multiple "words"). To do this, I replace all spaces with a dummy token (e.g. "<space>") and use a pre-tokenizer that splits on "\n". This basically works, but the issue comes when I try to add special tokens. The following is a simple example that reproduces the issue:

from tokenizers import Tokenizer, Regex
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Sequence as PretokenizerSequence, Split
from tokenizers.normalizers import Sequence as NormalizerSequence, Replace, BertNormalizer, Strip

corpus_file = "corpus.txt"
special_tokens = [
    "<s>",
    "<pad>",
    "</s>",
    "<unk>"
]
for i in range(20):
    special_tokens.append(f"<disasm_function_{i}>")
    special_tokens.append(f"<disasm_string_{i}>")

tokenizer = Tokenizer(BPE())
tokenizer.add_special_tokens(special_tokens)

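# normalize whitespace, then replace each remaining space with a literal
# "<space>" token so that BPE merges can span multiple words in an instruction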
tokenizer.normalizer = NormalizerSequence([
    Strip(),
    BertNormalizer(clean_text=True, strip_accents=True, lowercase=True),
    Replace(Regex(r"\s{2,}"), " "),
    Replace(" ", "<space>")
])
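# split only on newlines (dropping them), so each full instruction line becomes
# a single pre-tokenization unit that merges can grow across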
tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
])

trainer = BpeTrainer(
    special_tokens=special_tokens, vocab_size=10000, min_frequency=2,
)
tokenizer.train(files=[corpus_file], trainer=trainer)

tokenizer.save("example_tokenizer.json")

An example segment of the corpus I am training on looks something like this:

lea rsi,<code_addr_1> <string_literal><disasm_string_0></string_literal> <eoi>
mov edi, eax <eoi>
call <external>::<function_name><disasm_function_1></function_name> <eoi>
mov rax, qword ptr <local_var_0> <eoi>
mov rdi, rax <eoi>
call <external>::<function_name><disasm_function_2></function_name> <eoi>
mov rax, qword ptr <local_var_0> <eoi>
mov rax, qword ptr [rax]<unk_0> <eoi>
mov rdi, rax <eoi>
call <external>::<function_name><disasm_function_3></function_name> <eoi>

so the aim is to ensure that each of the special tokens is always a single token. This works at test time (i.e. these special tokens are always tokenized as single tokens), but it is clearly not happening during BPE training. If I examine the tokens/merges I get out, many of them contain the special tokens. For example, from the resulting JSON file:

"</return_val><space><calling_conv>stdcall</calling_conv><func_name><disasm_function_0></func_name><parameters>(": 370,
"pop<space>r1": 371,
"call<space><external>::<function_name><disasm_function_2></function_name><space><eoi>": 372,

You can see that these learned vocabulary entries contain the special tokens within them.

Is this expected behaviour? My assumption was that the BPE trainer would prevent this from happening, since I provide it with a list of the special tokens (why else would it need this argument?). It is also not desirable to fill up the vocab with lots of merges that will never be valid.

Is there any way to stop this from happening (or is there anything I haven't set up properly)?
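
For reference, here is a quick way to quantify how many learned vocabulary entries cross a special-token boundary (a minimal sketch; it assumes the special_tokens list from above and the example_tokenizer.json file produced by the script):

from tokenizers import Tokenizer

tok = Tokenizer.from_file("example_tokenizer.json")
vocab = tok.get_vocab()

# learned entries that embed a special token as a proper substring, i.e.
# merges that crossed a special-token boundary during training
offending = [
    entry for entry in vocab
    if any(special in entry and entry != special for special in special_tokens)
]
print(f"{len(offending)} vocab entries contain a special token, e.g. {offending[:3]}")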

EDIT:

My current horrible workaround is to do:

tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
] + [Split(tok, behavior="isolated") for tok in special_tokens])

which seems to work, but can't be the best way.
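
A slightly tidier variant of the same idea (just a sketch; it assumes Split accepts a tokenizers.Regex pattern, and it reuses tokenizer and special_tokens from above) is to isolate all the special tokens with a single alternation regex rather than one Split per token:

import re
from tokenizers import Regex
from tokenizers.pre_tokenizers import Sequence as PretokenizerSequence, Split

# one pattern matching any special token, so a single Split isolates them all
special_pattern = Regex("|".join(re.escape(tok) for tok in special_tokens))
tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed"),
    Split(special_pattern, behavior="isolated"),
])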

ArthurZucker commented 3 weeks ago

Hey! You are adding the tokens before initializing the normalizer; this worked for me:

from tokenizers import Tokenizer, Regex
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Sequence as PretokenizerSequence, Split
from tokenizers.normalizers import Sequence as NormalizerSequence, Replace, BertNormalizer, Strip

corpus_file = "corpus.txt"
special_tokens = [
    "<s>",
    "<pad>",
    "</s>",
    "<unk>"
]
for i in range(20):
    special_tokens.append(f"<disasm_function_{i}>")
    special_tokens.append(f"<disasm_string_{i}>")

tokenizer = Tokenizer(BPE())
- tokenizer.add_special_tokens(special_tokens)

tokenizer.normalizer = NormalizerSequence([
    Strip(),
    BertNormalizer(clean_text=True, strip_accents=True, lowercase=True),
    Replace(Regex(r"\s{2,}"), " "),
    Replace(" ", "<space>")
])
tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed")
])
+ tokenizer.add_special_tokens(special_tokens)
trainer = BpeTrainer(
    special_tokens=special_tokens, vocab_size=10000, min_frequency=2,
)
tokenizer.train(files=[corpus_file], trainer=trainer)

tokenizer.save("example_tokenizer.json")

henrycharlesworth commented 3 weeks ago

So I tried this and for me it still gives exactly the same result. It works at test time (as did the previous version), but during training it is still merging across the special tokens.

ArthurZucker commented 3 weeks ago

You are right, sorry. Here is a PR with a fix; not sure why we never had that.