huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

`BPE` tokenization model does not respect custom `RegEx` via `Split` pre-tokenizer #1369

Closed · hogru closed this 8 months ago

hogru commented 11 months ago

I am trying to understand the following behaviour. I am not sure if it is a bug or if I am missing something.

For context, I want to find the "best" tokenizer and therefore try a number of pre-tokenizers and tokenization models. The pre-tokenizer is always a (varying) RegEx and the tokenization models are WordLevel, BPE, etc.

Please consider that I am still on tokenizers v0.13.4, since I've read about some breaking changes in v0.14.

WordLevel works as expected

I build the WordLevel tokenizer in the following way:

tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
tokenizer.pre_tokenizer = Split(  
    pattern=regex_pattern, behavior="isolated", invert=False  
)
trainer = WordLevelTrainer(  
    special_tokens=special_tokens,  
    min_frequency=1,  
    show_progress=True,  
)  
tokenizer.train_from_iterator(train_source, trainer=trainer)
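For completeness, these snippets assume the usual tokenizers imports, and that unk_token, special_tokens, regex_pattern, vocab_size, min_frequency and train_source are defined elsewhere in my code:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel, BPE
from tokenizers.pre_tokenizers import Split
from tokenizers.trainers import WordLevelTrainer, BpeTrainer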

This results in the following vocabulary (excerpt from tokenizer.json):

"model": {
    "type": "WordLevel",
    "vocab": {
      "^": 0,
      "_": 1,
      " ": 2,
      "§": 3,
      "°": 4,
      "c": 5,
      "C": 6,
      "(": 7,
      ")": 8,
      "1": 9,
      "O": 10,
      "2": 11,
      "N": 12,
      "=": 13,
      "[": 14,
      "]": 15,
      "H": 16,
      "n": 17,
      "3": 18,
      "@@": 19,
      "@": 20,
      "F": 21,
      "+": 22,
      "-": 23,
      "S": 24,
      "Cl": 25,
      "s": 26,
      "o": 27,
      "4": 28,
      "/": 29,
      "#": 30,
      "Br": 31,
      "Sc": 32,
      "\\": 33,
      "5": 34,
      "I": 35,
      "-2": 36,
      "P": 37,
      "6": 38,
      "-3": 39,
      "-4": 40,
      "7": 41,
      "8": 42,
      "-1": 43,
      "-5": 44
    }

The vocabulary reflects the split performed by my RegEx. For example, please note the Br token in line 35, which represents bromine (we are in the chemistry domain).
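As a quick sanity check (just a sketch; the sample string below is an illustrative SMILES-like fragment, not taken from my actual training data), encoding something that contains Br yields it as a single token:

sample = "Brc1ccccc1"  # hypothetical example input
encoding = tokenizer.encode(sample)  # WordLevel tokenizer trained as above
print(encoding.tokens)  # expected: ['Br', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']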

So far, so good.

BPE does not work as expected

Now, I am trying to combine the RegEx from above with the BPE trainer. For that, I build the tokenizer as follows:

tokenizer = Tokenizer(BPE(unk_token=unk_token))  # Change from WordLevel to BPE
tokenizer.pre_tokenizer = Split(
    pattern=regex_pattern, behavior="isolated", invert=False  
)  # Same as above
trainer = BpeTrainer(  
    vocab_size=vocab_size,  
    special_tokens=special_tokens,  
    min_frequency=1,  
    show_progress=True,  
)  # Change from WordLevelTrainer to BpeTrainer and add vocab_size parameter
tokenizer.train_from_iterator(train_source, trainer=trainer)  # Same as above

My expectation at this point would be to keep the tokens from the WordLevelTrainer run, e.g. Br, and to get additional merges from the BPE algorithm on top of them. This is the resulting vocabulary (again, an excerpt from tokenizer.json):

"vocab": {
      "^": 0,
      "_": 1,
      " ": 2,
      "§": 3,
      "°": 4,
      "#": 5,
      "(": 6,
      ")": 7,
      "+": 8,
      "-": 9,
      "/": 10,
      "1": 11,
      "2": 12,
      "3": 13,
      "4": 14,
      "5": 15,
      "6": 16,
      "7": 17,
      "8": 18,
      "=": 19,
      "@": 20,
      "B": 21,
      "C": 22,
      "F": 23,
      "H": 24,
      "I": 25,
      "N": 26,
      "O": 27,
      "P": 28,
      "S": 29,
      "[": 30,
      "\\": 31,
      "]": 32,
      "c": 33,
      "l": 34,
      "n": 35,
      "o": 36,
      "r": 37,
      "s": 38,
      "@@": 39,
      "Cl": 40,
      "Br": 41,
      "Sc": 42,
      "-2": 43,
      "-3": 44,
      "-4": 45,
      "-1": 46,
      "-5": 47
    },
    "merges": [
      "@ @",
      "C l",
      "B r",
      "S c",
      "2",
      "3",
      "4",
      "1",
      "5"
    ]

As we can see, the tokens created by the RegEx are gone, i.e. Br is split into B on line 21 and r on line 37. It is the BPE algorithm that merges them and adds Br back to the vocabulary (line 54).

I would like to see Br as a token created by my RegEx and have it potentially merged with other, more complicated tokens (not described here), e.g. Br and [C@@H].

Ideally, I can fix this behaviour (which might be possible and I just don't know how to do it; none of my code variations gave me the results I want) or at least understand why this happens / is not possible.

Thanks for reading until here :-)

ArthurZucker commented 11 months ago

The thing is, even if you split the text with a regex, each split is probably going to be processed afterwards and will thus be subject to the underlying model. If there are tokens that you want to isolate / make sure are not split, you should add them to the vocabulary using tokenizer.add_tokens(["Br"]), which in this case means they will never be seen by the model.
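Roughly something like this (just a sketch, with a made-up sample string; unk_token taken from your config):

from tokenizers import Tokenizer
from tokenizers.models import BPE

bpe_tokenizer = Tokenizer(BPE(unk_token="§"))
bpe_tokenizer.add_tokens(["Br"])  # "Br" becomes an added token, matched before pre-tokenization and the model
# ... set your Split pre-tokenizer and train with BpeTrainer as before ...
print(bpe_tokenizer.encode("Brc1").tokens)  # "Br" should come back as a single token, e.g. ['Br', 'c', '1']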

ArthurZucker commented 11 months ago

Would that work for you?

hogru commented 11 months ago

Hi @ArthurZucker, thank you for pointing me in this direction. Correct me if I do not fully get the intention of your suggestion, but I think I can't simply add a token.

Before I try this... any additional thoughts/suggestions?

hogru commented 11 months ago

Hi @ArthurZucker, I tried (my understanding of) your approach and ultimately failed at it. I would therefore appreciate it if you could take a look and tell me whether I am missing something.

Btw, I upgraded to tokenizers v0.14.1 if this is relevant.

The code I tried is this:

# First, I train with my RegEx and a WordLevel Trainer as this results in the vocab I want
wordlevel_tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
wordlevel_tokenizer.pre_tokenizer = Split(
    pattern=regex_pattern, behavior="isolated", invert=False
)
trainer = WordLevelTrainer(
    special_tokens=special_tokens,
    min_frequency=1,
    show_progress=True,
)
wordlevel_tokenizer.train_from_iterator(train_source, trainer=trainer)

# At this point, I have a tokenizer with the vocab I want
# Then, I build a new BPE tokenizer and try to prepopulate it with the vocab from the tokenizer above
bpe_tokenizer = Tokenizer(BPE(unk_token=unk_token))
vocab_wordlevel = list(wordlevel_tokenizer.get_vocab().keys())  # For this example I skip any mgmt of special tokens
bpe_tokenizer.add_tokens(vocab_wordlevel)

# At this point, in the debugger I can see the correct/expected vocab when I run `bpe_tokenizer.get_vocab()`
# Finally, I finish training of the BPE tokenizer
# Not sure if the pre-tokenizer is necessary at this point; tried with and without, neither works
# tokenizer.pre_tokenizer = Split(
#     pattern=regex_pattern, behavior="isolated", invert=False
# )
trainer = BpeTrainer(
    vocab_size=vocab_size,
    special_tokens=special_tokens,
    min_frequency=min_frequency,
    show_progress=True,
)
bpe_tokenizer.train_from_iterator(train_source, trainer=trainer)
# Now, the vocab is wrong

I can see the added tokens in tokenizer.json, for example:

    {
      "id": 4,
      "content": "°",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 58,
      "content": "Br",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": true,
      "special": false
    },

But the resulting vocabulary is not as expected; it looks as if I had never added the tokens upfront. From the initial example, I would expect a "native/original" Br token, added from the WordLevel vocab, not one created by BPE merging the tokens B and r.

"model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": "§",
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "byte_fallback": false,
    "vocab": {
        "^": 0,
        "_": 1,
        " ": 2,
        "§": 3,
        "°": 4,
        "#": 5,
        "(": 6,
        ")": 7,
        "+": 8,
        "-": 9,
        "/": 10,
        "1": 11,
        "2": 12,
        "3": 13,
        "4": 14,
        "5": 15,
        "6": 16,
        "7": 17,
        "=": 18,
        "@": 19,
        "B": 20,
        "C": 21,
        "F": 22,
        "H": 23,
        "I": 24,
        "N": 25,
        "O": 26,
        "S": 27,
        "[": 28,
        "\\": 29,
        "]": 30,
        "c": 31,
        "l": 32,
        "n": 33,
        "o": 34,
        "r": 35,
        "s": 36,
        "@@": 37,
        "Cl": 38,
        "Br": 39,
        "Sc": 40,
        "-3": 41
    },
    "merges": [
        "@ @",
        "C l",
        "B r",
        "S c",
        "- 3"
    ]
}

I am not sure what I am doing wrong here. I would appreciate any help.

ArthurZucker commented 10 months ago

Hey! When you add a token using tokenizer.add_tokens, it does not add the token to the vocab of the model but to the added_vocab of the tokenizer. Thus it's expected that it does not appear in the model vocab. If you print:

tokenizer.get_added_tokens_decoder()

you'll see the tokens that were added. If a token is properly added, it will not be the result of a merge.
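For example (a sketch, reusing the variable names from your snippet):

print(bpe_tokenizer.get_added_tokens_decoder())          # added tokens, keyed by id
print(bpe_tokenizer.get_vocab(with_added_tokens=False))  # the model vocab only
print(bpe_tokenizer.get_vocab(with_added_tokens=True))   # model vocab plus added tokens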

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

hogru commented 9 months ago

Ok, I need to investigate this further. Maybe we need to sync on wording/meaning first: in my understanding, the model (in this case BPE) is part of the tokenizer, as documented here. Semantics aside, I think the practical test is whether I can encode B or r on their own. If the tokenizer/model works "correctly" (in this context, i.e. does what I am trying to achieve), it should not be able to encode them, since Br (39) is the only correct token. But from looking at my example config, I assume that it could encode the B as 20 and the r as 35. Will try that out.
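Something like the following is what I have in mind as a test (a sketch, using the BPE tokenizer built above):

for text in ["Br", "B", "r"]:
    enc = bpe_tokenizer.encode(text)
    print(text, enc.tokens, enc.ids)
# I'd want 'Br' to come back as a single token and 'B' / 'r' on their own to fail (or map to unk),
# but from the config above it looks like 'B' would encode to [20] and 'r' to [35]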

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.