The thing is, even if you split the text with a regex, each split is probably going to be processed afterwards and will thus be subject to the underlying model. If there are tokens that you want to isolate / make sure are not split, you should add them to the vocabulary using tokenizer.add_tokens(["Br"]), which in this case means they will never be seen by the model.
Would that work for you?
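A minimal sketch of that suggestion (purely illustrative; the tokenizer below is a stand-in, not the one from the issue):

# Tokens registered via add_tokens() are matched before the model runs,
# so "Br" is never handed to BPE and cannot be split into "B" + "r".
from tokenizers import Tokenizer
from tokenizers.models import BPE

bpe_tokenizer = Tokenizer(BPE())        # untrained BPE model, just for illustration
bpe_tokenizer.add_tokens(["Br"])        # isolate "Br" as an added token
print(bpe_tokenizer.encode("Br").tokens)  # expected: ['Br']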
Hi @ArthurZucker, thank you for hinting me in this direction. Correct me if I do not fully get the intention of your suggestion, but I think I can't simply add a token.

- add_tokens() is (mainly) used to add tokens to a pre-trained tokenizer in order to allow for special words; so I could take a pre-trained LLM and add a company-/domain-specific vocabulary.
- I could (a) train a RegEx-based WordLevel tokenizer, (b) read its tokens and (c) add them to the (trained) BPE tokenizer; but then I have tokens I do not want ("B" and "r" in my case).
- Alternatively, I could (a) train a WordLevel tokenizer, (b) create an "empty" BPE tokenizer, (c) take the tokens from the WordLevel tokenizer and add them to the BPE tokenizer, and (d) train the BPE tokenizer with the vocab_size of the tokenizer, which is one of my hyperparameters.

Before I try this... any additional thoughts/suggestions?
Hi @ArthurZucker, I tried (my understanding of) your approach and ultimately failed at it. I would therefore appreciate it if you could take a look and tell me whether I am missing something.
Btw, I upgraded to tokenizers v0.14.1, in case this is relevant.
The code I tried is this:
# Imports assumed for this snippet (tokenizers v0.14.1)
from tokenizers import Tokenizer
from tokenizers.models import WordLevel, BPE
from tokenizers.pre_tokenizers import Split
from tokenizers.trainers import WordLevelTrainer, BpeTrainer

# First, I train with my RegEx and a WordLevel Trainer as this results in the vocab I want
wordlevel_tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
wordlevel_tokenizer.pre_tokenizer = Split(
    pattern=regex_pattern, behavior="isolated", invert=False
)
trainer = WordLevelTrainer(
    special_tokens=special_tokens,
    min_frequency=1,
    show_progress=True,
)
wordlevel_tokenizer.train_from_iterator(train_source, trainer=trainer)
# At this point, I have a tokenizer with the vocab I want

# Then, I build a new BPE tokenizer and try to prepopulate it with the vocab from the tokenizer above
bpe_tokenizer = Tokenizer(BPE(unk_token=unk_token))
vocab_wordlevel = list(wordlevel_tokenizer.get_vocab().keys())  # For this example I skip any mgmt of special tokens
bpe_tokenizer.add_tokens(vocab_wordlevel)
# At this point, in the debugger I can see the correct/expected vocab when I run `bpe_tokenizer.get_vocab()`

# Finally, I finish training of the BPE tokenizer
# Not sure if the pre-tokenizer is necessary at this point; tried with and without, neither works
# bpe_tokenizer.pre_tokenizer = Split(
#     pattern=regex_pattern, behavior="isolated", invert=False
# )
trainer = BpeTrainer(
    vocab_size=vocab_size,
    special_tokens=special_tokens,
    min_frequency=min_frequency,
    show_progress=True,
)
bpe_tokenizer.train_from_iterator(train_source, trainer=trainer)
# Now, the vocab is wrong
I can see the added tokens in tokenizer.json, for example:
{
"id": 4,
"content": "°",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": false,
"special": true
},
{
"id": 58,
"content": "Br",
"single_word": false,
"lstrip": false,
"rstrip": false,
"normalized": true,
"special": false
},
But the resulting vocabulary is not as expected; it looks as if I had never added the tokens upfront. From the initial example, I would expect a "native/original" Br token, as it is added from the WordLevel vocab, not created by BPE merging the tokens B and r.
"model": {
"type": "BPE",
"dropout": null,
"unk_token": "§",
"continuing_subword_prefix": null,
"end_of_word_suffix": null,
"fuse_unk": false,
"byte_fallback": false,
"vocab": {
"^": 0,
"_": 1,
" ": 2,
"§": 3,
"°": 4,
"#": 5,
"(": 6,
")": 7,
"+": 8,
"-": 9,
"/": 10,
"1": 11,
"2": 12,
"3": 13,
"4": 14,
"5": 15,
"6": 16,
"7": 17,
"=": 18,
"@": 19,
"B": 20,
"C": 21,
"F": 22,
"H": 23,
"I": 24,
"N": 25,
"O": 26,
"S": 27,
"[": 28,
"\\": 29,
"]": 30,
"c": 31,
"l": 32,
"n": 33,
"o": 34,
"r": 35,
"s": 36,
"@@": 37,
"Cl": 38,
"Br": 39,
"Sc": 40,
"-3": 41
},
"merges": [
"@ @",
"C l",
"B r",
"S c",
"- 3"
]
}
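One way to double-check where Br comes from, as a sketch (assuming the bpe_tokenizer built in the snippet above):

# The model's own vocabulary (without tokens registered via add_tokens())
model_vocab = bpe_tokenizer.get_vocab(with_added_tokens=False)
print("Br" in model_vocab)   # here: True, because BPE learned the "B r" merge itself

# Saving the tokenizer separates the model's "vocab"/"merges" from "added_tokens",
# which is how excerpts like the ones above can be inspected
bpe_tokenizer.save("tokenizer.json")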
I am not sure what I am doing wrong here. I would appreciate any help.
Hey! When you add a token using tokenizer.add_tokens, it does not add the token to the vocab of the model but to the added_vocab of the tokenizer. Thus it's expected that it does not appear in the vocab.
If you print tokenizer.get_added_tokens_decoder(), you'll see the tokens that were added. If a token is properly added, it will not be the result of a merge.
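A quick way to check this, as a sketch (assuming the bpe_tokenizer from the snippet above):

# Added tokens live in the tokenizer's added vocabulary, separate from the model's vocab
print(bpe_tokenizer.get_added_tokens_decoder())  # e.g. {58: AddedToken("Br", ...), ...}

# If "Br" is properly added, it is matched before the BPE model runs,
# so encoding it should yield a single token rather than a "B" + "r" merge
print(bpe_tokenizer.encode("Br").tokens)         # expected: ['Br']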
Ok, need to investigate this further. Maybe I need to sync wording/meaning: in my understanding, the model (in this case BPE) is part of the tokenizer, as documented here. Besides this discussion about the meaning of words, I think the practical test should be that I try to encode B or r. If the tokenizer/model works "correctly" (in this context, that is, what I am trying to achieve), it should not be able to encode them, since Br (39) is the only correct token. But from looking at my example config, I assume that it could encode the B as 20 and the r as 35. Will try it out.
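A sketch of that practical test (hypothetical; it assumes the bpe_tokenizer and the vocab excerpt shown above):

# Encode a string that should only be representable as the single token "Br" (39).
# If "B" (20) or "r" (35) can be emitted on their own, the tokenizer is not
# behaving the way I want for this domain.
enc = bpe_tokenizer.encode("B")
print(enc.tokens, enc.ids)   # hoping for the unk token "§" (3), fearing ['B'] / [20]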
I am trying to understand the following behaviour. I am not sure if it is a bug or if I am missing something.
For context, I want to find the "best" tokenizer and therefore try a bunch of pre-tokenizers and tokenization models. The pre-tokenizer is always a (varying) RegEx and the tokenization models are WordLevel, BPE, etc.
Please consider that I am still on tokenizers v0.13.4, since I've read about some breaking changes with v0.14.

WordLevel works as expected

I build the WordLevel tokenizer in the following way:
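A minimal sketch of such a WordLevel setup (it mirrors the snippet quoted earlier in the thread; regex_pattern, unk_token, special_tokens and train_source are placeholders):

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Split
from tokenizers.trainers import WordLevelTrainer

# WordLevel model: every pre-tokenized piece becomes exactly one token
wordlevel_tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
# the RegEx pre-tokenizer does the domain-specific splitting, e.g. keeping "Br" together
wordlevel_tokenizer.pre_tokenizer = Split(
    pattern=regex_pattern, behavior="isolated", invert=False
)
trainer = WordLevelTrainer(
    special_tokens=special_tokens, min_frequency=1, show_progress=True
)
wordlevel_tokenizer.train_from_iterator(train_source, trainer=trainer)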
This results in the following vocabulary (excerpt from tokenizer.json): it reflects the split performed by my RegEx. For example, please note the Br token in line 35, which represents Bromine (we are in the chemistry domain). So far, so good.
BPE does not work as expected

Now, I am trying to combine the RegEx from above with the BPE trainer. For that, I build the tokenizer as follows:
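A sketch of that combination, with the same placeholders as above (note that here the BPE tokenizer is trained directly, without pre-populating it with the WordLevel vocab, which only comes up later in the thread):

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Split
from tokenizers.trainers import BpeTrainer

bpe_tokenizer = Tokenizer(BPE(unk_token=unk_token))
# same RegEx split as for the WordLevel tokenizer
bpe_tokenizer.pre_tokenizer = Split(
    pattern=regex_pattern, behavior="isolated", invert=False
)
trainer = BpeTrainer(
    vocab_size=vocab_size,
    special_tokens=special_tokens,
    min_frequency=min_frequency,
    show_progress=True,
)
bpe_tokenizer.train_from_iterator(train_source, trainer=trainer)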
My expectation at this point would be to keep the tokens from the WordLevelTrainer, e.g. Br, and to get additional merges from the BPE algorithm. This is the resulting vocabulary (again, an excerpt from tokenizer.json): as we can see, the tokens created from the RegEx are gone, i.e. Br is split into B on line 21 and r on line 37. It is the BPE algorithm that merges them and adds Br to the vocabulary (line 54).
I would like to see Br as a token created by my RegEx and have it potentially be merged with other (more complicated) tokens not described here, e.g. Br and [C@@H].
Ideally I can fix that behavior (which might work and I just don't know how to do it; all my variations of the code did not give me the results I want) or understand why this happens / is not possible.
Thanks for reading until here :-)