helpmefindaname / transformer-smaller-training-vocab

Temporarily remove unused tokens during training to save RAM and speed up training.
https://helpmefindaname.github.io/transformer-smaller-training-vocab/
MIT License

KeyError: '<unk>' #15

Closed: david-waterworth closed this issue 1 week ago

david-waterworth commented 4 weeks ago

I think this is a really clever idea, especially the use of a context manager. I tried it with a sentence-transformers model, but I'm getting an error: KeyError: '<unk>'

import transformers
from transformer_smaller_training_vocab import reduce_train_vocab

model = transformers.AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")
tokenizer = transformers.AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

# Entering the context raises KeyError: '<unk>' with this tokenizer.
with reduce_train_vocab(model=model, tokenizer=tokenizer, texts=["ABC", "123"]):
    pass

I'm also getting an error due to the use of torch.nn.modules.sparse.Embedding in sentence-transformers.

helpmefindaname commented 2 weeks ago

Hi @david-waterworth, thank you for this report. The issue lies in the tokenizer; when you print it, you see:

MPNetTokenizerFast(name_or_path='sentence-transformers/all-mpnet-base-v2', vocab_size=30527, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '[UNK]', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
        0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
        104: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        30526: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}

There you can see that some added tokens are considered special but somehow do not end up in the special_tokens list from Hugging Face. This should be fixed now in https://github.com/helpmefindaname/transformer-smaller-training-vocab/pull/16. Can you please try it out yourself?
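For illustration, a minimal sketch of the mismatch (assuming a transformers version recent enough to expose the added_tokens_decoder attribute printed above):

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

# Tokens the tokenizer declares as special.
declared_special = set(tokenizer.all_special_tokens)

# Tokens flagged special=True in added_tokens_decoder.
flagged_special = {tok.content for tok in tokenizer.added_tokens_decoder.values() if tok.special}

# Prints {'<unk>'}: flagged as special, but missing from the declared special tokens.
print(flagged_special - declared_special)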

Sparse embeddings are not supported by this library, but I don't see them being used in the example you provided.

david-waterworth commented 2 weeks ago

Oh right, yeah, it looks like the original model's special tokens are messed up: special_tokens has 'unk_token': '[UNK]', but added_tokens_decoder contains both <unk> and [UNK]. The ordinal for [UNK] (104) implies it was "discovered" during training rather than added as a special token (or it wouldn't be mixed into the vocab)!

I'll look closer at the sparse embedding. I wonder if the sentence-transformers trainer (which is derived from the Hugging Face trainer) is replacing the embeddings with a sparse version? That seems odd though.

Also, have you considered using this approach to "distill" a model, i.e. not restoring the OOV embeddings after training and instead somehow pruning the merges in the tokeniser (assuming a BPE-based model) so that only merges that lead to training-vocab tokens are retained?

helpmefindaname commented 2 weeks ago

Also, have you considered using this approach to "distill" a model, i.e. not restoring the OOV embeddings after training and instead somehow pruning the merges in the tokeniser (assuming a BPE-based model) so that only merges that lead to training-vocab tokens are retained?

I have a friend who was continuously pre-training a model on a huge corpus using only the reduced vocab. He didn't need to restore the full vocab for two reasons:

It is easy to do this with this library when not using the context manager:

from transformer_smaller_training_vocab import reduce_train_vocab_and_context

texts = ...      # your training texts
model = ...      # the pretrained model
tokenizer = ...  # its tokenizer

# Reduces the vocab in place; since the context manager is not used, the full vocab is not restored afterwards.
reduce_train_vocab_and_context(model, tokenizer, texts)

However, I recommend considering:

But you might want to look into: https://github.com/asahi417/lm-vocab-trimmer

david-waterworth commented 2 weeks ago

Thanks. I'm really using a pre-trained model to bootstrap training on a very different corpus (building control sensor tags, which are labelled in a very idiosyncratic manner: little to no whitespace, lots of abbreviations, etc.). I found that starting from a trained LM (with additional pre-tokenization regexes I "inject") is still far more effective than training my own, despite only using ~5k tokens, so the idea of trimming down the vocab appeals! I have quite a large corpus (10M+ short texts), of which I usually fine-tune on a very small fraction, but I can use the entire set to reduce the vocab. I'll see how it goes.

david-waterworth commented 2 weeks ago

@helpmefindaname that works!

Gathered 5191 of total 30527
INFO:transformer_smaller_training_vocab:Gathered 5191 of total 30527
Reducing vocab size by 82.9954%
INFO:transformer_smaller_training_vocab:Reducing vocab size by 82.9954%
Reducing model size by 17.7721%
INFO:transformer_smaller_training_vocab:Reducing model size by 17.7721%
Reducing training parameter count by 17.7721%
INFO:transformer_smaller_training_vocab:Reducing training parameter count by 17.7721%
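
As a rough sanity check, the percentages line up with the 768-dimensional embedding matrix of all-mpnet-base-v2 (the total parameter count below is implied from the reported figure rather than measured):

kept, total_vocab, hidden = 5191, 30527, 768

vocab_reduction = 1 - kept / total_vocab
print(f"vocab reduction: {vocab_reduction:.4%}")                       # 82.9954%

removed_embedding_params = (total_vocab - kept) * hidden
implied_total_params = removed_embedding_params / 0.177721             # from the log above
print(f"removed embedding parameters: {removed_embedding_params:,}")   # 19,458,048
print(f"implied total parameters: {implied_total_params:,.0f}")        # roughly 109.5M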

Not sure why I was also seeing a torch.nn.modules.sparse.Embedding error.
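
For what it's worth, torch.nn.Embedding itself is defined in the torch.nn.modules.sparse module, so the class name in that error may not indicate an actual sparse embedding; a minimal check:

import torch

# torch.nn.Embedding is defined in torch/nn/modules/sparse.py, so both names
# refer to the same (dense by default) embedding class.
print(torch.nn.modules.sparse.Embedding is torch.nn.Embedding)  # True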