helpmefindaname / transformer-smaller-training-vocab

Temporarily remove unused tokens during training to save RAM and speed up training.
https://helpmefindaname.github.io/transformer-smaller-training-vocab/
MIT License

KeyError: '<unk>' #15

Closed: david-waterworth closed this issue 1 week ago

david-waterworth commented 4 weeks ago

I think this is a really clever idea, especially the use of a context manager. I tried it with a sentence-transformers model, but I'm getting an error: KeyError: '<unk>'

import transformers
from transformer_smaller_training_vocab import reduce_train_vocab

model = transformers.AutoModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")
tokenizer = transformers.AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

# Entering the context raises KeyError: '<unk>' with this tokenizer.
with reduce_train_vocab(model=model, tokenizer=tokenizer, texts=["ABC", "123"]):
    pass

I'm also getting an error due to the use of torch.nn.modules.sparse.Embedding in sentence-transformers.

helpmefindaname commented 2 weeks ago

Hi @david-waterworth, thank you for this report. The issue lies in the tokenizer; when you print it, you see:

MPNetTokenizerFast(name_or_path='sentence-transformers/all-mpnet-base-v2', vocab_size=30527, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '[UNK]', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
        0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
        104: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        30526: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}

There you can see that some added tokens are considered special but somehow do not end up in the special_tokens list from Hugging Face. This should be fixed now in https://github.com/helpmefindaname/transformer-smaller-training-vocab/pull/16. Can you please try it out yourself?
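For illustration, a minimal sketch of the mismatch (assuming a transformers version recent enough to expose the added_tokens_decoder attribute printed above):

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")

# Tokens the tokenizer declares as special.
declared_special = set(tokenizer.all_special_tokens)

# Tokens flagged special=True in added_tokens_decoder.
flagged_special = {tok.content for tok in tokenizer.added_tokens_decoder.values() if tok.special}

# Prints {'<unk>'}: flagged as special, but missing from the declared special tokens.
print(flagged_special - declared_special)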

Sparse embeddings are not supported by this library, but I don't see them being used in the example you provided.

david-waterworth commented 2 weeks ago

Oh right, yeah, it looks like the original model's special tokens are messed up: special_tokens has 'unk_token': '[UNK]', but added_tokens_decoder contains both <unk> and [UNK]. The ordinal for [UNK] (104) implies it was "discovered" during training rather than added as a special token (or it wouldn't be mixed into the vocab)!

I'll look closer at the sparse embedding. I wonder if the sentence-transformers trainer (which is derived from the Hugging Face trainer) is replacing the embeddings with a sparse version? That seems odd though.

Also, have you considered using this approach to "distill" a model, i.e. not restoring the OOV embeddings after training and instead somehow pruning the merges in the tokeniser (assuming a BPE-based model) so that only merges that lead to training-vocab tokens are retained?

helpmefindaname commented 2 weeks ago

Also, have you considered using this approach to "distill" a model, i.e. not restoring the OOV embeddings after training and instead somehow pruning the merges in the tokeniser (assuming a BPE-based model) so that only merges that lead to training-vocab tokens are retained?

I have a friend who was continuously pre-training a model on a huge corpus using only the reduced vocab. He didn't need to restore the full vocab for two reasons:

It is easy to do this with this library when not using the context manager:

from transformer_smaller_training_vocab import reduce_train_vocab_and_context

texts = ...      # your training texts
model = ...      # the pretrained model
tokenizer = ...  # its tokenizer

# Reduces the vocab in place; since the context manager is not used, the full vocab is not restored afterwards.
reduce_train_vocab_and_context(model, tokenizer, texts)

However, I recommend considering:

But you might want to look into: https://github.com/asahi417/lm-vocab-trimmer

david-waterworth commented 2 weeks ago

Thanks. I'm really using a pre-trained model to bootstrap training on a very different corpus (building control sensor tags, which are labelled in a very idiosyncratic manner: little to no whitespace, lots of abbreviations, etc.). I found that starting from a trained LM (with additional pre-tokenization regexes I "inject") is still far more effective than training my own, despite only using ~5k tokens, so the idea of trimming down the vocab appeals! I have quite a large corpus (10M+ short texts), of which I usually fine-tune on a very small fraction, but I can use the entire set to reduce the vocab. I'll see how it goes.

david-waterworth commented 2 weeks ago

@helpmefindaname that works!

Gathered 5191 of total 30527
INFO:transformer_smaller_training_vocab:Gathered 5191 of total 30527
Reducing vocab size by 82.9954%
INFO:transformer_smaller_training_vocab:Reducing vocab size by 82.9954%
Reducing model size by 17.7721%
INFO:transformer_smaller_training_vocab:Reducing model size by 17.7721%
Reducing training parameter count by 17.7721%
INFO:transformer_smaller_training_vocab:Reducing training parameter count by 17.7721%
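
As a rough sanity check, the percentages line up with the 768-dimensional embedding matrix of all-mpnet-base-v2 (the total parameter count below is implied from the reported figure rather than measured):

kept, total_vocab, hidden = 5191, 30527, 768

vocab_reduction = 1 - kept / total_vocab
print(f"vocab reduction: {vocab_reduction:.4%}")                       # 82.9954%

removed_embedding_params = (total_vocab - kept) * hidden
implied_total_params = removed_embedding_params / 0.177721             # from the log above
print(f"removed embedding parameters: {removed_embedding_params:,}")   # 19,458,048
print(f"implied total parameters: {implied_total_params:,.0f}")        # roughly 109.5M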

Not sure why I was also seeing a torch.nn.modules.sparse.Embedding error.
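
For what it's worth, torch.nn.Embedding itself is defined in the torch.nn.modules.sparse module, so the class name in that error may not indicate an actual sparse embedding; a minimal check:

import torch

# torch.nn.Embedding is defined in torch/nn/modules/sparse.py, so both names
# refer to the same (dense by default) embedding class.
print(torch.nn.modules.sparse.Embedding is torch.nn.Embedding)  # True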