david-waterworth closed 1 week ago
Hi @david-waterworth, thank you for this report. The issue lies in the tokenizer. When you print it, you see:
MPNetTokenizerFast(name_or_path='sentence-transformers/all-mpnet-base-v2', vocab_size=30527, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '[UNK]', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True), added_tokens_decoder={
0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
104: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
30526: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}
There you can see that some added tokens are marked as special but somehow do not end up in the special_tokens list from huggingface.
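The mismatch described above can be checked mechanically. The following is a minimal sketch in plain Python that uses only the values from the printout; the two dictionaries are hand-copied stand-ins (with a real tokenizer they would presumably come from its special_tokens_map and added_tokens_decoder attributes), and find_unlisted_special_tokens is a hypothetical helper, not part of either library.

```python
# Values copied by hand from the tokenizer printout above (assumption:
# a real tokenizer object would supply the same data).
special_tokens_map = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "[UNK]",
    "sep_token": "</s>",
    "pad_token": "<pad>",
    "cls_token": "<s>",
    "mask_token": "<mask>",
}

# token id -> (token content, value of the special=... flag)
added_tokens = {
    0: ("<s>", True),
    1: ("<pad>", True),
    2: ("</s>", True),
    3: ("<unk>", True),
    104: ("[UNK]", True),
    30526: ("<mask>", True),
}


def find_unlisted_special_tokens(added, special_map):
    """Return added tokens flagged special that the special-token map omits."""
    listed = set(special_map.values())
    return {
        token_id: content
        for token_id, (content, is_special) in added.items()
        if is_special and content not in listed
    }


unlisted = find_unlisted_special_tokens(added_tokens, special_tokens_map)
print(unlisted)  # {3: '<unk>'}
```

For this printout the only such token is "<unk>" at id 3: it is flagged special but is not the unk_token that special_tokens reports.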
This should be fixed now in: https://github.com/helpmefindaname/transformer-smaller-training-vocab/pull/16
Can you please try it out yourself?
Sparse embeddings are not supported by this library, but I don't see them being used in the example you provided.
Oh right, yeah, it looks like the original model's special tokens are messed up: special_tokens has 'unk_token': '[UNK]', but added_tokens_decoder also contains a separate special "<unk>" token (id 3).
I'll look closer at the sparse embedding. I wonder if the sentence-transformer trainer (which is derived from the huggingface trainer) is replacing the embeddings with a sparse version - seems odd though?
Also, have you considered using this approach to "distill" a model - i.e. don't restore the OOV embeddings after training, and instead somehow prune the merges in the tokeniser (assuming a BPE-based model) so that only merges that lead to training vocab tokens are retained?
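To make the merge-pruning idea concrete, here is a rough sketch in plain Python. It is not a feature of this library: prune_merges is a hypothetical helper, the toy merge list is invented for illustration, and it assumes a BPE tokenizer whose merged token is simply the concatenation of the two parts and whose merges are ordered so that a token's parts are created before the token itself.

```python
def prune_merges(merges, kept_tokens):
    """Keep only merges needed to build the tokens in kept_tokens.

    merges: list of (left, right) pairs in priority order.
    kept_tokens: tokens that must still be producible after pruning.
    """
    needed = set(kept_tokens)
    retained = []
    # Walk backwards so a token is seen before the merges that build its parts.
    for left, right in reversed(merges):
        if left + right in needed:
            retained.append((left, right))
            # The parts are now required as intermediate tokens.
            needed.add(left)
            needed.add(right)
    retained.reverse()  # restore the original priority order
    return retained


# Toy merge table (illustrative only, not taken from a real tokenizer).
toy_merges = [
    ("l", "o"), ("lo", "w"), ("e", "r"),
    ("low", "er"), ("n", "e"), ("ne", "w"),
]
pruned = prune_merges(toy_merges, {"lower"})
print(pruned)  # [('l', 'o'), ('lo', 'w'), ('e', 'r'), ('low', 'er')]
```

Note that the intermediate tokens ("lo", "low", "er" here) would have to stay in the vocabulary as well, so the retained vocab can be somewhat larger than the set of tokens seen in training.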
I have a friend who was doing continued pre-training of a model on a huge corpus, using only the reduced vocab. He didn't need to restore the full vocab, for 2 reasons:
It is easy to do this with this library when not using the decorator:
from transformer_smaller_training_vocab import reduce_train_vocab_and_context
texts = ...
model = ...
tokenizer = ...
reduce_train_vocab_and_context(model, tokenizer, texts)
However, I recommend considering:
But you might want to look into: https://github.com/asahi417/lm-vocab-trimmer
Thanks. I'm really using a pre-trained model to bootstrap training on a very different corpus (building control sensor tags, which are labelled in a very idiosyncratic manner, containing little to no whitespace, lots of abbreviations etc.). I found that starting from a trained LM (with additional pre-tokenization regexes I "inject") is still way more effective than training my own, despite only using ~5k tokens, so the idea of trimming down the vocab appeals! I have quite a large corpus (10m+ short texts), of which I usually fine-tune on a very small fraction, but I can use the entire set to reduce the vocab. I'll see how it goes.
@helpmefindaname that works!
Gathered 5191 of total 30527
INFO:transformer_smaller_training_vocab:Gathered 5191 of total 30527
Reducing vocab size by 82.9954%
INFO:transformer_smaller_training_vocab:Reducing vocab size by 82.9954%
Reducing model size by 17.7721%
INFO:transformer_smaller_training_vocab:Reducing model size by 17.7721%
Reducing training parameter count by 17.7721%
INFO:transformer_smaller_training_vocab:Reducing training parameter count by 17.7721%
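As a quick sanity check on the log above, the vocab reduction follows directly from the two token counts. This is only a sketch of the arithmetic with illustrative variable names, not the library's own code:

```python
# Token counts taken from the "Gathered 5191 of total 30527" log line.
kept_tokens = 5191
total_tokens = 30527

# Fraction of vocabulary entries that are dropped.
vocab_reduction = 1 - kept_tokens / total_tokens
vocab_reduction_pct = f"{vocab_reduction:.4%}"
print(vocab_reduction_pct)  # 82.9954%
```

The model-size figure (17.7721%) is much smaller than the vocab figure, presumably because only the embedding matrix shrinks while the rest of the model's parameters are unchanged.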
Not sure why I was also seeing a torch.nn.modules.sparse.Embedding error.
I think this is a really clever idea, especially the use of a context manager. I tried it with a sentence-transformer model but I'm getting an error KeyError: ''
I'm also getting an error due to the use of torch.nn.modules.sparse.Embedding in sentence-transformer.