Hey 🤗 thanks a lot for opening an issue and using transformers!
We try to keep the github issues for bugs/feature requests. Could you ask your question on the forum instead? I'm sure the community will be of help!
Otherwise you should follow the tutorial resources on how to train a Whisper model; see:
Thanks!
Hello @ArthurZucker, I shall post it on the Hugging Face forums as you requested.
I saw that second post about training on a custom tokenizer. However, the fix they used was to switch back to the regular pretrained tokenizer and just train for longer, so that doesn't seem like it would help much in my case.
The other issue I looked at was on the Hugging Face bugs page, so I decided to post it here as well.
They had a similar problem, but they needed help just getting the model to train, and there was no information about the results once the code was correct. Maybe I should leave a comment for the author of that issue to see if he got it to work.
Anyway, thanks for the info, I'll post it on the forums.
I am not sure why you need to train a new tokenizer, but I don't recommend it. You would completely lose the mapping between input_ids and tokens, so the pretrained model is rendered useless. You should add tokens to the tokenizer rather than train a new one from scratch if you want to leverage the pretrained checkpoint.
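A minimal sketch of that approach (the model size and the jargon strings below are placeholders, not a recommendation):

from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# Hypothetical jargon/acronyms; replace with your own terms
new_tokens = ["ACME-QX", "defrag-o-matic"]
new_tokens = [t for t in new_tokens if t not in processor.tokenizer.get_vocab()]
processor.tokenizer.add_tokens(new_tokens)

# Grow the decoder embedding matrix so the new ids get (randomly initialised) rows;
# those rows still have to be learned during fine-tuning.
model.resize_token_embeddings(len(processor.tokenizer))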
Do you know the jargon ahead of time? You could first try Whisper prompting by passing your 'jargon' as the prompt:
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Load a sample utterance (the LibriSpeech dummy split used in the transformers docs example)
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
input_speech = ds[3]["audio"]["array"]
input_features = processor(input_speech, sampling_rate=16_000, return_tensors="pt").input_features

# --- Without prompt ---
output_without_prompt = model.generate(input_features)
print(processor.decode(output_without_prompt[0]))
# "<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"
# --- With prompt ---
prompt_ids = processor.get_prompt_ids("Leighton")
output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.decode(output_with_prompt[0]))
# "<|startofprev|> Leighton<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Leighton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"
Your next best method would be fine-tuning using the original tokenizer on your dataset, using as much data as possible: https://huggingface.co/blog/fine-tune-whisper
If you're in a low-data regime, freezing the encoder is recommended. Call this line before you call trainer.train():
model.freeze_encoder()
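For example, a minimal check of what that call does (the Trainer setup itself is as in the fine-tuning blog post above; the checkpoint here is just an assumption):

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.freeze_encoder()  # sets requires_grad=False on all encoder parameters

# Sanity check: only the decoder-side parameters should remain trainable
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")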
After that, see this issue for recommendations for custom vocabulary: https://discuss.huggingface.co/t/adding-custom-vocabularies-on-whisper/29311?u=nbroad. Note that this will require more data than standard fine-tuning, so you should be completely sure standard fine-tuning with the original tokenizer doesn't work before trying this. Also note that as @ArthurZucker mentioned, it is not recommended to completely reset the tokenizer, but rather append the new vocabulary to the tokenizer.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hey! I would recommend you use tokenizer.train_new_from_iterator; see https://huggingface.co/learn/nlp-course/en/chapter6/2 for more details.
The issue is that you might have to train the model as well, which is much more complicated.
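To make that concrete, a rough sketch of training a new tokenizer from the Whisper one (the corpus iterator, vocab_size, and save path below are placeholder assumptions):

from transformers import WhisperTokenizerFast

old_tokenizer = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")

# Placeholder corpus: in practice, iterate over your own transcripts
def get_training_corpus():
    corpus = ["example transcript with in-house jargon", "another transcript with acronyms"]
    for text in corpus:
        yield text

# Trains a new BPE vocabulary with the same pipeline settings as the original tokenizer.
# The special tokens are carried over, but their ids will generally differ from the
# pretrained vocabulary, which is why the pretrained model cannot be reused as-is.
new_tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), vocab_size=1000)
new_tokenizer.save_pretrained("./whisper-jargon-tokenizer")  # hypothetical path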
Is it possible to train a tokenizer using tokenizer.train_new_from_iterator() and avoid model training? Technically, yes, you can train a tokenizer using train_new_from_iterator() without re-training the model, but this usually isn't advisable. The reason is that the tokenizer and the model are tightly coupled. The model is trained on a specific vocabulary, which corresponds to the token IDs generated by the tokenizer. When you train a new tokenizer, the vocabulary and tokenization strategy change, which means that the tokens the model expects and those generated by the new tokenizer might not align. This misalignment leads to incorrect inputs to the model, which in turn can result in poor or nonsensical predictions.
Why did I get an empty prediction with the new tokenizer? You got an empty prediction because the new tokenizer's output doesn't match what the model was trained to process. When you changed the tokenizer, the token IDs and the sequence of tokens fed into the model were different from what the model expects. The model likely received token IDs or sequences it was never trained on, causing it to fail in generating any meaningful output, hence the empty prediction. Additionally, if the special tokens used by the model (like <|endoftext|>, <|startoftranscript|>, etc.) have different IDs in the new tokenizer, the model might misinterpret these tokens, leading to the generation of no output.
Is it okay that <|endoftext|> has different IDs in the old and new tokenizers? No, it's not okay if the model was trained with the assumption that <|endoftext|> has a specific token ID (like 50257) and now it has a different ID (like 0) in the new tokenizer. The model relies on specific token IDs to understand the input correctly. If the IDs are changed, the model's internal mechanisms (which depend on these IDs) will no longer function as intended. This misalignment can cause the model to either generate incorrect predictions or fail entirely, as seen in your case.
Is it okay to have extra special tokens in the new tokenizer's vocabulary? Having extra special tokens in the new tokenizer's vocabulary is fine if the model is designed to recognize and utilize these tokens. However, if the model wasn't trained with these special tokens, they will likely be ignored or cause issues. For instance, if the model encounters these tokens but doesn't know how to interpret them, it may fail to generate appropriate predictions. On the other hand, if these special tokens are necessary for the model's functionality (e.g., indicating language or specific tasks), then having them is crucial. The problem arises when there's a mismatch between the special tokens the model expects and those provided by the tokenizer.
Conclusion: In summary, the root of the issues you're encountering is the misalignment between the tokenizer and the model. When you train a new tokenizer, the token IDs and tokenization strategies change, which can cause the model to malfunction if it was not retrained with this new tokenizer. For the best results, you should either retrain the model with the new tokenizer or, if retraining isn't feasible, stick to using the tokenizer that the model was originally trained with.
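To see that misalignment directly, you can compare the ids the two tokenizers assign to the special tokens (the local path below refers to the hypothetical tokenizer saved in the earlier sketch):

from transformers import WhisperTokenizerFast

old_tok = WhisperTokenizerFast.from_pretrained("openai/whisper-tiny")
new_tok = WhisperTokenizerFast.from_pretrained("./whisper-jargon-tokenizer")  # hypothetical path

# If any of these ids differ, the pretrained model will misread sequences produced
# with the new tokenizer, which is what leads to empty or nonsensical predictions.
for token in ["<|endoftext|>", "<|startoftranscript|>", "<|transcribe|>", "<|notimestamps|>"]:
    print(token, old_tok.convert_tokens_to_ids(token), new_tok.convert_tokens_to_ids(token))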
As I understand it, the "<|endoftext|>" special token id must be the last one (or one of the last ones, if other special tokens are used as well) in the vocab.
This assumption is not necessarily true. The most important thing is that it stays at the same position if you want to re-use the tokenizer.
Now, when training the tokenizer, you don't need the special token, so you should either add it afterwards or pass it to the tokenizer trainer as a special token.
Another thing: you should not use the ByteLevel pre-tokenizer but the normalizer.
If you try to decode id 0, you will see that it is not "<|endoftext|>".
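For example, a quick check against the original Whisper tokenizer:

from transformers import WhisperTokenizer

tok = WhisperTokenizer.from_pretrained("openai/whisper-tiny")
print(tok.decode([0]))                               # id 0 is an ordinary vocabulary token, not "<|endoftext|>"
print(tok.convert_tokens_to_ids("<|endoftext|>"))    # the id the pretrained model actually expects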
FYI @itazap
Honestly it's a bit complicated. TL;DR:
Would you mind sharing what unblocked you?! I am super curious
(sorry my bad)
System Info
transformers version: 4.35.2

Who can help?
@sanchit-gandhi

Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Hello, I want to take audio recorded at my workplace and turn it into transcriptions; however, base Whisper doesn't seem to handle it that well. So I have been wanting to create my own tokenizer that can understand our jargon (things like acronyms) and output it correctly. Below I have shown my steps: 1) creating the tokenizer, 2) the preprocessing data pipeline, 3) model init and configuration, 4) model outputs.
I run this using the Hugging Face Trainer with the generate option. Is it my data size? I have scoured online to try to find some sort of solution, but every post just says it works. I am at my wits' end and would appreciate any help on getting this tokenizer to learn my jargon.
Thank you in advance :)
Creating the tokenizer
len(tokenizer) == 193
Preprocessing steps
len(train_dataset) == 4000
len(test_dataset) == 1000
Model Config
Huggingface Trainer
Here I have made the train and test datasets the same 30 examples to see if it would completely overfit, but even with train and test set to be identical, it is not overfitting at all.
Outputs after second epoch
Expected behavior
More understandable text descriptions