Whisper is not learning a new tokenizer, even when i make test and train dataset the same

P-Sood commented 11 months ago

System Info

transformers version: 4.35.2
Platform: Linux-5.15.120+-x86_64-with-glibc2.35
Python version: 3.10.12
Huggingface_hub version: 0.19.3
Safetensors version: 0.4.0
Accelerate version: 0.24.1
Accelerate config: not found
PyTorch version (GPU?): 2.1.0+cu118 (False)
Tensorflow version (GPU?): 2.14.0 (False)
Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
Jax version: 0.4.20
JaxLib version: 0.4.20
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: No

Who can help?

@sanchit-gandhi

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

Hello, I want to take the audio at my workplace and transform it into a transcription; however, with base whisper, it seems as though it isn't that good. So, I have been wanting to create my own tokenizer that can understand jargon and output that jargon better. Stuff similar to acronyms. Below I have shown my steps in 1) Creating Tokenizer 2) Preprocessing data pipeline 3) Model init, and configuration 4) Model outputs

I run this using huggingface trainer, with the generate option. Is it my data size? i have scoured online to try and find some sort of solution but they all just say it works. I am at my wits end and would appreciate any help on getting this tokenizer to learn my jargon.

Thank you in advance :)

Creating the tokenizer

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers
from transformers import WhisperTokenizer

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Pre-tokenizer responsible for converting the text to a stream of characters
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()#ByteLevel(add_prefix_space=False)

# Decoder responsible for converting the tokens back to a string
tokenizer.decoder = decoders.ByteLevel()

# Trainer responsible for training the BPE model
tokenizer.trainers = trainers.BpeTrainer(vocab_size=1000, min_frequency=2 , special_tokens=spec_tok)

# Training the tokenizer
tokenizer.train(["file.txt"])

# Save the tokenizer
tokenizer.save("NewWhisperTokenizer.json")

f = open('NewWhisperTokenizer.json')

# returns JSON object as
# a dictionary
data = json.load(f)
with open("vocab.json", "w") as outfile:
    json.dump(data['model']['vocab'], outfile)
with open("merges.txt", "w") as outfile:
    json.dump(data['model']['merges'], outfile)

tokenizer = WhisperTokenizer("vocab.json", "merges.txt" , errors = "replace", unk_token = "<|endoftext|>", bos_token = "<|endoftext|>", eos_token = "<|endoftext|>", pad_token = "<|endoftext|>")
tokenizer.add_special_tokens(WhisperTokenizer.from_pretrained("openai/whisper-tiny").special_tokens_map_extended)
tokenizer.save_pretrained("new_tok")

len(tokenizer) == 193

Preprocessing steps

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    temp_labels = tokenizer(batch["phonetic_detail"]["utterance"]).input_ids
    batch["label"] = [label for sentence_labels in temp_labels for label in sentence_labels]
    return batch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    tokenizer: Any
    feature_extractor: Any
    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.feature_extractor.pad(input_features, return_tensors="pt")
        label_features = [{"input_ids": feature["label"]} for feature in features]
        labels_batch = self.tokenizer.pad(label_features, return_tensors="pt")

        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        if (labels[:, 0] == self.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]
        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(tokenizer , feature_extractor)

len(train_dataset) == 4000 len(test_dataset) == 1000

Model Config

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

voc = tokenizer.get_vocab()

model_Gen = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
model_Gen = model_Gen.to(device)

model_Gen.resize_token_embeddings(len(tokenizer))

model_Gen.config.pad_token_id = tokenizer.pad_token_id
model_Gen.config.decoder_start_token_id = voc['<|startoftranscript|>']
model_Gen.config.eos_token_id = tokenizer.eos_token_id
model_Gen.config.bos_token_id = tokenizer.bos_token_id
model_Gen.config.suppress_tokens = []
model_Gen.config.forced_decoder_ids = None
model_Gen.config.begin_suppress_tokens = [
    tokenizer.pad_token_id
  ]

model_Gen.generation_config.pad_token_id = tokenizer.pad_token_id
model_Gen.generation_config.decoder_start_token_id = voc['<|startoftranscript|>']
model_Gen.generation_config.eos_token_id = tokenizer.eos_token_id
model_Gen.generation_config.bos_token_id = tokenizer.bos_token_id
model_Gen.generation_config.suppress_tokens = []
model_Gen.generation_config.forced_decoder_ids = None
model_Gen.generation_config.begin_suppress_tokens = [
    tokenizer.pad_token_id
  ]

model_Gen.generation_config.no_timestamps_token_id = voc['<|notimestamps|>']

Huggingface Trainer

Here I have made the dataset the same 30 examples to see if it would give me complete overprediction, but even with setting train and test to be the same, it is not overfitting at all.

training_args = Seq2SeqTrainingArguments(
  output_dir='training_output',
  logging_dir='./logs',
  group_by_length=True,
  per_device_train_batch_size=1,
  gradient_accumulation_steps=1,  # increase by 2x for every 2x decrease in batch size
  per_device_eval_batch_size=1,
  num_train_epochs=8,
  gradient_checkpointing=True,
  lr_scheduler_type = "cosine_with_restarts",
  save_strategy='epoch',
  evaluation_strategy='epoch',
  logging_strategy='epoch',
  learning_rate=1e-2,
  weight_decay=0.005,
  # warmup_steps=36,
  save_total_limit=4,
  push_to_hub=False,
  predict_with_generate=True,
  generation_max_length=225,
  load_best_model_at_end=True,
  greater_is_better=False,
  generation_num_beams = 4,
  # fp16 = True,

  report_to="wandb", # Turn this off for pdb debug

)

trainer = CustomTrainer(
    compute_metrics=compute_metrics,
    args=training_args,
    model=model_Gen,
    data_collator=data_collator,
    tokenizer=processor.feature_extractor,
    train_dataset=new_test['test'] ,
    eval_dataset=new_test['test'],
)

trainer.evaluate()

Outputs after second epoch

tokenizer.batch_decode(pred.predictions , skip_special_tokens = True)
['', '', 'uwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuw', 'k', '', 'k', 'kkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkkk', 
'awawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawaw', 'awawawaw', '', '', '', 'jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj', '', 'jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj', 'uweuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuwuw', '', 
'axaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxaxax', '', 
'kuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhkuhk', 
'eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee', 
'eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee',
 'awawawaw', 
'eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee', 
'awawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawawaw',
 '', 
'jjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj']

Expected behavior

More understandable text descriptions

ArthurZucker commented 11 months ago

Hey 🤗 thanks a lot for opening an issue and using transformers!

We try to keep the github issues for bugs/feature requests. Could you ask your question on the forum instead? I'm sure the community will be of help!

Otherwise you should follow the tutorial ressources on how to train a whisper model see:

Thanks!

P-Sood commented 11 months ago

Hello @ArthurZucker I shall post it on the huggingface forums as you request.

I saw that second post with training on the custom tokenizer. However, the fix they used was to switch it back to the regular pretrained tokenizer and just train for longer. So that doesn't seem like it would have too much effect on me.

The other issue I looked at here was on the huggingface bugs page so I decided to post it here as well.

They also had a similar issue, but they needed help to get the model to train, and had no information on the results after the code was correct. Maybe I should leave a comment at the author of that issue, seeing if he got it work.

Anyways, thanks for the info, ill post it on the forums.

ArthurZucker commented 11 months ago

I am not sure why you need to train a new tokenizer but I don't recommend it. You are completely losing the mapping from input_ids and tokens, thus the preptrained model is rendered useless. You should add tokens to the tokenizers rather than train a new one from scratch if you want to leverage the pretrained checkpoint

sanchit-gandhi commented 11 months ago

Do you know ahead of time what the kind of jargon is? You could first try Whisper prompting by putting your 'jargon' as the prompt:

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
input_features = processor(input_speech, return_tensors="pt").input_features

# --- Without prompt ---
prompt_ids = processor.get_prompt_ids("Leighton")
output_without_prompt = model.generate(input_features)
print(processor.decode(output_without_prompt[0]))
# "<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Layton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"

# --- With prompt ---
prompt_ids = processor.get_prompt_ids("Leighton")
output_with_prompt = model.generate(input_features, prompt_ids=prompt_ids)
print(processor.decode(output_with_prompt[0]))
# "<|startofprev|> Leighton<|startoftranscript|><|en|><|transcribe|><|notimestamps|> He has grave doubts whether Sir Frederick Leighton's work is really Greek after all and can discover in it but little of Rocky Ithaca.<|endoftext|>"

Your next best method would be fine-tuning using the original tokenizer on your dataset, using as much data as possible: https://huggingface.co/blog/fine-tune-whisper

If you're in a low-data regime, freezing the encoder is recommended. Call this line before you do trainer.train():

model.freeze_encoder()

After that, see this issue for recommendations for custom vocabulary: https://discuss.huggingface.co/t/adding-custom-vocabularies-on-whisper/29311?u=nbroad. Note that this will require more data than standard fine-tuning, so you should be completely sure standard fine-tuning with the original tokenizer doesn't work before trying this. Also note that as @ArthurZucker mentioned, it is not recommended to completely reset the tokenizer, but rather append the new vocabulary to the tokenizer.

github-actions[bot] commented 10 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.