I've assigned Patrick, but looking at the docs of Wav2Vec2, it says:
Wav2Vec2 models that have set config.feat_extract_norm == "group", such as wav2vec2-base, have not been trained using attention_mask. For such models, input_values should simply be padded with 0 and no attention_mask should be passed.
For Wav2Vec2 models that have set config.feat_extract_norm == "layer", such as wav2vec2-lv60, attention_mask should be passed for batched inference.
It seems like the pre-training script currently only supports models that were pre-trained using an attention mask, such as patrickvonplaten/wav2vec2-base-libri-100h.
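As a quick illustration (my own minimal sketch, not from this thread; the checkpoint names and the 16 kHz dummy inputs are assumptions), the feature extractor behaves differently for the two norm types:

from transformers import Wav2Vec2FeatureExtractor

raw_speech = [[0.1] * 16000, [0.2] * 8000]  # two dummy utterances of different lengths

# "group"-norm checkpoint (wav2vec2-base): inputs are simply zero-padded, no attention_mask is returned
extractor_base = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
batch = extractor_base(raw_speech, sampling_rate=16000, padding=True, return_tensors="pt")
print("attention_mask" in batch)  # False: return_attention_mask defaults to False for this checkpoint

# "layer"-norm checkpoint (wav2vec2-large-lv60): an attention_mask is returned and should be passed to the model
extractor_lv60 = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-lv60")
batch = extractor_lv60(raw_speech, sampling_rate=16000, padding=True, return_tensors="pt")
print("attention_mask" in batch)  # True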
@NielsRogge
Got it! It works well now. Thank you for your advice!
@NielsRogge The training process starts normally, but the loss stops decreasing after ~300 steps. I have tried different datasets, including English and Chinese data. Could you help me check it? I appreciate it very much!
{'loss': 4.0485, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.07}
{'loss': 3.7386, 'learning_rate': 3.5000000000000004e-05, 'epoch': 0.07}
{'loss': 1.5081, 'learning_rate': 3.6666666666666666e-05, 'epoch': 0.07}
{'loss': 4.2322, 'learning_rate': 3.8333333333333334e-05, 'epoch': 0.08}
{'loss': 4.1046, 'learning_rate': 4e-05, 'epoch': 0.08}
{'loss': 3.2526, 'learning_rate': 4.1666666666666665e-05, 'epoch': 0.08}
{'loss': 1.5949, 'learning_rate': 4.3333333333333334e-05, 'epoch': 0.09}
{'loss': 0.0013, 'learning_rate': 4.4999999999999996e-05, 'epoch': 0.09}
{'loss': 0.0013, 'learning_rate': 4.666666666666667e-05, 'epoch': 0.09}
{'loss': 0.0013, 'learning_rate': 4.8333333333333334e-05, 'epoch': 0.1}
{'loss': 0.0013, 'learning_rate': 5e-05, 'epoch': 0.1}
{'loss': 0.0013, 'learning_rate': 5.1666666666666664e-05, 'epoch': 0.1}
{'loss': 0.0013, 'learning_rate': 5.333333333333334e-05, 'epoch': 0.11}
{'loss': 0.0013, 'learning_rate': 5.5e-05, 'epoch': 0.11}
{'loss': 0.0013, 'learning_rate': 5.6666666666666664e-05, 'epoch': 0.11}
{'loss': 0.0014, 'learning_rate': 5.833333333333333e-05, 'epoch': 0.12}
{'loss': 0.0013, 'learning_rate': 6e-05, 'epoch': 0.12}
{'loss': 0.0013, 'learning_rate': 6.166666666666667e-05, 'epoch': 0.12}
{'loss': 0.0013, 'learning_rate': 6.333333333333335e-05, 'epoch': 0.13}
{'loss': 0.0013, 'learning_rate': 6.500000000000001e-05, 'epoch': 0.13}
{'loss': 0.0013, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.13}
{'loss': 0.0013, 'learning_rate': 6.833333333333333e-05, 'epoch': 0.14}
Btw, others have the same problem. Refer to https://discuss.huggingface.co/t/why-is-wav2vec-pretraining-loss-not-decreasing/8112
Hello, I'm facing the same problem pretraining my model from the English base model. Have you solved it?
Hey guys,
I think this is a good example of what it looks like when the contrastive_loss function collapses and training becomes useless. If you see an instant drop to 0.0013, it means the training didn't work. I've seen this countless times in my tests, and there is no very easy fix IMO.
What seems to work best to counteract this is to do the following in this line: https://github.com/huggingface/transformers/blob/4c99e553c152ce9b709d7c138379b0b126ed2fa1/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py#L327
Replace:
mask_time_indices=mask_time_indices,
with:
mask_time_indices=batch["sub_attention_mask"]
This is known to make training more robust, though it seems to give slightly worse results.
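To make the suggestion concrete, here is a rough sketch of the change in context (the surrounding argument names are approximated from the pretraining script, not quoted verbatim):

# before: the contrastive loss is computed only over the sampled masked time steps
outputs = model(
    batch["input_values"],
    attention_mask=batch["sub_attention_mask"],
    mask_time_indices=mask_time_indices,
    sampled_negative_indices=batch["sampled_negative_indices"],
)

# after: pass the feature-level attention mask instead, i.e. treat every
# non-padded frame as a masked position when computing the loss
outputs = model(
    batch["input_values"],
    attention_mask=batch["sub_attention_mask"],
    mask_time_indices=batch["sub_attention_mask"],
    sampled_negative_indices=batch["sampled_negative_indices"],
)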
Also, I think SpeechBrain is working quite a bit on making Wav2Vec2 pretraining more robust and general. As far as I know, those guys have done many more pretraining experiments than I have, so it might be worth checking out their pretraining script as well.
cc @TParcollet
I'm hoping to find some time over the Christmas holidays to dive a bit deeper into wav2vec2 pretraining again and then, at some point, put together a comprehensive guide on how to pretrain wav2vec2. I'm really not sure whether I'll find the time, though.
Hi. The %_mask_idx I got is very low; I wonder if you changed mask_prob in the configuration file from 0.05 to 0.5?
For passing, the mask_prob should be around 0.65.
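If it helps, here is a minimal sketch of where that knob lives (assuming the mask_prob mentioned above corresponds to mask_time_prob in Wav2Vec2Config):

from transformers import Wav2Vec2Config, Wav2Vec2ForPreTraining

config = Wav2Vec2Config.from_pretrained("facebook/wav2vec2-base")
config.mask_time_prob = 0.65  # the value suggested above; fine-tuning configs often use 0.05
model = Wav2Vec2ForPreTraining(config)  # freshly initialized model with the higher masking probability
print(model.config.mask_time_prob)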
FYI, I ran into the same issue (missing attention_mask in the pre-trained model) while saving my model on a custom dataset for Greek emotion classification using Wav2Vec2 from this notebook:
Changing the model to 'facebook/wav2vec2-large-960h-lv60-self' helped.
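For context (a quick check written as a sketch, not part of the original comment): the lv60 checkpoints use layer norm in the feature encoder, so their feature extractor returns an attention_mask and the KeyError reported in this issue does not occur.

from transformers import Wav2Vec2Config, Wav2Vec2FeatureExtractor

name = "facebook/wav2vec2-large-960h-lv60-self"
print(Wav2Vec2Config.from_pretrained(name).feat_extract_norm)                 # "layer"
print(Wav2Vec2FeatureExtractor.from_pretrained(name).return_attention_mask)  # True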
Environment info
transformers version: 4.9.1
Who can help
Models: @patrickvonplaten
Information
Model I am using: Wav2Vec2 (pre-training)
The problem arises when using: https://github.com/huggingface/transformers/blob/master/examples/research_projects/wav2vec2/run_pretrain.py
The task I am working on is:
To reproduce
Steps to reproduce the behavior:
python run_pretrain.py --output_dir="./wav2vec2-base" \
  --num_train_epochs="3" \
  --per_device_train_batch_size="32" \
  --per_device_eval_batch_size="32" \
  --gradient_accumulation_steps="2" \
  --save_total_limit="3" \
  --save_steps="500" \
  --logging_steps="10" \
  --learning_rate="5e-4" \
  --weight_decay="0.01" \
  --warmup_steps="3000" \
  --model_name_or_path="facebook/wav2vec2-base" \
  --dataset_name="timit_asr" \
  --train_split_name="train" \
  --preprocessing_num_workers="4" \
  --max_duration_in_seconds="10.0" \
  --group_by_length \
  --verbose_logging
Expected behavior
Running training
  Num examples = 185
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 2
  Total optimization steps = 9
  0% 0/9 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "wav2vec_pretrain.py", line 388, in <module>
    main()
  File "wav2vec_pretrain.py", line 384, in main
    trainer.train()
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1254, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "wav2vec_pretrain.py", line 176, in __call__
    if batch["attention_mask"] is not None:
  File "/usr/local/lib/python3.7/dist-packages/transformers/feature_extraction_utils.py", line 81, in __getitem__
    return self.data[item]
KeyError: 'attention_mask'
Thank you very much!
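Not the fix used in this thread (switching to a layer-norm checkpoint is), but as a defensive sketch the data collator could also guard against the missing key instead of indexing it directly; the guard below is hypothetical, the rest of the collator is assumed to stay as in the script:

# inside the collator's __call__: only read attention_mask if the feature extractor produced it
attention_mask = batch["attention_mask"] if "attention_mask" in batch else None
if attention_mask is not None:
    # ... original logic that relies on the attention mask ...
    pass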