I've assigned Patrick, but looking at the docs of Wav2Vec2, it says:
Wav2Vec2 models that have set config.feat_extract_norm == "group", such as wav2vec2-base, have not been trained using attention_mask. For such models, input_values should simply be padded with 0 and no attention_mask should be passed.
For Wav2Vec2 models that have set config.feat_extract_norm == "layer", such as wav2vec2-lv60, attention_mask should be passed for batched inference.
It seems like the pre-training script currently only supports models that were pre-trained using an attention mask, such as patrickvonplaten/wav2vec2-base-libri-100h.
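As a quick illustration (my own minimal sketch, not from this thread; the checkpoint names and the 16 kHz dummy inputs are assumptions), the feature extractor behaves differently for the two norm types:

from transformers import Wav2Vec2FeatureExtractor

raw_speech = [[0.1] * 16000, [0.2] * 8000]  # two dummy utterances of different lengths

# "group"-norm checkpoint (wav2vec2-base): inputs are simply zero-padded, no attention_mask is returned
extractor_base = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
batch = extractor_base(raw_speech, sampling_rate=16000, padding=True, return_tensors="pt")
print("attention_mask" in batch)  # False: return_attention_mask defaults to False for this checkpoint

# "layer"-norm checkpoint (wav2vec2-large-lv60): an attention_mask is returned and should be passed to the model
extractor_lv60 = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-large-lv60")
batch = extractor_lv60(raw_speech, sampling_rate=16000, padding=True, return_tensors="pt")
print("attention_mask" in batch)  # True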
@NielsRogge
Got it! It works well now. Thank you for your advice!
@NielsRogge The training process starts normally, but the loss stops decreasing after ~300 steps. I have tried different datasets, including English and Chinese data. Could you help me check it? I appreciate it very much!
{'loss': 4.0485, 'learning_rate': 3.3333333333333335e-05, 'epoch': 0.07}
{'loss': 3.7386, 'learning_rate': 3.5000000000000004e-05, 'epoch': 0.07}
{'loss': 1.5081, 'learning_rate': 3.6666666666666666e-05, 'epoch': 0.07}
{'loss': 4.2322, 'learning_rate': 3.8333333333333334e-05, 'epoch': 0.08}
{'loss': 4.1046, 'learning_rate': 4e-05, 'epoch': 0.08}
{'loss': 3.2526, 'learning_rate': 4.1666666666666665e-05, 'epoch': 0.08}
{'loss': 1.5949, 'learning_rate': 4.3333333333333334e-05, 'epoch': 0.09}
{'loss': 0.0013, 'learning_rate': 4.4999999999999996e-05, 'epoch': 0.09}
{'loss': 0.0013, 'learning_rate': 4.666666666666667e-05, 'epoch': 0.09}
{'loss': 0.0013, 'learning_rate': 4.8333333333333334e-05, 'epoch': 0.1}
{'loss': 0.0013, 'learning_rate': 5e-05, 'epoch': 0.1}
{'loss': 0.0013, 'learning_rate': 5.1666666666666664e-05, 'epoch': 0.1}
{'loss': 0.0013, 'learning_rate': 5.333333333333334e-05, 'epoch': 0.11}
{'loss': 0.0013, 'learning_rate': 5.5e-05, 'epoch': 0.11}
{'loss': 0.0013, 'learning_rate': 5.6666666666666664e-05, 'epoch': 0.11}
{'loss': 0.0014, 'learning_rate': 5.833333333333333e-05, 'epoch': 0.12}
{'loss': 0.0013, 'learning_rate': 6e-05, 'epoch': 0.12}
{'loss': 0.0013, 'learning_rate': 6.166666666666667e-05, 'epoch': 0.12}
{'loss': 0.0013, 'learning_rate': 6.333333333333335e-05, 'epoch': 0.13}
{'loss': 0.0013, 'learning_rate': 6.500000000000001e-05, 'epoch': 0.13}
{'loss': 0.0013, 'learning_rate': 6.666666666666667e-05, 'epoch': 0.13}
{'loss': 0.0013, 'learning_rate': 6.833333333333333e-05, 'epoch': 0.14}
Btw, others have the same problem. Refer to https://discuss.huggingface.co/t/why-is-wav2vec-pretraining-loss-not-decreasing/8112
Hello, I'm facing the same problem pretraining my model from the English base model. Have you solved it?
Hey guys,
I think this is a good example of what it looks like when the contrastive_loss function collapses and training becomes useless. If you see an instant drop to 0.0013, it means the training didn't work. I've seen this countless times in my tests, and there is no very easy fix IMO.
What seems to work best to counteract this is to do the following in this line: https://github.com/huggingface/transformers/blob/4c99e553c152ce9b709d7c138379b0b126ed2fa1/examples/pytorch/speech-pretraining/run_wav2vec2_pretraining_no_trainer.py#L327
Replace:
mask_time_indices=mask_time_indices,
with:
mask_time_indices=batch["sub_attention_mask"]
This is known to make training more robust, though it seems to give slightly worse results.
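To make the suggestion concrete, here is a rough sketch of the change in context (the surrounding argument names are approximated from the pretraining script, not quoted verbatim):

# before: the contrastive loss is computed only over the sampled masked time steps
outputs = model(
    batch["input_values"],
    attention_mask=batch["sub_attention_mask"],
    mask_time_indices=mask_time_indices,
    sampled_negative_indices=batch["sampled_negative_indices"],
)

# after: pass the feature-level attention mask instead, i.e. treat every
# non-padded frame as a masked position when computing the loss
outputs = model(
    batch["input_values"],
    attention_mask=batch["sub_attention_mask"],
    mask_time_indices=batch["sub_attention_mask"],
    sampled_negative_indices=batch["sampled_negative_indices"],
)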
Also, I think SpeechBrain is working quite a bit on making Wav2Vec2 pretraining more robust and general. As far as I know, those guys have done many more pretraining experiments than I have, so it might be worth checking out their pretraining script as well.
cc @TParcollet
I'm hoping to find some time over the Christmas holidays to dive a bit deeper into wav2vec2 pretraining again and then, at some point, put together a comprehensive guide on how to pretrain wav2vec2. I'm really not sure whether I'll find the time, though.
Hi. The %_mask_idx I got is very low; I wonder if you changed mask_prob in the configuration file from 0.05 to 0.5?
For passing, the mask_prob should be around 0.65.
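If it helps, here is a minimal sketch of where that knob lives (assuming the mask_prob mentioned above corresponds to mask_time_prob in Wav2Vec2Config):

from transformers import Wav2Vec2Config, Wav2Vec2ForPreTraining

config = Wav2Vec2Config.from_pretrained("facebook/wav2vec2-base")
config.mask_time_prob = 0.65  # the value suggested above; fine-tuning configs often use 0.05
model = Wav2Vec2ForPreTraining(config)  # freshly initialized model with the higher masking probability
print(model.config.mask_time_prob)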
FYI, I ran into the same issue (missing attention_mask in the pre-trained model) while saving my model on a custom dataset for Greek emotion classification using Wav2Vec2 from this notebook:
Changing the model to 'facebook/wav2vec2-large-960h-lv60-self' helped.
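For context (a quick check written as a sketch, not part of the original comment): the lv60 checkpoints use layer norm in the feature encoder, so their feature extractor returns an attention_mask and the KeyError reported in this issue does not occur.

from transformers import Wav2Vec2Config, Wav2Vec2FeatureExtractor

name = "facebook/wav2vec2-large-960h-lv60-self"
print(Wav2Vec2Config.from_pretrained(name).feat_extract_norm)                 # "layer"
print(Wav2Vec2FeatureExtractor.from_pretrained(name).return_attention_mask)  # True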
Environment info
transformers version: 4.9.1
Who can help
Models: @patrickvonplaten
Information
Model I am using: Wav2Vec2 (pre-training)
The problem arises when using: https://github.com/huggingface/transformers/blob/master/examples/research_projects/wav2vec2/run_pretrain.py
The task I am working on is:
To reproduce
Steps to reproduce the behavior:
python run_pretrain.py --output_dir="./wav2vec2-base" \
  --num_train_epochs="3" \
  --per_device_train_batch_size="32" \
  --per_device_eval_batch_size="32" \
  --gradient_accumulation_steps="2" \
  --save_total_limit="3" \
  --save_steps="500" \
  --logging_steps="10" \
  --learning_rate="5e-4" \
  --weight_decay="0.01" \
  --warmup_steps="3000" \
  --model_name_or_path="facebook/wav2vec2-base" \
  --dataset_name="timit_asr" \
  --train_split_name="train" \
  --preprocessing_num_workers="4" \
  --max_duration_in_seconds="10.0" \
  --group_by_length \
  --verbose_logging
Expected behavior
Running training
  Num examples = 185
  Num Epochs = 3
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 2
  Total optimization steps = 9
  0% 0/9 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "wav2vec_pretrain.py", line 388, in <module>
    main()
  File "wav2vec_pretrain.py", line 384, in main
    trainer.train()
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1254, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "wav2vec_pretrain.py", line 176, in __call__
    if batch["attention_mask"] is not None:
  File "/usr/local/lib/python3.7/dist-packages/transformers/feature_extraction_utils.py", line 81, in __getitem__
    return self.data[item]
KeyError: 'attention_mask'
Thank you very much!
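Not the fix used in this thread (switching to a layer-norm checkpoint is), but as a defensive sketch the data collator could also guard against the missing key instead of indexing it directly; the guard below is hypothetical, the rest of the collator is assumed to stay as in the script:

# inside the collator's __call__: only read attention_mask if the feature extractor produced it
attention_mask = batch["attention_mask"] if "attention_mask" in batch else None
if attention_mask is not None:
    # ... original logic that relies on the attention mask ...
    pass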