huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Weird evaluation result when using distributed training #15107

Closed nguyenvulebinh closed 2 years ago

nguyenvulebinh commented 2 years ago

Environment info

Who can help

@patrickvonplaten, @anton-l

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

I use the script run_speech_recognition_ctc.py with my custom dataset, which has the same format as the official Common Voice dataset. In both cases, single GPU and multiple GPUs, the training loss looks fine, but the evaluation results are very weird.

After training the model with multiple GPUs, I took the last checkpoint and evaluated it separately, and the resulting WER was totally fine. So I suspect that something goes wrong when the evaluation results from the multiple GPUs are combined.

  1. Trained with multiple GPUs. [Screenshot: Screen Shot 2022-01-11 at 17 21 03]
  2. Trained with a single GPU (in progress). [Screenshot: Screen Shot 2022-01-11 at 17 32 07]
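For reference, a minimal single-process re-evaluation of a saved checkpoint might look like the sketch below. The checkpoint path, the use of "common_voice"/"vi", and the split name are placeholders, not the exact setup from this issue:

# Minimal single-GPU sanity check for a saved CTC checkpoint (sketch only).
import torch
from datasets import Audio, load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "./wav2vec2-large-vlsp2020/checkpoint-XXXX"  # hypothetical: path to the last saved checkpoint
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint).eval().to("cuda")

# Placeholder dataset; the issue uses a custom dataset in the same format.
dataset = load_dataset("common_voice", "vi", split="test")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
wer_metric = load_metric("wer")

def transcribe(batch):
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda")).logits
    batch["prediction"] = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
    return batch

dataset = dataset.map(transcribe)
# Text normalization is omitted for brevity.
print("WER:", wer_metric.compute(predictions=dataset["prediction"], references=dataset["sentence"]))

If this gives a reasonable WER while the in-training distributed evaluation does not, that points at the step where predictions and labels are gathered and combined across processes.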

Expected behavior

patrickvonplaten commented 2 years ago

Hey @nguyenvulebinh,

Could you please share the bash command that you used to start your training? Also, how many GPUs are you using? 2?

patrickvonplaten commented 2 years ago

BTW, we'll announce an event tomorrow where we'll teach you how to train Wav2Vec2 - see: https://github.com/huggingface/transformers/tree/master/examples/research_projects/xls_r. Watch out for the sign-up form if you would like to train wav2vec2 models with us :-)

nguyenvulebinh commented 2 years ago

Here is the command I used to train the model on 2 GPUs. The arguments for the Python script are listed below; I put them directly into the Python file rather than passing them on the command line.

CUDA_VISIBLE_DEVICES=2,4 python -m torch.distributed.launch --nproc_per_node 2 run_speech_recognition_ctc.py
# "--dataset_name", "common_voice",
# "--dataset_config_name", "vi",
"--data_processing_cache_folder", "./data-bin/processed/cache",
"--preprocessing_num_workers", "30",
"--model_name_or_path", "./model-bin/wav2vec_pretrained/large/",
"--output_dir", "./wav2vec2-large-vlsp2020",
"--logging_dir", "./wav2vec2-large-vlsp2020/log",
"--logging_steps", "100",
"--overwrite_output_dir",
"--num_train_epochs", "50",
"--per_device_train_batch_size", "48",
"--gradient_accumulation_steps", "1",
"--learning_rate", "1e-4",
"--warmup_ratio", "{}".format(1/20),
"--evaluation_strategy", "steps",
"--text_column_name", "sentence",
"--save_steps", "5000",
"--eval_steps", "2500",
"--warmup_steps", "5000",
"--layerdrop", "0.1",
"--hidden_dropout", "0.3",
"--save_total_limit", "3",
"--freeze_feature_encoder",
"--delay_epoch_finetune_wav2vec", "1",
"--gradient_checkpointing",
"--fp16",
#"--preprocessing_only",
"--metric_for_best_model", "wer",
"--greater_is_better", "False",
"--group_by_length",
"--length_column_name", "input_length",
"--dataloader_num_workers", "10",
"--do_train",
"--do_eval",
"--ignore_data_skip"

patrickvonplaten commented 2 years ago

I've never seen delay_epoch_finetune_wav2vec before - are you using a custom loop?

nguyenvulebinh commented 2 years ago

Yes, it is a small customization that freezes the wav2vec layers for one epoch before fine-tuning all layers. I do that with a Callback. I don't think it is the cause of the problem.
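For illustration, such a callback could look roughly like the following, assuming it toggles requires_grad on the model's wav2vec2 encoder until a given epoch. This is a reconstruction, not the actual code from this issue:

from transformers import TrainerCallback

class DelayedWav2Vec2FinetuneCallback(TrainerCallback):
    """Freeze the wav2vec2 encoder for the first `delay_epochs` epochs, then unfreeze it."""

    def __init__(self, delay_epochs=1):
        self.delay_epochs = delay_epochs

    def on_epoch_begin(self, args, state, control, model=None, **kwargs):
        # state.epoch counts completed epochs as a float; keep the encoder frozen
        # until the configured number of epochs has passed.
        freeze = state.epoch is not None and state.epoch < self.delay_epochs
        for param in model.wav2vec2.parameters():
            param.requires_grad = not freeze

It would then be registered on the Trainer with trainer.add_callback(DelayedWav2Vec2FinetuneCallback(delay_epochs=1)).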

patrickvonplaten commented 2 years ago

Hmm, this makes it very difficult to guess possible errors here if it's a custom loop. Could you maybe ask for help on the forum instead: https://discuss.huggingface.co/?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.