huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Weird evaluation result when using distributed training #15107

Closed nguyenvulebinh closed 2 years ago

nguyenvulebinh commented 2 years ago

Environment info

Who can help

@patrickvonplaten, @anton-l

Information

Model I am using (Bert, XLNet ...):

The problem arises when using:

The tasks I am working on are:

To reproduce

Steps to reproduce the behavior:

I use the script run_speech_recognition_ctc.py with my custom dataset, which has the same format as the official Common Voice dataset. In both cases, single GPU and multiple GPUs, the training loss looks fine, but the evaluation results are very weird.

After training the model with multiple GPUs, I took the last checkpoint and evaluated it separately, and the resulting WER was totally fine. So I suspect that something goes wrong when the evaluation results from the multiple GPUs are combined.

  1. Trained with multiple GPUs. [Screenshot: Screen Shot 2022-01-11 at 17 21 03]
  2. Trained with a single GPU (in progress). [Screenshot: Screen Shot 2022-01-11 at 17 32 07]
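For reference, a minimal single-process re-evaluation of a saved checkpoint might look like the sketch below. The checkpoint path, the use of "common_voice"/"vi", and the split name are placeholders, not the exact setup from this issue:

# Minimal single-GPU sanity check for a saved CTC checkpoint (sketch only).
import torch
from datasets import Audio, load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "./wav2vec2-large-vlsp2020/checkpoint-XXXX"  # hypothetical: path to the last saved checkpoint
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint).eval().to("cuda")

# Placeholder dataset; the issue uses a custom dataset in the same format.
dataset = load_dataset("common_voice", "vi", split="test")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
wer_metric = load_metric("wer")

def transcribe(batch):
    inputs = processor(batch["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda")).logits
    batch["prediction"] = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
    return batch

dataset = dataset.map(transcribe)
# Text normalization is omitted for brevity.
print("WER:", wer_metric.compute(predictions=dataset["prediction"], references=dataset["sentence"]))

If this gives a reasonable WER while the in-training distributed evaluation does not, that points at the step where predictions and labels are gathered and combined across processes.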

Expected behavior

patrickvonplaten commented 2 years ago

Hey @nguyenvulebinh,

Could you please share the bash command that you used to start your training? Also, how many GPUs are you using? 2?

patrickvonplaten commented 2 years ago

BTW, we'll announce an event tomorrow where we'll teach you how to train Wav2Vec2 - see: https://github.com/huggingface/transformers/tree/master/examples/research_projects/xls_r. Watch out for the sign-up form if you would like to train wav2vec2 models with us :-)

nguyenvulebinh commented 2 years ago

Here is the command I used to train the model on 2 GPUs. The arguments for the Python script are listed below; I put them directly into the Python file rather than passing them on the command line.

CUDA_VISIBLE_DEVICES=2,4 python -m torch.distributed.launch --nproc_per_node 2 run_speech_recognition_ctc.py
# "--dataset_name", "common_voice",
# "--dataset_config_name", "vi",
"--data_processing_cache_folder", "./data-bin/processed/cache",
"--preprocessing_num_workers", "30",
"--model_name_or_path", "./model-bin/wav2vec_pretrained/large/",
"--output_dir", "./wav2vec2-large-vlsp2020",
"--logging_dir", "./wav2vec2-large-vlsp2020/log",
"--logging_steps", "100",
"--overwrite_output_dir",
"--num_train_epochs", "50",
"--per_device_train_batch_size", "48",
"--gradient_accumulation_steps", "1",
"--learning_rate", "1e-4",
"--warmup_ratio", "{}".format(1/20),
"--evaluation_strategy", "steps",
"--text_column_name", "sentence",
"--save_steps", "5000",
"--eval_steps", "2500",
"--warmup_steps", "5000",
"--layerdrop", "0.1",
"--hidden_dropout", "0.3",
"--save_total_limit", "3",
"--freeze_feature_encoder",
"--delay_epoch_finetune_wav2vec", "1",
"--gradient_checkpointing",
"--fp16",
#"--preprocessing_only",
"--metric_for_best_model", "wer",
"--greater_is_better", "False",
"--group_by_length",
"--length_column_name", "input_length",
"--dataloader_num_workers", "10",
"--do_train",
"--do_eval",
"--ignore_data_skip"

patrickvonplaten commented 2 years ago

I've never seen delay_epoch_finetune_wav2vec before - are you using a custom loop?

nguyenvulebinh commented 2 years ago

Yes, it is a small customization that freezes the wav2vec layers for one epoch before fine-tuning all layers. I do that with a Callback. I don't think it is the cause of the problem.
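For illustration, such a callback could look roughly like the following, assuming it toggles requires_grad on the model's wav2vec2 encoder until a given epoch. This is a reconstruction, not the actual code from this issue:

from transformers import TrainerCallback

class DelayedWav2Vec2FinetuneCallback(TrainerCallback):
    """Freeze the wav2vec2 encoder for the first `delay_epochs` epochs, then unfreeze it."""

    def __init__(self, delay_epochs=1):
        self.delay_epochs = delay_epochs

    def on_epoch_begin(self, args, state, control, model=None, **kwargs):
        # state.epoch counts completed epochs as a float; keep the encoder frozen
        # until the configured number of epochs has passed.
        freeze = state.epoch is not None and state.epoch < self.delay_epochs
        for param in model.wav2vec2.parameters():
            param.requires_grad = not freeze

It would then be registered on the Trainer with trainer.add_callback(DelayedWav2Vec2FinetuneCallback(delay_epochs=1)).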

patrickvonplaten commented 2 years ago

Hmm, this makes it very difficult to guess possible errors here if it's a custom loop. Could you maybe ask for help on the forum instead: https://discuss.huggingface.co/?

github-actions[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.