Have you also tried scaling the learning rate according to the number of GPUs? (What I mean is that in multi-GPU training the scheduler is stepped N times, which could account for some of this.)
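For illustration, a minimal sketch of what I mean (the model, optimizer, and scheduler below are placeholders, not the example script): by default, a scheduler passed through `accelerator.prepare` is stepped `num_processes` times for every `scheduler.step()` call, so its trajectory differs from the single-GPU run unless the step counts or learning rate are adjusted.

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder model/optimizer, just to show where the scheduler fits.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

# T_max is a placeholder total-step count. Once this scheduler is prepared,
# each scheduler.step() call advances it num_processes times by default, so
# on 4 GPUs the learning rate moves through the schedule 4x per training
# step compared to the same call in a single-process run.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)
```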
Hi @muellerzr, thanks for the response!
For the experiments above, I have already disabled the learning rate scheduler.
In addition, I have tried adjusting the learning rate with `learning_rate *= accelerator.num_processes`, as suggested in the official performance guideline, but I still see a significant difference in training performance.
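Roughly, this is where the scaling goes in my setup (a simplified sketch; the model, optimizer, and base learning rate are placeholders rather than the actual `cv_example.py` values):

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder base learning rate, scaled by the number of processes as
# suggested in the performance guideline (learning_rate *= 4 on 4 GPUs).
learning_rate = 1e-3
learning_rate *= accelerator.num_processes

# Placeholder model/optimizer standing in for the ones in cv_example.py.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# No learning rate scheduler is created, since it is disabled in these runs.
model, optimizer = accelerator.prepare(model, optimizer)
```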
FYI, here is the result after using `learning_rate *= 4` when training with 4 GPUs:
(shadow) bxiao@ip-10-45-101-134:/sensei-fs/users/bxiao/test_multiGPUs$ accelerate launch --config_file config.yaml ./cv_example.py --data_dir ./images
The following values were not passed to `accelerate launch` and had defaults used instead:
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
0.17.1
0.17.1
0.17.1
0.17.1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1478/1478 [00:35<00:00, 41.07it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 370/370 [00:10<00:00, 35.43it/s]
epoch 0: 75.24
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1478/1478 [00:34<00:00, 42.35it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 370/370 [00:10<00:00, 36.49it/s]
epoch 1: 76.52
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1478/1478 [00:34<00:00, 42.67it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 370/370 [00:10<00:00, 35.66it/s]
epoch 2: 77.33
Thanks, let me try running this today and see what happens
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Information
Tasks
`no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
Reproduction
According to the performance comparison guideline, if we use `batch_size_multi_gpu` in the multi-GPU scenario, then we should get similar performance by using `batch_size_single_gpu = batch_size_multi_gpu * num_GPUs` in the single-GPU scenario.

But when I test the official example code, setting `batch_size=4` for single-GPU training gives much better performance than setting `batch_size=1` for 4-GPU training. I disabled the learning rate scheduler in case the learning rate is stepped in a different manner for distributed training.

What I changed in the official example script: `batch_size=1` for training with 4 GPUs and `batch_size=4` for training with a single GPU (a simplified sketch of where this enters is shown below). FYI, the configs and training output are also below.
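For reference, a simplified sketch of where the per-device batch size enters the dataloader setup (the datasets here are placeholders for the image data used by `cv_example.py`):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder datasets standing in for the actual image datasets.
train_dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
eval_dataset = TensorDataset(torch.randn(16, 8), torch.randint(0, 2, (16,)))

# batch_size is the per-device batch size:
#   batch_size = 1 for the 4-GPU run, batch_size = 4 for the single-GPU run,
# so the effective batch size (per_device * num_GPUs) is 4 in both cases.
batch_size = 1

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size)

train_dataloader, eval_dataloader = accelerator.prepare(train_dataloader, eval_dataloader)
```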
Single GPU:
4 GPUs:
Here is the script I modified:
Expected behavior
When setting the batch size according to `batch_size_single_gpu = batch_size_multi_gpu * num_GPUs`, training with a single GPU should give similar performance to training with multiple GPUs.