pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

Learning rate change is not applied at designated iteration with a scheduler #40492

Open cyh767 opened 4 years ago

cyh767 commented 4 years ago

πŸ› Bug

With a scheduler, the reported learning rate changes at the designated iteration, but that iteration still seems to use the learning rate from before the change.

To Reproduce

A minimal example is attached: test_lr.zip

Steps to reproduce the behavior:

  1. Run python main_steplr.py; this applies a torch.optim.lr_scheduler.StepLR with step_size=5 and gamma=0.1 (a sketch of such a loop appears after the logs below). The logs on my computer are as follows:

iteration: 0 learning rate: 0.1 loss = 0.23123088479042053
iteration: 1 learning rate: 0.1 loss = 0.11247935891151428
iteration: 2 learning rate: 0.1 loss = 0.12026116997003555
iteration: 3 learning rate: 0.1 loss = 0.11437922716140747
iteration: 4 learning rate: 0.1 loss = 0.11330760270357132
iteration: 5 learning rate: 0.010000000000000002 loss = 0.11355291306972504
iteration: 6 learning rate: 0.010000000000000002 loss = 0.1208689734339714

  2. Run python main_fixlr.py; this applies a fixed learning rate. The logs on my computer are as follows:

iteration: 0 learning rate: 0.1 loss = 0.23123088479042053
iteration: 1 learning rate: 0.1 loss = 0.11247935891151428
iteration: 2 learning rate: 0.1 loss = 0.12026116997003555
iteration: 3 learning rate: 0.1 loss = 0.11437922716140747
iteration: 4 learning rate: 0.1 loss = 0.11330760270357132
iteration: 5 learning rate: 0.1 loss = 0.11355291306972504
iteration: 6 learning rate: 0.1 loss = 0.12065751850605011
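
For reference, a minimal standalone loop of this kind looks like the following. This is a simplified sketch, not the attached code; the model, data, and loop ordering here are placeholders/assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)                        # placeholder model
x, y = torch.randn(64, 10), torch.randn(64, 1)  # placeholder data
criterion = nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for it in range(7):
    optimizer.zero_grad()
    loss = criterion(model(x), y)   # computed before this iteration's update
    loss.backward()
    print("iteration:", it,
          "learning rate:", optimizer.param_groups[0]["lr"],
          "loss =", loss.item())
    optimizer.step()                # uses the learning rate printed above
    scheduler.step()                # sets the rate for the *next* iteration
```

In a loop with this ordering, the rate printed at iteration 5 is the one used by iteration 5's update, while the loss printed at iteration 5 only reflects the updates made in iterations 0-4.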

Expected behavior

The above steps compare a learning rate that is decreased at iteration 5 against the same learning rate kept fixed.

In step 1, according to the documentation, the learning rate should be 0.1 if iteration < 5, and 0.01 if 5 <= iteration < 10. However, although the learning rate changes at iteration 5 in step 1, the loss at iteration 5 is the same in step 1 and step 2. In other words, at iteration 5, different learning rates lead to the same loss. I think the changed learning rate might not be correctly applied at the designated iteration in step 1.
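
The documented schedule itself can be checked in isolation (a minimal sketch, independent of the attached scripts, assuming the usual optimizer.step()-then-scheduler.step() ordering):

```python
import torch
import torch.nn as nn

optimizer = torch.optim.SGD(nn.Linear(2, 1).parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for it in range(10):
    # Rate the optimizer would use for a step taken during this iteration.
    print("iteration:", it, "learning rate:", scheduler.get_last_lr()[0])
    optimizer.step()
    scheduler.step()
# Expected: 0.1 for iterations 0-4 and 0.01 (up to floating point) for 5-9.
```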

Environment

PyTorch version: 1.5.0
Is debug build: No
CUDA used to build PyTorch: 10.2

OS: Microsoft Windows 10 Pro
GCC version: (Rev5, Built by MSYS2 project) 5.3.0
CMake version: version 3.15.5

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.2.89
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin\cudnn64_7.dll

Versions of relevant libraries:
[pip] numpy==1.18.1
[pip] numpydoc==0.9.1
[pip] torch==1.5.0
[pip] torchvision==0.6.0
[conda] blas 1.0 mkl defaults
[conda] mkl 2019.4 245 defaults
[conda] mkl-service 2.3.0 py37hb782905_0 defaults
[conda] mkl_fft 1.0.14 py37h14836fe_0 defaults
[conda] mkl_random 1.1.0 py37h675688f_0 defaults
[conda] pytorch 1.5.0 py3.7_cuda102_cudnn7_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchvision 0.6.0 py37_cu102 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch

cc @vincentqb

vincentqb commented 4 years ago

When running your code with gamma=10, we see the loss eventually becoming nan, so the learning rate is indeed affecting the stepping. Have you observed convergence with different models?
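
A minimal sketch of that kind of check, against a placeholder model and data rather than the attached scripts: with gamma > 1 the scheduled rate grows every 5 iterations, so if optimizer.step() really uses it, the loss should eventually blow up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)                        # placeholder model
x, y = torch.randn(64, 10), torch.randn(64, 1)  # placeholder data
criterion = nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=10)

for it in range(30):
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    print("iteration:", it,
          "learning rate:", optimizer.param_groups[0]["lr"],
          "loss =", loss.item())
    optimizer.step()
    scheduler.step()
# Once the rate has grown to 1, 10, 100, ... the loss diverges to inf/nan,
# which shows the scheduled rate is being used by optimizer.step().
```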

cyh767 commented 4 years ago

Thank you for your reply. I agree that the stepping is applied: I have observed near-convergence with more complex models, and I think the model would converge with more iterations. But similarly, it seems the model is not learning with the changed learning rate at the designated step. It is the same as in the example above: the learning rate changes at iteration 5, but the loss at that step is the same as the loss with a fixed learning rate.

Moreover, I have found a clearer case today. With the same code and the same steps, I set initial_learning_rate = 10 and gamma = 0.1 in both main_steplr.py and main_fixlr.py, with the other settings unchanged. I see the following behavior:

  1. Run python main_steplr.py; this time the settings are initial_learning_rate = 10, step_size=5 and gamma=0.1 with a torch.optim.lr_scheduler.StepLR. The logs on my computer are as follows:

iteration: 0 learning rate: 10 loss = 0.23123088479042053
iteration: 1 learning rate: 10 loss = 0.11434487253427505
iteration: 2 learning rate: 10 loss = 0.11738215386867523
iteration: 3 learning rate: 10 loss = 0.11591885983943939
iteration: 4 learning rate: 10 loss = 0.11567183583974838
iteration: 5 learning rate: 1.0 loss = 0.11595271527767181
iteration: 6 learning rate: 1.0 loss = 0.1214301735162735
iteration: 7 learning rate: 1.0 loss = 0.1200874075293541
iteration: 8 learning rate: 1.0 loss = 0.1178327202796936
iteration: 9 learning rate: 1.0 loss = 0.1179950013756752
iteration: 10 learning rate: 0.1 loss = 0.11520280689001083
iteration: 11 learning rate: 0.1 loss = 0.12716393172740936
iteration: 12 learning rate: 0.1 loss = 0.12422306835651398
iteration: 13 learning rate: 0.1 loss = 0.13028551638126373
iteration: 14 learning rate: 0.1 loss = 0.12836600840091705
iteration: 15 learning rate: 0.010000000000000002 loss = 0.11887713521718979
iteration: 16 learning rate: 0.010000000000000002 loss = 0.1185910627245903
iteration: 17 learning rate: 0.010000000000000002 loss = 0.11371660977602005
iteration: 18 learning rate: 0.010000000000000002 loss = 0.10454636812210083
iteration: 19 learning rate: 0.010000000000000002 loss = 0.12165459990501404
iteration: 20 learning rate: 0.0010000000000000002 loss = 0.11898376792669296
iteration: 21 learning rate: 0.0010000000000000002 loss = 0.1097453162074089

  2. Run python main_fixlr.py; this applies a fixed initial_learning_rate = 10. The logs are:

iteration: 0 learning rate: 10 loss = 0.23123088479042053
iteration: 1 learning rate: 10 loss = 0.11434487253427505
iteration: 2 learning rate: 10 loss = 0.11738215386867523
iteration: 3 learning rate: 10 loss = 0.11591885983943939
iteration: 4 learning rate: 10 loss = 0.11567183583974838
iteration: 5 learning rate: 10 loss = 0.11595271527767181
iteration: 6 learning rate: 10 loss = 0.1214301735162735
iteration: 7 learning rate: 10 loss = 0.1200874075293541
iteration: 8 learning rate: 10 loss = 0.1178327202796936
iteration: 9 learning rate: 10 loss = 0.1179950013756752
iteration: 10 learning rate: 10 loss = 0.11520280689001083
iteration: 11 learning rate: 10 loss = 0.12716393172740936
iteration: 12 learning rate: 10 loss = 0.12422306835651398
iteration: 13 learning rate: 10 loss = 0.13030129671096802
iteration: 14 learning rate: 10 loss = 0.12838180363178253
iteration: 15 learning rate: 10 loss = 0.11887713521718979
iteration: 16 learning rate: 10 loss = 0.1185910627245903
iteration: 17 learning rate: 10 loss = 0.11371660977602005
iteration: 18 learning rate: 10 loss = 0.10454636812210083
iteration: 19 learning rate: 10 loss = 0.12169446051120758
iteration: 20 learning rate: 10 loss = 0.11902362108230591
iteration: 21 learning rate: 10 loss = 0.1097453162074089

The losses are the same (or nearly the same) until iteration 18, although the learning rate is decreased every step_size=5 iterations in step 1.
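
A more direct way to see whether the new rate takes effect at the change point might be to look at the size of the parameter update itself rather than at the loss. A rough, self-contained sketch (placeholder model and data, not the attached scripts):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(10, 1)                        # placeholder model
x, y = torch.randn(64, 10), torch.randn(64, 1)  # placeholder data
criterion = nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for it in range(10):
    before = [p.detach().clone() for p in model.parameters()]
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
    # Size of this iteration's parameter update (lr * gradient for plain SGD);
    # it should shrink by roughly gamma at iteration 5 if the new rate is used.
    delta = sum((p.detach() - q).norm().item()
                for p, q in zip(model.parameters(), before))
    print("iteration:", it, "update size:", delta)
```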