shwinshaker / LipGrow

An adaptive training algorithm for residual networks

Command to run to save 50% training time #1

Closed lehougoogle closed 4 years ago

lehougoogle commented 4 years ago

Hi there,

Congratulations on your paper "Towards Adaptive Residual Network Training: A Neural-ODE Perspective"!

It seems that in the code the default setting is not optimal (for example, it uses a constant learning rate instead of the adaptive cosine schedule). Could you please give me instructions on which command to run to save 50% of the training time?

Thank you very much!

shwinshaker commented 4 years ago

Hi,

Thanks for your interest in our paper!

We apologize for not updating our repo after the paper submission. I have now refactored the code and added some instructions. Please try running launch.sh to adaptively train a ResNet-74 on CIFAR-10 with our algorithm.

If there are any further issues, feel free to follow up in this thread and I am happy to help. :D

lehougoogle commented 4 years ago

Thank you very much!

I ran launch.sh with the updated code. After one epoch it gives:

File "train.py", line 195, in main logger.append([epoch, (time.time() - timestart)/60., scheduler.lr(), File "/home/lehou/LipGrow-master/utils/scheduler.py", line 144, in lr_ assert(lrs[0] == self.get_lr()[0]), (lrs[0], self.get_lr()[0], self.last_epoch, 'Inconsistent learning rate between scheduler and optimizer!') AssertionError: (0.49995422384373267, 0.49990845188677696, 1, 'Inconsistent learning rate between scheduler and optimizer!')

I commented out scheduler.py:144 and then it runs. Just want to make sure: is this error expected or not?

Thanks again.

lehougoogle commented 4 years ago

Hi shwinshaker,

I printed some log out:

epoch:1, lrs[0]:0.500000, self.get_lr()[0]:0.500000
epoch:11, lrs[0]:0.495436, self.get_lr()[0]:0.494573
epoch:21, lrs[0]:0.481912, self.get_lr()[0]:0.480174
epoch:31, lrs[0]:0.459922, self.get_lr()[0]:0.457377
epoch:41, lrs[0]:0.430270, self.get_lr()[0]:0.427014
epoch:51, lrs[0]:0.394042, self.get_lr()[0]:0.390197
epoch:61, lrs[0]:0.352563, self.get_lr()[0]:0.348273
epoch:71, lrs[0]:0.307349, self.get_lr()[0]:0.302776
epoch:81, lrs[0]:0.260057, self.get_lr()[0]:0.255369
epoch:91, lrs[0]:0.212414, self.get_lr()[0]:0.207787
epoch:101, lrs[0]:0.166165, self.get_lr()[0]:0.161772
epoch:111, lrs[0]:0.496597, self.get_lr()[0]:0.494716
epoch:121, lrs[0]:0.438650, self.get_lr()[0]:0.430080
epoch:131, lrs[0]:0.324979, self.get_lr()[0]:0.312477
epoch:141, lrs[0]:0.452350, self.get_lr()[0]:0.438561
epoch:151, lrs[0]:0.224420, self.get_lr()[0]:0.201066
epoch:161, lrs[0]:0.022570, self.get_lr()[0]:0.014920

It seems that lrs[0] and self.get_lr()[0] were initially very close to each other, but the relative difference then grew. Which of these two learning rates is the correct one?

shwinshaker commented 4 years ago

Hi,

Which PyTorch version are you using? This looks like a version issue. I am currently using PyTorch 1.3.1 and do not see this problem.

To elaborate a bit: get_lr() is implemented in the PyTorch source and returns the current learning rate tracked by the scheduler, while lrs[0] is the current learning rate used by the optimizer. They should be consistent, since the learning rate in the optimizer is set by the scheduler.
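
As a rough illustration (a minimal sketch using a stock PyTorch cosine scheduler, not our scheduler.py; the model and hyperparameters are placeholders), the two values come from these two places and should track each other:

import torch

# Toy setup, just to show where the two learning rates live
model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=160)

for epoch in range(3):
    # ... one epoch of training would go here ...
    optimizer.step()
    scheduler.step()
    lr_in_optimizer = optimizer.param_groups[0]['lr']  # what the optimizer actually applies
    lr_in_scheduler = scheduler.get_last_lr()[0]       # what the scheduler last set (PyTorch >= 1.4)
    assert lr_in_optimizer == lr_in_scheduler, 'scheduler and optimizer disagree'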

I checked the learning rate scheduler code in the latest PyTorch version and found that they have modified the get_lr() implementation. It seems get_lr() now returns different learning rates depending on where you call it.

If you are using the latest PyTorch, could you try changing self.get_lr() to self.get_last_lr(), as suggested in the latest PyTorch documentation? If that still fails, I would suggest trusting the learning rate in the optimizer.
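
Concretely, the change I have in mind is roughly the following (a sketch only; I am assuming lrs is read from the optimizer's param_groups as in the assertion message above, and the exact line may differ in your copy):

# In utils/scheduler.py, assuming lrs comes from the optimizer's param_groups:
lrs = [group['lr'] for group in self.optimizer.param_groups]

# Old check (this is what works for me on PyTorch 1.3.1):
# assert lrs[0] == self.get_lr()[0], (lrs[0], self.get_lr()[0], self.last_epoch,
#                                     'Inconsistent learning rate between scheduler and optimizer!')

# New check (PyTorch >= 1.4, where get_last_lr() is the documented accessor):
assert lrs[0] == self.get_last_lr()[0], (lrs[0], self.get_last_lr()[0], self.last_epoch,
                                         'Inconsistent learning rate between scheduler and optimizer!')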

lehougoogle commented 4 years ago

Thanks. Changing get_lr() to get_last_lr() solved the problem! Thank you again for the help!