In the meantime, I've noticed something else: The linear update to the learning rate
lrs[i] <- self$optimizer$param_groups[[i]]$lr * (self$end_lr / self$optimizer$param_groups[[i]]$lr) ^ (self$last_epoch / self$iters)
means that we get very little representation of low learning rates, and spend most of the iterations close to the maximal one:
lr loss
1 0.0001083356 2.4614830
2 0.0001269458 2.3519001
3 0.0001602593 2.2732861
4 0.0002166283 2.1725931
5 0.0003110180 2.1210122
6 0.0004697272 2.0187278
7 0.0007382696 1.9209898
8 0.0011937887 1.7630707
9 0.0019631112 1.6140277
10 0.0032461048 1.5705203
11 0.0053406731 1.4852645
12 0.0086604215 1.3220248
13 0.0137304483 1.5796776
14 0.0211449096 1.7366034
15 0.0314769864 3.8761604
16 0.0451495726 6.6005144
17 0.0622978613 5.3288274
18 0.0826703543 6.2529149
19 0.1056103241 5.6417775
20 0.1301335721 6.0517468
21 0.1550825920 7.5857854
22 0.1793105554 5.9789000
23 0.2018433277 4.2130308
24 0.2219831671 4.8658214
25 0.2393427883 5.0581737
26 0.2538204340 11.4970751
27 0.2655382837 13.6970263
28 0.2747675839 17.0547066
29 0.2818581943 15.4653778
30 0.2871824142 29.1765594
31 0.2910961758 27.3584576
32 0.2939162660 23.2627010
33 0.2959101352 21.6323414
34 0.2972943950 18.7045059
35 0.2982385698 19.8880005
36 0.2988714901 15.3784094
37 0.2992885431 14.8322916
38 0.2995586978 19.8532715
39 0.2997307284 13.6776218
40 0.2998384080 16.6825924
41 0.2999046502 9.6280003
42 0.2999446934 7.7134237
43 0.2999684740 7.7692466
44 0.2999823450 5.9394960
45 0.2999902896 4.5372977
46 0.2999947564 6.2856917
47 0.2999972209 12.7986326
48 0.2999985548 13.0505371
49 0.2999992630 14.2187357
50 0.2999996315 12.0380602
51 0.2999998194 7.3475771
52 0.2999999133 14.8489075
53 0.2999999593 10.3688183
54 0.2999999813 12.3401890
55 0.2999999916 8.5191517
56 0.2999999963 9.4620705
57 0.2999999984 7.6367407
58 0.2999999993 7.3901806
59 0.2999999997 6.9529352
60 0.2999999999 6.4206972
61 0.3000000000 5.4685631
62 0.3000000000 4.0115976
63 0.3000000000 5.2044287
64 0.3000000000 4.6368008
65 0.3000000000 3.8385959
66 0.3000000000 3.4284925
67 0.3000000000 4.2496982
68 0.3000000000 2.9259527
69 0.3000000000 3.3436406
70 0.3000000000 3.6924629
71 0.3000000000 2.3569930
72 0.3000000000 2.4658334
73 0.3000000000 2.1678135
74 0.3000000000 2.3311815
75 0.3000000000 2.1689682
76 0.3000000000 2.2175486
77 0.3000000000 1.8884877
78 0.3000000000 1.9370137
79 0.3000000000 1.9622595
80 0.3000000000 1.9369893
81 0.3000000000 1.6718340
82 0.3000000000 1.9107810
83 0.3000000000 1.5918878
84 0.3000000000 1.6824082
85 0.3000000000 1.7195944
86 0.3000000000 1.5178674
87 0.3000000000 1.3063056
88 0.3000000000 1.6406715
89 0.3000000000 1.5309871
90 0.3000000000 1.5005339
91 0.3000000000 1.5982531
92 0.3000000000 1.2381088
93 0.3000000000 1.4246720
94 0.3000000000 1.4040461
95 0.3000000000 1.1783930
96 0.3000000000 1.2010577
97 0.3000000000 1.1817559
98 0.3000000000 0.9616594
99 0.3000000000 1.3799338
100 0.3000000000 0.9910566
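To make explicit where this comes from, here is a minimal standalone sketch of that update rule; the variable names and the settings (start_lr = 1e-4, end_lr = 0.3, n_iters = 100, read off the table above) are my assumptions, not the package code:

```r
start_lr <- 1e-4
end_lr   <- 0.3
n_iters  <- 100

lr  <- start_lr
lrs <- numeric(n_iters)
for (i in seq_len(n_iters)) {
  # same form as above: current lr times (end_lr / current lr)^(i / n_iters)
  lr <- lr * (end_lr / lr)^(i / n_iters)
  lrs[i] <- lr
}
round(lrs, 7)
# reproduces the lr column above: the rate has essentially reached
# end_lr = 0.3 after about 60 iterations, so only a handful of steps
# are spent at low learning rates
```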
After changing to
self$multiplier <- (end_lr/start_lr)^(1/n_iters)
lrs[i] <- self$optimizer$param_groups[[i]]$lr * self$multiplier
(again, following Sylvain Gugger's post https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html; a standalone sketch of this rule follows the table below)
it looks like this:
> rates_and_losses
lr loss
1 0.0001083356 2.4352272
2 0.0001173660 2.3659532
3 0.0001271492 2.2688315
4 0.0001377479 2.2244060
5 0.0001492300 2.1558168
6 0.0001616692 2.0787528
7 0.0001751453 2.0653663
8 0.0001897447 1.9588171
9 0.0002055611 1.9151413
10 0.0002226959 1.8843687
11 0.0002412589 1.8045338
12 0.0002613693 1.7947593
13 0.0002831560 1.6866556
14 0.0003067588 1.6983283
15 0.0003323290 1.6600128
16 0.0003600306 1.5919452
17 0.0003900413 1.5313349
18 0.0004225536 1.4699878
19 0.0004577760 1.4201385
20 0.0004959344 1.3273145
21 0.0005372736 1.4024817
22 0.0005820586 1.3270025
23 0.0006305767 1.2821181
24 0.0006831390 1.2964771
25 0.0007400828 1.1751999
26 0.0008017732 1.2328001
27 0.0008686058 1.1578518
28 0.0009410094 1.1054106
29 0.0010194482 1.0325199
30 0.0011044254 0.9430044
31 0.0011964859 0.9398008
32 0.0012962202 1.0129902
33 0.0014042680 0.9202777
34 0.0015213223 0.8984810
35 0.0016481337 0.8365785
36 0.0017855156 0.8255904
37 0.0019343491 0.7296498
38 0.0020955888 0.8311969
39 0.0022702688 0.7243664
40 0.0024595095 0.6857455
41 0.0026645245 0.7184579
42 0.0028866287 0.6488046
43 0.0031272467 0.6794920
44 0.0033879216 0.5804777
45 0.0036703254 0.6073003
46 0.0039762692 0.5804019
47 0.0043077153 0.6195212
48 0.0046667894 0.6198562
49 0.0050557945 0.5246215
50 0.0054772256 0.6417997
51 0.0059337855 0.5398391
52 0.0064284024 0.5210927
53 0.0069642486 0.5183672
54 0.0075447608 0.5714974
55 0.0081736623 0.5089962
56 0.0088549865 0.4797995
57 0.0095931032 0.5666440
58 0.0103927463 0.5583856
59 0.0112590446 0.6538793
60 0.0121975541 0.6618884
61 0.0132142940 0.5681854
62 0.0143157853 0.5513948
63 0.0155090927 0.6577442
64 0.0168018693 0.5736349
65 0.0182024068 0.6227516
66 0.0197196875 0.4796226
67 0.0213634427 0.5836340
68 0.0231442149 0.5269037
69 0.0250734252 0.6377246
70 0.0271634469 0.8338370
71 0.0294276845 0.5299796
72 0.0318806600 0.7992362
73 0.0345381058 1.2265229
74 0.0374170659 0.8356883
75 0.0405360046 1.1078746
76 0.0439149258 1.2965573
77 0.0475755005 1.3705090
78 0.0515412063 1.3104806
79 0.0558374776 1.2511322
80 0.0604918691 0.9643733
81 0.0655342323 1.1771545
82 0.0709969070 1.1576290
83 0.0769149286 0.8901078
84 0.0833262532 0.8606523
85 0.0902720004 0.8590534
86 0.0977967177 1.0099466
87 0.1059486657 0.8008906
88 0.1147801278 0.9692470
89 0.1243477458 0.9043386
90 0.1347128826 1.0453824
91 0.1459420162 0.7656193
92 0.1581071660 0.8675176
93 0.1712863547 0.8484798
94 0.1855641085 0.7998012
95 0.2010319994 1.1184809
96 0.2177892326 1.3061916
97 0.2359432825 1.7504541
98 0.2556105823 2.2053964
99 0.2769172705 2.4416337
100 0.3000000000 2.8977308
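For comparison, the same standalone sketch with the multiplicative rule (same assumed settings as before) gives rates that are evenly spaced on a log scale, matching the lr column in this table:

```r
start_lr <- 1e-4
end_lr   <- 0.3
n_iters  <- 100

multiplier <- (end_lr / start_lr)^(1 / n_iters)
lrs <- start_lr * multiplier^seq_len(n_iters)
round(lrs, 7)
# every step multiplies the rate by the same factor (about 1.0834),
# so low and high learning rates get equal coverage on a log scale
```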
For a simple convnet on MNIST, I now have the following plots:
1) with smoothing
2) without
Actually, comparing both plots now, the smoothing looks less important. So if you prefer, we can also just update the logic (today's commit), and forget about the smoothing.
Hi @dfalbel as discussed :-)
Here are examples of how the plot looks now - one for uniform splits, one for log-spaced splits:
What do you think?
First proposal for implementing a smoothed display of the loss, as done here: https://sgugger.github.io/how-do-you-find-a-good-learning-rate.html and (in a 1:1 translation to R) here: https://blogs.rstudio.com/ai/posts/2020-10-19-torch-image-classification/.
From practical experience, this should be more helpful to most users.
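For reference, the smoothing described in the linked post (and its R translation) is an exponentially weighted moving average of the loss with bias correction; a minimal sketch of it (my own standalone helper, not the code in this PR) could look like this, with `beta` as the smoothing factor:

```r
smooth_losses <- function(losses, beta = 0.98) {
  avg <- 0
  smoothed <- numeric(length(losses))
  for (i in seq_along(losses)) {
    # exponentially weighted moving average of the raw losses
    avg <- beta * avg + (1 - beta) * losses[i]
    # bias correction, so early values are not pulled towards zero
    smoothed[i] <- avg / (1 - beta^i)
  }
  smoothed
}

# e.g., applied to the losses recorded above:
# plot(rates_and_losses$lr, smooth_losses(rates_and_losses$loss), type = "l", log = "x")
```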
Implementation-wise, it seems like it has to go into `plot.lr_records`, in which case I don't really know how we would want to make it configurable (in a useful way) ... For now, just so one can compare, I've added arguments `smoothed_loss` and `beta` to this method. Here are example plots of smoothed vs. non-smoothed loss:
What do you think, Daniel?