udacity / deep-learning-v2-pytorch

Projects and exercises for the latest Deep Learning ND program https://www.udacity.com/course/deep-learning-nanodegree--nd101

Validation loss does not decrease #341

Closed rakshitraj closed 3 years ago

rakshitraj commented 3 years ago

In the notebook MNIST-MLP-with-validation, the validation loss does not show a clear downward trend over the epochs. What, then, should be the ideal number of epochs at which to stop training?

Refer to the loss vs. epochs plot below:

[Plot: training and validation loss vs. epochs (mnist_mlp_with_validation, standard and enlarged views)]

Refer to the training output below:

Epoch: 1    Training Loss: 0.173275     Validation Loss: 0.072151
Epoch: 2    Training Loss: 0.180673     Validation Loss: 0.073222
Epoch: 3    Training Loss: 0.170470     Validation Loss: 0.071506
Epoch: 4    Training Loss: 0.169157     Validation Loss: 0.071434
Epoch: 5    Training Loss: 0.164801     Validation Loss: 0.073985
Epoch: 6    Training Loss: 0.164058     Validation Loss: 0.072523
Epoch: 7    Training Loss: 0.155625     Validation Loss: 0.073031
Epoch: 8    Training Loss: 0.142002     Validation Loss: 0.074034
Epoch: 9    Training Loss: 0.150512     Validation Loss: 0.073295
Epoch: 10   Training Loss: 0.153645     Validation Loss: 0.071989
Epoch: 11   Training Loss: 0.142902     Validation Loss: 0.071597
Epoch: 12   Training Loss: 0.144769     Validation Loss: 0.073353
Epoch: 13   Training Loss: 0.137011     Validation Loss: 0.075439
Epoch: 14   Training Loss: 0.128657     Validation Loss: 0.074456
Epoch: 15   Training Loss: 0.127031     Validation Loss: 0.072143
Epoch: 16   Training Loss: 0.130579     Validation Loss: 0.074429
Epoch: 17   Training Loss: 0.118127     Validation Loss: 0.075609
Epoch: 18   Training Loss: 0.112879     Validation Loss: 0.075535
Epoch: 19   Training Loss: 0.128090     Validation Loss: 0.073065
Epoch: 20   Training Loss: 0.112375     Validation Loss: 0.074643
Epoch: 21   Training Loss: 0.115144     Validation Loss: 0.074257
Epoch: 22   Training Loss: 0.106094     Validation Loss: 0.074520
Epoch: 23   Training Loss: 0.109507     Validation Loss: 0.075728
Epoch: 24   Training Loss: 0.109469     Validation Loss: 0.074699
Epoch: 25   Training Loss: 0.110970     Validation Loss: 0.073907
Epoch: 26   Training Loss: 0.096544     Validation Loss: 0.075542
Epoch: 27   Training Loss: 0.103879     Validation Loss: 0.077541
Epoch: 28   Training Loss: 0.102920     Validation Loss: 0.077198
Epoch: 29   Training Loss: 0.094102     Validation Loss: 0.075062
Epoch: 30   Training Loss: 0.090215     Validation Loss: 0.074472
Epoch: 31   Training Loss: 0.086098     Validation Loss: 0.074997
Epoch: 32   Training Loss: 0.097066     Validation Loss: 0.076258
Epoch: 33   Training Loss: 0.085929     Validation Loss: 0.075598
Epoch: 34   Training Loss: 0.084987     Validation Loss: 0.077572
Epoch: 35   Training Loss: 0.086108     Validation Loss: 0.075918
Epoch: 36   Training Loss: 0.080684     Validation Loss: 0.075090
Epoch: 37   Training Loss: 0.084099     Validation Loss: 0.077466
Epoch: 38   Training Loss: 0.084811     Validation Loss: 0.074790
Epoch: 39   Training Loss: 0.074331     Validation Loss: 0.076772
Epoch: 40   Training Loss: 0.084416     Validation Loss: 0.076223
Epoch: 41   Training Loss: 0.072896     Validation Loss: 0.076234
Epoch: 42   Training Loss: 0.068715     Validation Loss: 0.077100
Epoch: 43   Training Loss: 0.069165     Validation Loss: 0.074734
Epoch: 44   Training Loss: 0.074903     Validation Loss: 0.077468
Epoch: 45   Training Loss: 0.074492     Validation Loss: 0.078448
Epoch: 46   Training Loss: 0.067751     Validation Loss: 0.078467
Epoch: 47   Training Loss: 0.063131     Validation Loss: 0.077308
Epoch: 48   Training Loss: 0.066979     Validation Loss: 0.077013
Epoch: 49   Training Loss: 0.077288     Validation Loss: 0.077013
Epoch: 50   Training Loss: 0.066796     Validation Loss: 0.079931
Epoch: 51   Training Loss: 0.067261     Validation Loss: 0.077724
Epoch: 52   Training Loss: 0.059412     Validation Loss: 0.079420
Epoch: 53   Training Loss: 0.065499     Validation Loss: 0.080297
Epoch: 54   Training Loss: 0.068294     Validation Loss: 0.078086
Epoch: 55   Training Loss: 0.063540     Validation Loss: 0.078597
Epoch: 56   Training Loss: 0.064520     Validation Loss: 0.078995
Epoch: 57   Training Loss: 0.056037     Validation Loss: 0.078627
Epoch: 58   Training Loss: 0.060160     Validation Loss: 0.078080
Epoch: 59   Training Loss: 0.059273     Validation Loss: 0.077827
Epoch: 60   Training Loss: 0.060961     Validation Loss: 0.075285
Epoch: 61   Training Loss: 0.057903     Validation Loss: 0.077738
Epoch: 62   Training Loss: 0.063609     Validation Loss: 0.077966
Epoch: 63   Training Loss: 0.057536     Validation Loss: 0.077801
Epoch: 64   Training Loss: 0.057395     Validation Loss: 0.078586
Epoch: 65   Training Loss: 0.049628     Validation Loss: 0.078420
Epoch: 66   Training Loss: 0.056603     Validation Loss: 0.078370
Epoch: 67   Training Loss: 0.064318     Validation Loss: 0.080227
Epoch: 68   Training Loss: 0.051121     Validation Loss: 0.079326
Epoch: 69   Training Loss: 0.060982     Validation Loss: 0.079202
Epoch: 70   Training Loss: 0.049256     Validation Loss: 0.079026
Epoch: 71   Training Loss: 0.048598     Validation Loss: 0.079224
Epoch: 72   Training Loss: 0.049865     Validation Loss: 0.078970
Epoch: 73   Training Loss: 0.055158     Validation Loss: 0.079744
Epoch: 74   Training Loss: 0.054415     Validation Loss: 0.081481
Epoch: 75   Training Loss: 0.052432     Validation Loss: 0.081267
Epoch: 76   Training Loss: 0.051720     Validation Loss: 0.081768
Epoch: 77   Training Loss: 0.053280     Validation Loss: 0.080334
Epoch: 78   Training Loss: 0.051649     Validation Loss: 0.079957
Epoch: 79   Training Loss: 0.053677     Validation Loss: 0.081049
Epoch: 80   Training Loss: 0.049124     Validation Loss: 0.081066
Epoch: 81   Training Loss: 0.050054     Validation Loss: 0.080952
Epoch: 82   Training Loss: 0.048973     Validation Loss: 0.077719
Epoch: 83   Training Loss: 0.047101     Validation Loss: 0.080348
Epoch: 84   Training Loss: 0.047056     Validation Loss: 0.079272
Epoch: 85   Training Loss: 0.047837     Validation Loss: 0.079291
Epoch: 86   Training Loss: 0.045731     Validation Loss: 0.081468
Epoch: 87   Training Loss: 0.040476     Validation Loss: 0.080208
Epoch: 88   Training Loss: 0.043921     Validation Loss: 0.081199
Epoch: 89   Training Loss: 0.054620     Validation Loss: 0.080562
Epoch: 90   Training Loss: 0.042875     Validation Loss: 0.080026
Epoch: 91   Training Loss: 0.050372     Validation Loss: 0.080715
Epoch: 92   Training Loss: 0.043803     Validation Loss: 0.080129
Epoch: 93   Training Loss: 0.043946     Validation Loss: 0.080731
Epoch: 94   Training Loss: 0.037849     Validation Loss: 0.080961
Epoch: 95   Training Loss: 0.043542     Validation Loss: 0.081609
Epoch: 96   Training Loss: 0.040342     Validation Loss: 0.081530
Epoch: 97   Training Loss: 0.044127     Validation Loss: 0.081840
Epoch: 98   Training Loss: 0.037339     Validation Loss: 0.081128
Epoch: 99   Training Loss: 0.043082     Validation Loss: 0.082328
Epoch: 100  Training Loss: 0.040335     Validation Loss: 0.081088
ronny-udacity commented 3 years ago

You can implement a conditional early-stopping function by setting a threshold and monitoring the delta in the training loss. When that delta falls below the threshold, you can stop the training loop. Here's an implementation example from StackOverflow. Alternatively, you can use a third-party library that supports early stopping (see example here).
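For illustration, here is a minimal sketch of that idea: a small helper that tracks the best loss seen so far and signals a stop once the loss has not improved by at least `min_delta` for `patience` consecutive epochs. The `EarlyStopping` class, its parameter values, and the synthetic losses below are illustrative assumptions, not code from the notebook or from the linked examples; in the notebook you would call `step()` once per epoch with the training loss (as described above) or with the validation loss, which is what this issue is about.

```python
# Minimal early-stopping sketch (illustrative only, not the notebook's code).
class EarlyStopping:
    """Stop training once the monitored loss stops improving by more than min_delta."""

    def __init__(self, min_delta=1e-4, patience=5):
        self.min_delta = min_delta        # smallest decrease that counts as an improvement
        self.patience = patience          # epochs to wait after the last improvement
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, loss):
        """Record this epoch's loss; return True when training should stop."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss         # meaningful improvement: reset the counter
            self.counter = 0
        else:
            self.counter += 1             # no meaningful improvement this epoch
        return self.counter >= self.patience


# Quick demonstration with synthetic per-epoch losses (stand-ins for real values):
fake_valid_losses = [0.0720, 0.0715, 0.0714, 0.0740, 0.0725, 0.0730, 0.0740, 0.0733]
stopper = EarlyStopping(min_delta=1e-4, patience=5)
for epoch, loss in enumerate(fake_valid_losses, start=1):
    if stopper.step(loss):
        print(f"Stopping early at epoch {epoch}")
        break
```

With this setup you would typically also save a checkpoint whenever the monitored loss improves, so that the model you keep is the one from before the loss plateaued.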

rakshitraj commented 3 years ago

Please note that the validation loss shows a general trend of increasing. It may drop from one epoch to the next, but across a span of epochs the overall validation loss is, in fact, increasing.