Closed — Spnetic-5 closed this 1 year ago
@milancurcic Please review the implementation of the suggested changes to the code. I see a difference in the MSE of SGD after adding the `momentum` and `nesterov` arguments.
@Spnetic-5 I refactored the subroutines in the quadratic_fit example to accept `xtest` and `ytest`. This allows the subroutines to report the RMSE every so many epochs. For example, this is what the output looks like now:
...
Stochastic gradient descent
----------------------------------
Epoch: 100/1000, RMSE = 0.132655
Epoch: 200/1000, RMSE = 0.131642
Epoch: 300/1000, RMSE = 0.131308
Epoch: 400/1000, RMSE = 0.131156
Epoch: 500/1000, RMSE = 0.131071
Epoch: 600/1000, RMSE = 0.131018
Epoch: 700/1000, RMSE = 0.130981
Epoch: 800/1000, RMSE = 0.130953
Epoch: 900/1000, RMSE = 0.130932
Epoch: 1000/1000, RMSE = 0.130915
Batch gradient descent
----------------------------------
Epoch: 100/1000, RMSE = 0.474867
Epoch: 200/1000, RMSE = 0.406169
Epoch: 300/1000, RMSE = 0.356077
Epoch: 400/1000, RMSE = 0.320101
Epoch: 500/1000, RMSE = 0.294075
Epoch: 600/1000, RMSE = 0.274908
Epoch: 700/1000, RMSE = 0.260488
Epoch: 800/1000, RMSE = 0.249400
Epoch: 900/1000, RMSE = 0.240700
Epoch: 1000/1000, RMSE = 0.233746
Minibatch gradient descent
----------------------------------
Epoch: 100/1000, RMSE = 0.189657
Epoch: 200/1000, RMSE = 0.186826
Epoch: 300/1000, RMSE = 0.185389
Epoch: 400/1000, RMSE = 0.183885
Epoch: 500/1000, RMSE = 0.181658
Epoch: 600/1000, RMSE = 0.177690
Epoch: 700/1000, RMSE = 0.170329
Epoch: 800/1000, RMSE = 0.159604
Epoch: 900/1000, RMSE = 0.149710
Epoch: 1000/1000, RMSE = 0.143441
RMSProp optimizer
----------------------------------
Epoch: 100/1000, RMSE = 0.198733
Epoch: 200/1000, RMSE = 0.167841
Epoch: 300/1000, RMSE = 0.138085
Epoch: 400/1000, RMSE = 0.134165
Epoch: 500/1000, RMSE = 0.132821
Epoch: 600/1000, RMSE = 0.132069
Epoch: 700/1000, RMSE = 0.131591
Epoch: 800/1000, RMSE = 0.131288
Epoch: 900/1000, RMSE = 0.131106
Epoch: 1000/1000, RMSE = 0.130999
This allows us to see how the RMSE evolves during the training.
While doing this, I also discovered a small but serious error in the calculation of `xtest`. Before the fix, it was evaluated as an integer expression, so the `xtest` values were only zeros and ones, with no values in between. That means that until now we were evaluating at only those two points, which gave us a false idea of the RMSE.
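For context, this is a classic Fortran pitfall: an expression whose operands are all integers is evaluated in integer arithmetic, no matter the type of the variable it's assigned to. A minimal sketch (variable names and loop bounds hypothetical, not the actual example code):

```fortran
program integer_division_pitfall
  implicit none
  integer :: i, n
  real :: x_wrong, x_right
  n = 4
  do i = 0, n
    x_wrong = i / n              ! integer division: 0 for i < n, 1 for i == n
    x_right = real(i) / real(n)  ! promote first: 0.0, 0.25, 0.5, 0.75, 1.0
    print '(2f6.2)', x_wrong, x_right
  end do
end program integer_division_pitfall
```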
I'll continue with the remainder of the review over the weekend.
I've added the necessary data structures to the optimizer type definitions to allow keeping track of the velocity in the case of SGD and of the RMS gradients in the case of RMSProp. The code doesn't work yet--we need to correctly initialize these structures and correctly reference them during each successive `minimize` call. The main challenge is that currently the `minimize` method is an `elemental` subroutine that works on a per-layer basis, so it's not aware of which layer it's working on. I'm not yet clear on the best approach, but here are some ideas I've been thinking about:

1. Pass the layer index to `optimizer % minimize()` to let it know on which layer we are. The subroutine can then correctly address the velocity/RMS arrays.
3. Work on flat arrays for the whole network: size them via `net % get_num_params()`, minimize over the flattened gradients, and update the weights via `call net % set_params()`. The downside is that getting the gradients from all layers in one place will require a new allocation (maybe, let's see).

I took approach 3 above for the time being. We also need `layer % get_gradients()` and `network % get_gradients()` methods, which are not implemented yet; they will be essentially the same procedures as `layer % get_params()` and `network % get_params()`.
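As a rough sketch of what `network % get_gradients()` could look like if it mirrors `get_params()` (the layer interfaces assumed here are illustrations, not the actual neural-fortran implementation):

```fortran
! Hypothetical sketch; assumes each layer exposes get_num_params()
! and a get_gradients() that returns its flattened gradients.
function get_gradients(self) result(gradients)
  class(network), intent(in) :: self
  real, allocatable :: gradients(:)
  integer :: n, nstart, nend

  allocate(gradients(self % get_num_params()))
  nstart = 1
  do n = 1, size(self % layers)
    nend = nstart + self % layers(n) % get_num_params() - 1
    ! Skip parameterless layers (e.g. input), which contribute zero elements
    if (nend >= nstart) gradients(nstart:nend) = self % layers(n) % get_gradients()
    nstart = nend + 1
  end do
end function get_gradients
```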
Because `velocity` in SGD and `rms_gradient` in RMSProp are allocatable arrays, they can't be used in the arithmetic inside the `elemental` `minimize` methods, so `minimize` is no longer `elemental`.
Once getting the layer gradients is implemented and everything is hooked up correctly, it should just work for both SGD (with classic or Nesterov momentum, or without) and RMSProp. Once it does, this huge PR can be merged and we'll make a new minor release.
We're almost there. Now we just need to add a test suite for the optimizer classes to ensure that they all converge on a simple problem, e.g. the one in examples/simple.f90.
Sure, I'll try to implement the test suite.
@Spnetic-5 Thanks! To clarify: since the default SGD is already being tested implicitly in e.g. examples/test_dense_network.f90 and elsewhere (the simple and sine examples), for this test suite let's focus on the optimizer variants that are new here: SGD with classic and Nesterov momentum, and RMSProp.
For ideas on how to write the test suite program, follow how other test programs are written.
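For example, a convergence test could follow the shape of examples/simple.f90: train on a tiny fixed mapping and assert that the prediction error falls below a tolerance. A sketch (the `nf` module interfaces are assumed to match the examples and may differ in detail):

```fortran
program test_sgd_convergence
  use nf, only: dense, input, network, sgd
  implicit none
  type(network) :: net
  real :: x(3), y(2)
  real, parameter :: tolerance = 1e-3
  integer :: n

  ! Tiny fixed input/output pair, as in examples/simple.f90
  x = [0.2, 0.4, 0.6]
  y = [0.123456, 0.246802]

  net = network([input(3), dense(2)])

  do n = 1, 1000
    call net % forward(x)
    call net % backward(y)
    call net % update(sgd(learning_rate=1.))
  end do

  if (all(abs(net % predict(x) - y) < tolerance)) then
    print '(a)', 'test_sgd_convergence: PASS'
  else
    print '(a)', 'test_sgd_convergence: FAIL'
  end if
end program test_sgd_convergence
```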
BTW, I wrote in d27bf09 that tests pass in debug but segfault in release mode. After I pushed it, both debug and release modes passed in CI. Then I went back and saw that I was working locally with GFortran v9. When I switched to v10, both modes passed tests.
@milancurcic I have added a draft test suite; please confirm whether I am on the right track, and let me know if you have any suggestions before I proceed further.
Also, are we continuing the test suite in the same PR, or should I open a separate one?
Thanks @Spnetic-5, yes, the structure of the program is good. I see that you're not actually testing for convergence yet; I assume that's on your TODO list.
I realized that for optimizers that need to store state (velocity in SGD with momentum, and RMSProp in general), we actually need to store the optimizer instance as part of the network type so that it can preserve state between successive `net % update()` calls. Currently we always pass a freshly created optimizer instance, so no state is stored. As you're working on the test suite, can you also add a `class(optimizer_base_type), allocatable :: optimizer` member to the `network` type? Then, in `train` and `update`, we need to store it on the first call but leave it untouched on successive calls.
You can use `allocated(self % optimizer)` to check whether it has been allocated, as a test to determine whether it needs to be set for the first time.
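The first-call pattern could look roughly like this inside `network % update` (a sketch; the argument names and the default optimizer are assumptions):

```fortran
subroutine update(self, optimizer)
  class(network), intent(inout) :: self
  class(optimizer_base_type), intent(in), optional :: optimizer

  ! Store the optimizer on the first call only, so that its internal
  ! state (velocity, RMS gradients) persists across subsequent calls.
  if (.not. allocated(self % optimizer)) then
    if (present(optimizer)) then
      allocate(self % optimizer, source=optimizer)
    else
      allocate(self % optimizer, source=sgd())  ! assumed default
    end if
  end if

  ! ... then gather the gradients and call self % optimizer % minimize(...)
end subroutine update
```

Sourced allocation (`allocate(..., source=...)`) keeps the dynamic type of whatever `optimizer_base_type` extension the caller passed in.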
@Spnetic-5 I went ahead and added the optimizer instance as a member of the network as it may not be a trivial thing to do, so you can focus on implementing the convergence tests.
Thanks @milancurcic, sure I'm working on convergence tests.
Solves #137