modern-fortran / neural-fortran

A parallel framework for deep learning

Added Momentum and Nesterov modifications #148

Closed · Spnetic-5 closed this 1 year ago

Spnetic-5 commented 1 year ago

Solves #137

Spnetic-5 commented 1 year ago

@milancurcic Please review the implementation of the suggested changes; I can see the difference in the MSE of SGD after adding the momentum and nesterov arguments.

milancurcic commented 1 year ago

@Spnetic-5 I refactored the subroutines in the quadratic_fit example to accept xtest and ytest. This allows the subroutines to report the RMSE every set number of epochs. For example, this is what the output looks like now:

...
Stochastic gradient descent
----------------------------------
Epoch:  100/1000, RMSE =  0.132655
Epoch:  200/1000, RMSE =  0.131642
Epoch:  300/1000, RMSE =  0.131308
Epoch:  400/1000, RMSE =  0.131156
Epoch:  500/1000, RMSE =  0.131071
Epoch:  600/1000, RMSE =  0.131018
Epoch:  700/1000, RMSE =  0.130981
Epoch:  800/1000, RMSE =  0.130953
Epoch:  900/1000, RMSE =  0.130932
Epoch: 1000/1000, RMSE =  0.130915

Batch gradient descent
----------------------------------
Epoch:  100/1000, RMSE =  0.474867
Epoch:  200/1000, RMSE =  0.406169
Epoch:  300/1000, RMSE =  0.356077
Epoch:  400/1000, RMSE =  0.320101
Epoch:  500/1000, RMSE =  0.294075
Epoch:  600/1000, RMSE =  0.274908
Epoch:  700/1000, RMSE =  0.260488
Epoch:  800/1000, RMSE =  0.249400
Epoch:  900/1000, RMSE =  0.240700
Epoch: 1000/1000, RMSE =  0.233746

Minibatch gradient descent
----------------------------------
Epoch:  100/1000, RMSE =  0.189657
Epoch:  200/1000, RMSE =  0.186826
Epoch:  300/1000, RMSE =  0.185389
Epoch:  400/1000, RMSE =  0.183885
Epoch:  500/1000, RMSE =  0.181658
Epoch:  600/1000, RMSE =  0.177690
Epoch:  700/1000, RMSE =  0.170329
Epoch:  800/1000, RMSE =  0.159604
Epoch:  900/1000, RMSE =  0.149710
Epoch: 1000/1000, RMSE =  0.143441

RMSProp optimizer
----------------------------------
Epoch:  100/1000, RMSE =  0.198733
Epoch:  200/1000, RMSE =  0.167841
Epoch:  300/1000, RMSE =  0.138085
Epoch:  400/1000, RMSE =  0.134165
Epoch:  500/1000, RMSE =  0.132821
Epoch:  600/1000, RMSE =  0.132069
Epoch:  700/1000, RMSE =  0.131591
Epoch:  800/1000, RMSE =  0.131288
Epoch:  900/1000, RMSE =  0.131106
Epoch: 1000/1000, RMSE =  0.130999

This allows us to see how the RMSE evolves during training.
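For reference, the reporting inside each subroutine looks roughly like this (a sketch; the variable names and the reporting frequency/format are illustrative):

if (mod(n, 100) == 0) then
  ! Evaluate the network on the test set, one point at a time.
  ypred = [(net % predict([xtest(i)]), i = 1, size(xtest))]
  rmse = sqrt(sum((ypred - ytest)**2) / size(ytest))
  print '(a,i4,a,i4,a,f9.6)', 'Epoch: ', n, '/', num_epochs, ', RMSE = ', rmse
end if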

While doing this, I also discovered a small but serious error in the calculation of xtest. Before the fix, it was evaluated as an integer expression, so the xtest values were only zeros and ones, with no values in between. That means we had so far been evaluating the model at only those two points, which gave us a false idea of the RMSE.
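In Fortran, an all-integer right-hand side such as i / (n - 1) truncates toward zero before being assigned to a real variable, which is exactly this failure mode. A minimal standalone illustration (hypothetical names):

program xtest_bug_demo
  implicit none
  integer, parameter :: n = 5
  real :: xtest(n)
  integer :: i
  ! Buggy: the RHS is an all-integer expression, so it truncates to 0
  ! for every i except the last, where it is exactly 1.
  do i = 1, n
    xtest(i) = (i - 1) / (n - 1)
  end do
  print *, 'buggy:', xtest   ! 0.0 0.0 0.0 0.0 1.0
  ! Fixed: promote to real before dividing so xtest spans [0, 1] evenly.
  do i = 1, n
    xtest(i) = real(i - 1) / real(n - 1)
  end do
  print *, 'fixed:', xtest   ! 0.0 0.25 0.5 0.75 1.0
end program xtest_bug_demo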

milancurcic commented 1 year ago

I'll continue with the remainder of the review over the weekend.

milancurcic commented 1 year ago

I've added the necessary data structures to the optimizer type definitions to allow keeping track of the velocity in the case of SGD and the RMS gradients in the case of RMSProp. The code doesn't work yet--we need to correctly initialize these structures and correctly reference them on each successive minimization call. The main challenge is that minimize is currently an elemental subroutine that works on a per-layer basis, so it's not aware of which layer it's working on. I'm not yet clear on the best approach, but here are some ideas I've been thinking about:

  1. Rather than minimizing at the layer level, minimize at the network level; we loop over the layers and have all the information we need to reference the velocity/RMS values correctly. This is basically the approach of the RMSProp subroutine in examples/quadratic.f90.
  2. Pass the layer number/name to optimizer % minimize() so it knows which layer it's working on. The subroutine can then address the velocity/RMS arrays correctly.
  3. Possibly the simplest approach: treat all model parameters and gradients as 1-d arrays, get their size via net % get_num_params(), and update them via call net % set_params() (see the sketch below). The downside is that gathering the gradients from all layers in one place will require a new allocation (maybe; let's see).
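A minimal sketch of approach 3, assuming the existing get_params/set_params methods and a get_gradients method that is not implemented yet:

subroutine minimize_flat(net, learning_rate)
  ! Sketch of approach 3: treat all parameters and gradients as 1-d arrays.
  use nf, only: network
  type(network), intent(inout) :: net
  real, intent(in) :: learning_rate
  real, allocatable :: gradients(:)
  gradients = net % get_gradients()   ! one fresh allocation per update
  call net % set_params(net % get_params() - learning_rate * gradients)
end subroutine minimize_flat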

milancurcic commented 1 year ago

I took approach 3 above for the time being. We also need layer % get_gradients() and network % get_gradients() methods, which are not implemented yet; they will be essentially the same procedures as layer % get_params() and network % get_params().
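For illustration, a sketch of the network-level method; it assumes layer % get_gradients() returns the layer's gradients packed into a 1-d array, just as layer % get_params() does for its parameters:

function get_gradients(self) result(gradients)
  ! Concatenate per-layer gradients into one flat array, mirroring
  ! how network % get_params concatenates per-layer parameters.
  class(network), intent(in) :: self
  real, allocatable :: gradients(:)
  integer :: n
  allocate(gradients(0))
  do n = 1, size(self % layers)
    gradients = [gradients, self % layers(n) % get_gradients()]
  end do
end function get_gradients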

Because the velocity in SGD and the rms_gradient in RMSProp are allocatable arrays, they can't be used in the arithmetic inside elemental minimize methods, so minimize is no longer elemental.
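For context, here is roughly what the non-elemental SGD minimize looks like over flat arrays; the velocity, momentum, and nesterov names follow this thread, the learning_rate component is an assumption, and the arithmetic is the standard classic/Nesterov momentum update rule:

subroutine minimize(self, param, gradient)
  ! No longer elemental: param, gradient, and self % velocity are arrays.
  class(sgd), intent(inout) :: self
  real, intent(inout) :: param(:)
  real, intent(in) :: gradient(:)
  if (self % momentum > 0) then
    ! Classic momentum: accumulate an exponentially decaying velocity.
    self % velocity = self % momentum * self % velocity &
                    - self % learning_rate * gradient
    if (self % nesterov) then
      ! Nesterov: apply a look-ahead step using the updated velocity.
      param = param + self % momentum * self % velocity &
            - self % learning_rate * gradient
    else
      param = param + self % velocity
    end if
  else
    ! Plain SGD.
    param = param - self % learning_rate * gradient
  end if
end subroutine minimize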

Once getting the layer gradients is implemented and everything is hooked up correctly, it should just work for both SGD (with or without classic or Nesterov momentum) and RMSProp. Once it does, we'll merge this huge PR and make a new minor release.

milancurcic commented 1 year ago

We're almost there. Now we just need to add a test suite for the optimizer classes to ensure that they both converge on a simple problem, e.g. the one in examples/simple.f90.
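Something along these lines could work (a sketch modeled on examples/simple.f90; the architecture, data, learning rate, tolerance, and iteration count are all illustrative):

program test_optimizer_convergence
  use nf, only: dense, input, network, sgd
  implicit none
  type(network) :: net
  real :: x(3), y(2)
  real, parameter :: tolerance = 1e-3
  integer :: n

  ! Toy problem in the spirit of examples/simple.f90:
  ! fit a single input/output pair.
  x = [0.2, 0.4, 0.6]
  y = [0.123456, 0.246802]
  net = network([input(3), dense(5), dense(2)])

  do n = 1, 1000
    call net % forward(x)
    call net % backward(y)
    call net % update(sgd(learning_rate=1., momentum=0.9))
  end do

  if (all(abs(net % predict(x) - y) < tolerance)) then
    print '(a)', 'SGD with momentum: convergence test passed.'
  else
    print '(a)', 'SGD with momentum: convergence test FAILED.'
  end if
end program test_optimizer_convergence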

Spnetic-5 commented 1 year ago

Sure, I'll try to implement the test suite.

milancurcic commented 1 year ago

@Spnetic-5 Thanks! To clarify: since the default SGD is already being tested implicitly in e.g. examples/test_dense_network.f90 and elsewhere (the simple and sine examples), for this test suite let's focus on the newly added options: SGD with classic momentum, SGD with Nesterov momentum, and RMSProp.

For ideas on how to write the test suite program, follow how other test programs are written.

milancurcic commented 1 year ago

BTW, I wrote in d27bf09 that the tests pass in debug mode but segfault in release mode. After I pushed it, both debug and release modes passed in CI. Then I went back and saw that I had been working locally with GFortran v9. When I switched to v10, both modes passed the tests locally as well.

Spnetic-5 commented 1 year ago

@milancurcic I have added a draft test suite; please confirm whether I am on the right track and share any suggestions before I proceed further.

Also, are we continuing the test suite in this same PR, or should I open a separate PR for it?

milancurcic commented 1 year ago

Thanks @Spnetic-5, yes, the structure of the program is good. I see that you're not actually testing for convergence yet; I assume that's still on your TODO list.

I realized that for optimizers that need to store state (the velocity in SGD with momentum, and RMSProp in general), we actually need to store the optimizer instance as part of the network type so that it can preserve its state between successive net % update() calls. Currently we always pass a freshly created optimizer instance, so no state is preserved. As you're working on the test suite, can you also add a

class(optimizer_base_type), allocatable :: optimizer

member to the network type? And then in train and update we need to store it on first call, but leave it untouched on successive calls.

You can use allocated(self % optimizer) to determine whether it has been allocated yet, i.e. whether it still needs to be set on the first call.
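For example, the first-call logic inside update could look like this (a sketch; the optimizer dummy argument name is illustrative):

! On the first call, store a copy of the passed optimizer (and thus its
! state); on successive calls, leave the stored instance untouched.
if (.not. allocated(self % optimizer)) then
  allocate(self % optimizer, source=optimizer)
end if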

milancurcic commented 1 year ago

@Spnetic-5 I went ahead and added the optimizer instance as a member of the network as it may not be a trivial thing to do, so you can focus on implementing the convergence tests.

Spnetic-5 commented 1 year ago

> @Spnetic-5 I went ahead and added the optimizer instance as a member of the network as it may not be a trivial thing to do, so you can focus on implementing the convergence tests.

Thanks @milancurcic, sure, I'm working on the convergence tests.