GSoC Optimizers: Example program to fit a quadratic function

Spnetic-5 commented 1 year ago

Solving #133 @milancurcic

Optimizers to be implemented:

[x] Batch gradient descent
[x] Mini-batch gradient descent
[x] Stochastic gradient descent

milancurcic commented 1 year ago

Thanks @Spnetic-5, looks like a good start. You already have the pure SGD example. Do you need any help going forward? To allow batch and mini-batch GDs, I suggest defining x and y data as 1-d arrays that will be your entire dataset. Then for SGD, feed x and y elements one at a time, for mini-batch subset multiple batches, and for batch GD pass the entire arrays.

Spnetic-5 commented 1 year ago

Thanks @Spnetic-5, looks like a good start. You already have the pure SGD example. Do you need any help going forward? To allow batch and mini-batch GDs, I suggest defining x and y data as 1-d arrays that will be your entire dataset. Then for SGD, feed x and y elements one at a time, for mini-batch subset multiple batches, and for batch GD pass the entire arrays.

Sure, Thank you for suggesting the approach of using 1-dimensional arrays for the dataset. I'm working on the optimizer code, I'll update the changes soon.

milancurcic commented 1 year ago

Thanks @Spnetic-5 for the work so far. Please study the changes in https://github.com/modern-fortran/neural-fortran/pull/134/commits/bda1968f70d0cbf03bb275cb0bbb043f74d3b102. There were a few important fixes to the code:

The training dataset was allocated to the size of the number of epochs (iterations), but the epochs size are really an outer loop to the sample size; I introduced the train_size parameter which I use to allocate the training x and y.
Not reusing the same network between 3 optimization approaches; each needs its own network instance.
net % forward and net % backward methods needed one sample at a time as inputs rather than whole arrays; this is something we can improve later by allowing to pass a batch of data at once.
Evaluating ypred for each optimization method.

I don't know if the results are correct yet, but the code compiles and produces lower errors with increasing epoch count. On my computer the minibatch GD produces very different results between debug and release profiles, so something still not quite correct there.

We're getting close!

Spnetic-5 commented 1 year ago

You're welcome! I apologize for the errors and the quality of the code pushed earlier. Thank you for pointing out the changes and fixes you made to the code. It's good to hear that the code is now compiling and producing lower errors with increasing epoch count.

I have carefully studied the changes you made, and I understand the modifications you've introduced. It's great to see that the code now compiles without errors and produces lower errors with increasing epoch count. I will continue to review the code and evaluate the results to ensure correctness.

In order to identify the underlying cause and rectify the issue, I will investigate the discrepancies in the minibatch GD results.

These are the results on my PC :

For 1000 epochs:

Stochastic gradient descent MSE: 0.001104
 Batch gradient descent MSE:  0.062504
Minibatch gradient descent MSE:  0.088675

For 5000 epochs:

Stochastic gradient descent MSE: 0.000449
 Batch gradient descent MSE:  0.071504
Minibatch gradient descent MSE:  0.000996

Here, BatchGD is showing a slight increase in MSE, I think as it updates the weights using the entire training dataset in each epoch. As the number of epochs increases, the model starts overfitting the train data, which leads to a higher MSE on test data.

milancurcic commented 1 year ago

@Spnetic-5 in SGD subroutine, can you shuffle the mini-batches so that it's truly stochastic? Currently it loops over the mini-batches in the same order every time. Here's my suggested approach:

Split the dataset into mini-batches (you already have this);
Shuffle the start indices of each mini-batch;
Loop over the shuffled start and and indices to extract the desired mini-batch.

The outcome should be that in each epoch the order of mini-batches is random and different. You can take inspiration from

https://github.com/modern-fortran/neural-fortran/blob/8293118220ec1e9c16abda766e3b273f161592b4/src/nf/nf_network_submodule.f90#L532-L535

but even there the mini-batches are not truly shuffled, but rather the start index is randomly selected so that in each epoch there are some data samples that may be unused and there are some that are used more than once.

Spnetic-5 commented 1 year ago

@milancurcic I have updated weekly progress on discourse, should I now proceed to next RMSProp or Adam optimizer, or there are any more changes required in current optimizers

jvdp1 commented 1 year ago

Thank you @Spnetic-5 for this PR. Currently this optimizer is only implemented in an example and is not available to other users through the library. Therefore my advice would be to look how to integrate this optimizer in the library. @milancurcic what should be the next step?

modern-fortran / neural-fortran

GSoC Optimizers: Example program to fit a quadratic function #134