modern-fortran / neural-fortran

A parallel framework for deep learning
MIT License

CNN training on MNIST does not converge #145

Open milancurcic opened 1 year ago

milancurcic commented 1 year ago

The above suggests that the forward passes of the conv2d, maxpool2d, and flatten layers are implemented correctly.

The culprit may be in the implementation of backward methods for any of these layers, or in the backward flow of data.

This should be fixed before the release of v0.13.0.
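
One way to isolate a broken backward method is a finite-difference gradient check: perturb each weight, re-evaluate the forward pass, and compare the numerical gradient against the analytic one. Below is a minimal, self-contained sketch of the idea on a toy layer (not neural-fortran code):

program gradient_check
  ! Self-contained sketch of a finite-difference gradient check on a toy
  ! "layer": y = sum(max(w*x, 0)), loss = 0.5 * y**2.  A correct backward
  ! pass must reproduce the numerical gradient of the forward pass to
  ! within O(eps**2).
  implicit none
  integer, parameter :: n = 5
  real, parameter :: eps = 1e-3
  real :: x(n), w(n), wp(n), wm(n)
  real :: y, grad_analytic(n), grad_numeric(n)
  integer :: i

  call random_number(x); x = x - 0.5
  call random_number(w); w = w - 0.5

  ! Analytic gradient (the "backward pass"): dL/dw_i = y * x_i where w_i*x_i > 0
  y = sum(max(w * x, 0.))
  grad_analytic = merge(y * x, 0., w * x > 0)

  ! Numerical gradient from central differences of the forward pass only
  do i = 1, n
    wp = w; wp(i) = wp(i) + eps
    wm = w; wm(i) = wm(i) - eps
    grad_numeric(i) = (loss(wp, x) - loss(wm, x)) / (2 * eps)
  end do

  print '(a, es12.4)', 'max abs difference: ', maxval(abs(grad_analytic - grad_numeric))

contains

  pure real function loss(w, x)
    real, intent(in) :: w(:), x(:)
    loss = 0.5 * sum(max(w * x, 0.))**2
  end function loss

end program gradient_check

The same check applies per layer once the parameters and the gradients computed by the backward pass can be extracted and compared; a large disagreement points at the suspect backward implementation.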

certik commented 5 months ago

Here is an example output that I am getting:

$ fpm run --example cnn_mnist --profile release --flag "-fno-frontend-optimize -I$CONDA_PREFIX/include -L$CONDA_PREFIX/lib -Wl,-rpath -Wl,$CONDA_PREFIX/lib"
Layer: input
------------------------------------------------------------
Output shape: 784
Parameters: 0

Layer: reshape
------------------------------------------------------------
Input shape: 784
Output shape: 1 28 28
Parameters: 0
Activation: 

Layer: conv2d
------------------------------------------------------------
Input shape: 1 28 28
Output shape: 8 26 26
Parameters: 80
Activation: relu

Layer: maxpool2d
------------------------------------------------------------
Input shape: 8 26 26
Output shape: 8 13 13
Parameters: 0
Activation: 

Layer: conv2d
------------------------------------------------------------
Input shape: 8 13 13
Output shape: 16 11 11
Parameters: 1168
Activation: relu

Layer: maxpool2d
------------------------------------------------------------
Input shape: 16 11 11
Output shape: 16 5 5
Parameters: 0
Activation: 

Layer: flatten
------------------------------------------------------------
Input shape: 16 5 5
Output shape: 400
Parameters: 0
Activation: 

Layer: dense
------------------------------------------------------------
Input shape: 400
Output shape: 10
Parameters: 4010
Activation: softmax

Epoch  1 done, Accuracy:  9.91 %
Epoch  2 done, Accuracy:  9.91 %
Epoch  3 done, Accuracy:  9.91 %
Epoch  4 done, Accuracy:  9.91 %
...

The accuracy stays at 9.91 % for all subsequent epochs, which is roughly chance level for 10 classes, so the network is not learning.
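
For reference, the layer summary above corresponds to a network constructed roughly as follows. This is a sketch inferred from the printed shapes and parameter counts, using the layer constructors exported by the nf module, with string activations assumed (other versions may take activation objects instead):

program cnn_mnist_sketch
  ! Sketch only: constructor names as exported by the nf module; string
  ! activations assumed.  Shapes and parameter counts in the comments
  ! match the summary printed above.
  use nf, only: network, input, reshape, conv2d, maxpool2d, flatten, dense
  implicit none
  type(network) :: net

  net = network([ &
    input(784), &                                            ! 784
    reshape([1, 28, 28]), &                                  ! 1 x 28 x 28
    conv2d(filters=8, kernel_size=3, activation='relu'), &   ! 8 x 26 x 26, 80 parameters
    maxpool2d(pool_size=2), &                                ! 8 x 13 x 13
    conv2d(filters=16, kernel_size=3, activation='relu'), &  ! 16 x 11 x 11, 1168 parameters
    maxpool2d(pool_size=2), &                                ! 16 x 5 x 5
    flatten(), &                                             ! 400
    dense(10, activation='softmax') &                        ! 10, 4010 parameters
  ])

  call net % print_info()  ! prints a layer summary like the one above

end program cnn_mnist_sketch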

certik commented 5 months ago

Git bisect reveals #142:

6bbc28d123cdec20140331edc60df106d518a202 is the first bad commit
commit 6bbc28d123cdec20140331edc60df106d518a202
Author: Milan Curcic <caomaco@gmail.com>
Date:   Thu Jun 22 11:27:03 2023 -0400

    Connect `flatten`, `conv2d`, and `maxpool2d` layers in backward pass (#142)

    * Connect flatten, conv2d, and maxpool2d layers in backward pass

    * Bump minor version

 fpm.toml                        |  2 +-
 src/nf/nf_network_submodule.f90 | 16 +++++++++++-----
 2 files changed, 12 insertions(+), 6 deletions(-)

milancurcic commented 4 months ago

Tests with a minimal CNN on randomly selected constant inputs/outputs converge fine (#174). The problem with training the CNN on MNIST may lie elsewhere, or the bug is more subtle than I previously suspected. More tests of intermediate complexity are needed to narrow it down.
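
As a concrete example of such an intermediate-complexity test, one could train a small CNN on a handful of fixed synthetic samples with one-hot targets and check that the predictions converge, in the spirit of #174. The sketch below assumes the public nf constructors plus the train, predict, and sgd interfaces used by the examples; exact signatures may differ by version.

program cnn_synthetic_test
  ! Sketch of an intermediate-complexity convergence test: a small CNN
  ! trained on four fixed synthetic samples with one-hot targets.
  ! The train/predict calls and the sgd optimizer follow the public nf
  ! API used by the examples; exact signatures may differ by version.
  use nf, only: network, input, reshape, conv2d, maxpool2d, flatten, dense, sgd
  implicit none
  type(network) :: net
  real :: x(784, 4), y(10, 4)
  integer :: n

  call random_number(x)   ! four fixed random "images"
  y = 0
  do n = 1, 4
    y(n, n) = 1           ! sample n gets label n (one-hot)
  end do

  net = network([ &
    input(784), &
    reshape([1, 28, 28]), &
    conv2d(filters=4, kernel_size=3, activation='relu'), &
    maxpool2d(pool_size=2), &
    flatten(), &
    dense(10, activation='softmax') &
  ])

  ! With a correct backward pass, a few hundred epochs on the same four
  ! samples should drive each prediction toward its one-hot target.
  call net % train(x, y, batch_size=4, epochs=500, optimizer=sgd(learning_rate=1.))

  do n = 1, 4
    print '(a, i0, a, *(f6.3))', 'sample ', n, ' prediction:', net % predict(x(:, n))
  end do

end program cnn_synthetic_test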