nedtaylor / athena

A Fortran-based feed-forward neural network library. Whilst this library currently has a focus on 3D convolutional neural networks (CNNs), it can handle most standard hidden layer forms of neural networks, with the plan to integrate more.
MIT License

Segfault running mnist example #45

Open ofmla opened 1 month ago

ofmla commented 1 month ago

Hi, I am getting the following error when running the mnist example via fpm

$ fpm run --example mnist --profile debug --compiler ifort
mod_constants.f90                      done.
mod_random.f90                         done.
mod_constants.f90                      done.
mod_misc.f90                           done.
mod_types.f90                          done.
mod_clipper.f90                        done.
mod_misc.f90                           done.
mod_accuracy.f90                       done.
mod_normalisation.f90                  done.
mod_lr_decay.f90                       done.
mod_metrics.f90                        done.
mod_loss.f90                           done.
mod_regulariser.f90                    done.
mod_activation_relu.f90                done.
mod_activation_tanh.f90                done.
mod_base_layer.f90                     done.
mod_initialiser_lecun.f90              done.
mod_misc_ml.f90                        done.
mod_activation_piecewise.f90           done.
mod_activation_linear.f90              done.
mod_initialiser_zeros.f90              done.
mod_initialiser_gaussian.f90           done.
mod_activation_gaussian.f90            done.
mod_initialiser_glorot.f90             done.
mod_activation_softmax.f90             done.
mod_activation_leaky_relu.f90          done.
mod_initialiser_ones.f90               done.
mod_initialiser_ident.f90              done.
mod_initialiser_he.f90                 done.
mod_optimiser.f90                      done.
mod_activation_sigmoid.f90             done.
mod_tools_infile.f90                   done.
mod_activation_none.f90                done.
mod_avgpool2d_layer.f90                done.
mod_flatten2d_layer.f90                done.
mod_initialiser.f90                    done.
mod_maxpool2d_layer.f90                done.
mod_input1d_layer.f90                  done.
mod_container_layer.f90                done.
mod_input4d_layer.f90                  done.
mod_flatten4d_layer.f90                done.
mod_avgpool1d_layer.f90                done.
mod_maxpool3d_layer.f90                done.
mod_flatten3d_layer.f90                done.
mod_avgpool3d_layer.f90                done.
mod_dropblock2d_layer.f90              done.
mod_base_layer_sub.f90                 done.
mod_dropblock3d_layer.f90              done.
mod_dropout_layer.f90                  done.
mod_input2d_layer.f90                  done.
mod_flatten1d_layer.f90                done.
mod_maxpool1d_layer.f90                done.
mod_activation.f90                     done.
mod_input3d_layer.f90                  done.
mod_batchnorm1d_layer.f90              done.
mod_conv2d_layer.f90                   done.
mod_conv1d_layer.f90                   done.
mod_conv3d_layer.f90                   done.
mod_network.f90                        done.
mod_batchnorm2d_layer.f90              done.
mod_full_layer.f90                     done.
mod_batchnorm3d_layer.f90              done.
mod_container_layer_sub.f90            done.
athena.f90                             done.
mod_network_sub.f90                    done.
main.f90                               done.
main.f90                               done.
mod_inputs.f90                         done.
mod_read_mnist.f90                     done.
libathena.a                            done.
main.f90                               done.
main.f90                               done.
main.f90                               done.
main.f90                               done.
mnist_3D                               done.
sine                                   done.
mnist                                  done.
mnist_bn                               done.
simple                                 done.
mnist_drop                             done.
[100%] Project compiled successfully.
Using file 'example/mnist/test_job.in'
Metric: accuracy, threshold:  0.100E-01
 Stocastic Gradient Descent momentum-based adaptive learning method
 momentum = 0.9000000
 No regularisation set
 Dropout method: none
 ======PARAMETERS======
 shuffle dataset: T
 batch learning: T
 learning rate: 9.9999998E-03
 number of epochs: 10
 number of filters: 32
 hidden layers: 100
 ======================
 Data read
 Data read
 Shuffling training dataset...
 Training dataset shuffled
 Initialising CNN...
 CONV2D input gradients turned off
CONV2D activation function: relu
CONV2D kernel initialiser: he_uniform
CONV2D bias initialiser: zeros
FULL activation function: relu
FULL kernel initialiser: he_uniform
FULL bias initialiser: he_uniform
FULL activation function: softmax
FULL kernel initialiser: glorot_uniform
FULL bias initialiser: glorot_uniform
 Loss method: Categorical Cross Entropy
 layer: 1 inpt
 30 30 1
 30 30 1
 layer: 2 conv
 28 28 1
 28 28 32
 layer: 3 pool
 28 28 32
 14 14 32
 layer: 4 full
 6272
 100
 layer: 5 full
 100
 10
 NUMBER OF LAYERS 6
 Starting training...
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source             
libpthread-2.28.s  00007F9CF2633CF0  Unknown               Unknown  Unknown
mnist              00000000004910AB  full_layer_MP_bac         726  mod_full_layer.f90
mnist              0000000000480C7E  full_layer_MP_bac         280  mod_full_layer.f90
mnist              0000000000423793  container_layer_M         143  mod_container_layer_sub.f90
mnist              00000000005FAACC  network_MP_backwa        1001  mod_network_sub.f90
mnist              000000000060183C  network_MP_train_        1219  mod_network_sub.f90
mnist              00000000004085B4  MAIN__                    147  main.f90
mnist              0000000000403D4D  Unknown               Unknown  Unknown
libc-2.28.so       00007F9CF2091D85  __libc_start_main     Unknown  Unknown
mnist              0000000000403C6E  Unknown               Unknown  Unknown
<ERROR> Execution for object " mnist " returned exit code  174
<ERROR> *cmd_run*:stopping due to failed executions
STOP 174

My compiler version is

$ ifort -v 
ifort version 2021.10.0

The sine and simple examples run without problems.

nedtaylor commented 1 month ago

Hi @ofmla, thanks for sending over this problem. 🙂 I’ll have a look into this. Three questions:

- What version (or branch) of athena are you using?
- What version of fpm are you using?
- Are you using the default settings for the input file?

ofmla commented 1 month ago

Hi, I just followed the instructions in the README, cloned the repo, downloaded the MNIST dataset from the indicated link, and ran the examples with fpm. I am using the main branch, the fpm version is 0.10.1-alpha, and yes, I used the default settings. I also tried with smaller batch sizes.

I was able to run the mnist example with gfortran (gcc 12.3 and gcc 13.2), which are the tested versions indicated in the README, but not with ifx version 2023.2.0 or ifort version 2021.10.0. I couldn't test gfortran (gcc 14.1.0) as I don't have that one available.

nedtaylor commented 1 month ago

Thanks for all of the details, @ofmla. 🙂

In my initial test, it runs with ifort version 2021.7.0 without any problems (fpm version 0.9.0-alpha). Here are the steps I went through:

  git clone https://github.com/nedtaylor/athena.git
  cd athena
  emacs example/mnist/test_job.in ##to change the 'dataset_dir' line
  fpm run --example mnist --profile debug --compiler ifort

I don't currently have access to my computer with 2021.10.0 installed, but I'll see if I can get it on this one and check whether the versions behave differently.

As a note to self: the breaking line appears to be the following in src/lib/mod_full_layer.f90:

    bias_diff = this%transfer%differentiate([1._real12])

nedtaylor commented 1 month ago

I have not been able to get ifort 2021.10 on my computer, but I now have the setup running on a GitHub Action and I can confirm that I get the same error.

On the GitHub Action, the error also occurs with ifort 2021.7 and ifx 2023.2. So this might be an architecture issue.

nedtaylor commented 1 month ago

This issue cannot be reproduced with the following Linux operating systems:

ofmla commented 1 month ago

I would like to ask a question outside the initial topic of this issue, which was the report of a possible bug. Would it be possible to implement a simple convolutional network like ResNet with Athena? I ask this because I see that the design of the convolutional layers in ResNet is Conv2d -> BN -> ReLU. By looking at the source code quickly, I see that it's possible to pass None as the activation to Conv2d, but how can I add the activation after the BatchNorm layer? I’m still in the early stages of learning about machine learning in general, and I’m not sure if the order of layers matters, so any guidance would be greatly appreciated.

nedtaylor commented 1 month ago

Quick update on this issue. I am having real difficulty fixing this bug, as it doesn't appear on any of the machines I develop on and have access to (I mostly code on macOS and older Linux versions; macOS isn't supported by ifort, whilst my Linux versions don't reproduce this). GitHub Actions reproduce the issue, but testing via those is a very slow process.

> I would like to ask a question outside the initial topic of this issue, which was the report of a possible bug. Would it be possible to implement a simple convolutional network like ResNet with Athena? I ask this because I see that the design of the convolutional layers in ResNet is Conv2d -> BN -> ReLU. By looking at the source code quickly, I see that it's possible to pass None as the activation to Conv2d, but how can I add the activation after the BatchNorm layer? I’m still in the early stages of learning about machine learning in general, and I’m not sure if the order of layers matters, so any guidance would be greatly appreciated.

The ResNet convolutional block that you describe cannot currently be reproduced with athena. However, this seems like something that should be implemented. The order of layers almost always matters (I only say "almost always" because there could be situations that I am unaware of). Athena was developed more in line with TensorFlow, where activation functions are built into the layers. I have not personally needed a layer that is just an activation function, but I can see the use of it. I will set up an issue to get it implemented. It shouldn't be too difficult, so it shouldn't take long to implement. Thanks for suggesting this. :)
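
For concreteness, here is a purely hypothetical sketch of the Conv2d -> BN -> activation ordering discussed above. The constructor names and arguments are assumptions inferred from the module names in the build log (mod_conv2d_layer, mod_batchnorm2d_layer), not athena's documented API, and the standalone activation layer is the proposed, not-yet-implemented feature; the README and bundled examples show the real interfaces.

    ! Hypothetical sketch only: the constructors below are assumed names, not
    ! athena's documented API, and activation_layer_type is the proposed
    ! standalone activation layer that does not exist yet.
    call network%add(conv2d_layer_type(num_filters=64, kernel_size=3, &
         activation_function='none'))               ! convolution, no activation
    call network%add(batchnorm2d_layer_type())      ! batch normalisation
    call network%add(activation_layer_type('relu')) ! activation applied after BN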

nedtaylor commented 1 month ago

Break encountered with the following setup:

The error is encountered with batch_size > 2. The break occurs on the first line of the full_layer_type backward_2d procedure call (the first line that either prints, if pure is removed, or the first line of maths).

nedtaylor commented 1 month ago

@ofmla Okay, it's not an issue with the code (although maybe the code should print a warning message when it expects this to happen).

In this example, the full_layer_type needs to store the gradients of the weights, and for the first full_layer_type in the model this array is 6272 x 100 x batch_size (for batch_size=32, that's over 20 million elements). To get around this (and this is why my other computers didn't encounter it, as I have this set by default), you need to use the following command in your terminal:

ulimit -s unlimited

However, I would caution you to consider whether this is something you want to use, as it sets your stack limit to unlimited, which COULD cause issues if not used carefully.

Why gfortran doesn't encounter this issue when ifort does, I don't know; I guess they handle large temporary arrays differently.
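
For a sense of the numbers involved, here is a minimal standalone sketch (not athena code, and assuming 4-byte default reals) of the array in question: 6272 x 100 x 32 elements is roughly 80 MB, far above a typical 8 MB default stack limit, so a compiler that places such an automatic array on the stack will segfault.

    ! Standalone illustration only (not athena code). The automatic array dw is
    ! ~20 million default reals (~80 MB assuming 4 bytes each); if the compiler
    ! puts it on the stack, it blows a typical 8 MB default stack limit.
    program stack_size_demo
      implicit none
      call backward_sketch()
    contains
      subroutine backward_sketch()
        integer, parameter :: n_in = 6272, n_out = 100, batch_size = 32
        real :: dw(n_in, n_out, batch_size)   ! automatic array
        dw = 0.0
        print '(a,i0,a,f6.1,a)', 'elements: ', size(dw), ', approx ', &
             real(size(dw)) * 4.0 / 1024.0**2, ' MB'
      end subroutine backward_sketch
    end program stack_size_demo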

I won't close this issue yet, as it does seem like I should add a verbose warning (and/or one in debug mode) to caution the user when an array like this is going to be over 1 million elements in size.

ofmla commented 1 month ago

Thank you for taking the time to investigate what the problem was.

nedtaylor commented 1 month ago

> Thank you for taking the time to investigate what the problem was.

No problem, thanks for bringing up the issue. 🙂 I haven't had a chance to work on this any more yet (I'm focusing on the residual network implementation instead, #47). But I have found that one of the edits I have made relating to issue #19 actually fixes/improves this (it reduces the size of the temporary storage of the weight gradients by solving one sample at a time), so I don't know if a warning message is needed anymore.
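
For illustration, here is a rough sketch of that idea (an assumption of the approach, not the actual athena edit for #19): accumulate the weight gradients one sample at a time into a single (n_in, n_out) array, so no (n_in, n_out, batch_size) temporary is needed.

    ! Illustrative sketch only, not the actual athena change for issue #19:
    ! accumulate weight gradients sample by sample so the temporary storage is
    ! (n_in, n_out) rather than (n_in, n_out, batch_size).
    module gradient_accumulation
      implicit none
    contains
      subroutine accumulate_gradients(input, delta, dw)
        real, intent(in)    :: input(:,:)  ! (n_in,  batch_size) layer inputs
        real, intent(in)    :: delta(:,:)  ! (n_out, batch_size) back-propagated errors
        real, intent(inout) :: dw(:,:)     ! (n_in,  n_out) accumulated weight gradients
        integer :: s

        do s = 1, size(input, dim=2)
           ! outer product of one sample's input and error, accumulated over the batch
           dw = dw + spread(input(:, s), dim=2, ncopies=size(dw, 2)) &
                   * spread(delta(:, s), dim=1, ncopies=size(dw, 1))
        end do
      end subroutine accumulate_gradients
    end module gradient_accumulation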