ofmla opened this issue 1 month ago
Hi @ofmla, thanks for sending over this problem. 🙂 I'll have a look into this. Three questions: what version (or branch) of athena are you using, what version of fpm are you using, and are you using the default settings for the input file?
Hi, I just followed the instructions in the README, cloned the repo, downloaded the MNIST dataset from the indicated link, and ran the examples with fpm. I am using the main branch, the fpm version is 0.10.1-alpha, and yes, I used the default settings. I also tried with smaller batch sizes.
I was able to run the mnist example with gfortran (gcc 12.3 and gcc 13.2), which are the tested versions indicated in the README, but not with ifx version 2023.2.0 or ifort version 2021.10.0. I couldn't test gfortran (gcc 14.1.0) as I don't have that one available.
Thanks for all of the details, @ofmla. 🙂
With my initial test, I get it running with ifort version 2021.7.0 without any problems (fpm version 0.9.0 alpha). Here are the steps I went through:
```sh
git clone https://github.com/nedtaylor/athena.git
cd athena
emacs example/mnist/test_job.in   # change the 'dataset_dir' line
fpm run --example mnist --profile debug --compiler ifort
```
I don't currently have access to my computer with 2021.10.0 on, but I'll see if I can get it on this one and see if the versions are different.
As a note to self: the line breaking it appears to be the following in src/lib/mod_full_layer.f90:

```fortran
bias_diff = this%transfer%differentiate([1._real12])
```
I have not been able to get ifort 2021.10 on my computer, but I have got the setup running on a GitHub action now and I can confirm that I get the same error.
On the GitHub Action, the error also occurs with ifort 2021.7 and ifx 2023.2. So this might be an architecture issue.
This issue cannot be reproduced with the following Linux operating systems:
I would like to ask a question outside the initial topic of this issue, which was the report of a possible bug. Would it be possible to implement a simple convolutional network like ResNet with Athena? I ask this because I see that the design of the convolutional layers in ResNet is Conv2d -> BN -> ReLU. By looking at the source code quickly, I see that it's possible to pass `None` as the activation to `Conv2d`, but how can I add the activation after the BatchNorm layer? I'm still in the early stages of learning about machine learning in general, and I'm not sure if the order of layers matters, so any guidance would be greatly appreciated.
Quick update on this issue. I am having real difficulty fixing this bug, as it doesn't appear on any of the machines I develop on and have access to (I mostly code on macOS and older Linux versions; macOS isn't supported by ifort, whilst my Linux versions don't reproduce this). GitHub Actions reproduce the issue, but testing via those is a very slow process.
> Would it be possible to implement a simple convolutional network like ResNet with Athena? [...] I see that it's possible to pass `None` as the activation to `Conv2d`, but how can I add the activation after the BatchNorm layer?
The convolutional block in ResNet that you describe cannot currently be reproduced with athena. However, this seems like something that should be implemented. The order of layers almost always matters (I only say "almost always" because there could be situations I am unaware of). Athena was developed more in line with TensorFlow, where activation functions are built into layers. I have not personally needed a separate layer that is just an activation function, but I can see the use of it. I will set up an issue to get it implemented. It shouldn't be too difficult, so it shouldn't take too long to add. Thanks for suggesting this. :)
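For reference, the layer ordering being asked about looks like this in Keras; this is purely to illustrate the Conv2d -> BN -> ReLU pattern and where a standalone activation layer comes in, not something athena currently provides:

```python
import tensorflow as tf

# Convolution with no built-in activation, then batch normalisation,
# and only then the ReLU as its own layer.
block = tf.keras.Sequential([
    tf.keras.layers.Conv2D(64, 3, padding="same", activation=None),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.ReLU(),
])
```

That trailing activation-only layer is the piece athena would need to gain.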
Break encountered with following setup:
Error encountered with `batch_size > 2`. The break occurs on the first line of the `full_layer_type` `backward_2d` procedure call (the first line that either prints, if `pure` is removed, or the first line of maths).
`batch_size=3` causes a break on the `delta(:,:)` evaluation line, whereas `batch_size=32` causes a break on the first line.

@ofmla Okay, it's not an issue with the code (although maybe the code should print a warning message when it expects this to happen).
In this example, the `full_layer_type` needs to store the gradients of the weights, and for the first `full_layer_type` in the model this is 6272 × 100 × batch_size elements (for `batch_size=32`, that's over 20 million). To get around this (and this is why my other computers didn't encounter it, as I have this set by default), you need to use the following command in your terminal:
```sh
ulimit -s unlimited
```
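For a rough sense of scale (my own back-of-the-envelope numbers, assuming real12 is a 64-bit kind): 6272 × 100 × 32 ≈ 20.1 million values, or roughly 160 MB for that one temporary array, far beyond the common 8 MB default stack limit (`ulimit -s` usually reports 8192 kB). That is consistent with a stack overflow rather than a bug in the maths.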
However, I would caution you to consider whether this is something you actually want, as setting the stack limit to unlimited COULD cause issues if not used carefully.
Why gfortran doesn't encounter this issue when ifort does, I don't know. I guess they have different data management issues.
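One possible explanation (an assumption on my part, not something I have verified for this case): ifort places automatic arrays and array temporaries on the stack by default, whereas gfortran is more inclined to put large temporaries on the heap. If that is what is happening here, ifort's `-heap-arrays` option (e.g. `fpm run --example mnist --profile debug --compiler ifort --flag "-heap-arrays"`) may avoid the need to raise the stack limit at all.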
I won't close this issue yet, as it does seem like I should add a verbose warning (and/or one in debug mode) to caution the user when an array like this is going to exceed 1 million elements.
Thank you for taking the time to investigate what the problem was.
No problem, thanks for bringing up the issue. 🙂 I haven't had a chance to work any more on this yet (I've been focusing on the residual network implementation in #47 instead). But I think one of the edits I have made relating to issue #19 actually fixes/improves this (it reduces the size of the temporary storage of the weight gradients by solving it one sample at a time), so I don't know if a warning message is needed anymore.
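The saving from that change is essentially the difference between the two patterns below, sketched in NumPy purely to show the idea (the array names and sizes are illustrative, not athena's actual code; the real first full layer is 6272 × 100 with batch_size=32):

```python
import numpy as np

n_in, n_out, batch = 64, 10, 8  # small stand-ins for 6272, 100, 32
x = np.random.rand(batch, n_in)       # layer inputs for the whole batch
delta = np.random.rand(batch, n_out)  # backpropagated errors

# Before: keep one gradient slab per sample -> n_in * n_out * batch values alive at once
per_sample = np.einsum('bi,bo->bio', x, delta)
dw_all_at_once = per_sample.sum(axis=0)

# After: accumulate one sample at a time -> only n_in * n_out values alive at once
dw_accum = np.zeros((n_in, n_out))
for b in range(batch):
    dw_accum += np.outer(x[b], delta[b])

assert np.allclose(dw_all_at_once, dw_accum)
```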
Hi, I am getting the following error when running the mnist example via fpm:

My compiler version is:

The sine and simple examples run without problems.