modern-fortran / neural-fortran

A parallel framework for deep learning
MIT License

Test failure with ifx #167

Open aminiussi opened 9 months ago

aminiussi commented 9 months ago

Hi,

Is ifx (Intel's next-generation Fortran compiler, which is replacing ifort) supported? I'm getting the following failures with ifx 2023.1.0:

The following tests FAILED:
      6 - test_maxpool2d_layer (Failed)
     12 - test_io_hdf5 (Failed)
     14 - test_dense_network_from_keras (Failed)
     17 - test_optimizers (Failed)
Errors while running CTest

In a release build, these tests fail with a memory error.

In debug mode, only test_optimizers fails.

All this is on master

Thanks

milancurcic commented 9 months ago

Thanks for reporting. I haven't tried ifx in a while, and definitely not a recent version. I'll try it and let you know what I find.

milancurcic commented 8 months ago

Hi @aminiussi, I can't seem to reproduce this. Here's what I have:

$ ifx --version
ifx (IFX) 2023.2.0 20230721
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

HDF5 is 1.12.2 built with ifort-2021.6.

All tests pass on the latest main.

Similarly, all tests pass with ifort-2021.10.0 (that's the latest version released before deprecation in favor of ifx).

aminiussi commented 8 months ago

Hi @milancurcic,

The test in my build fails with "unmapped address" with the following stack trace:

6:  0 0x000000000004cb95 ucs_debug_print_backtrace()  ???:0
6:  1 0x0000000000415d17 nf_maxpool2d_layer_mp_backward_()  /scratch/alainm/view/neural-fortran/src/nf/nf_maxpool2d_layer_submodule.f90:107
6:  2 0x00000000004102b2 nf_layer_mp_backward_3d_()  /scratch/alainm/view/neural-fortran/src/nf/nf_layer_submodule.f90:0
6:  3 0x000000000040d2d5 MAIN__()  /scratch/alainm/view/neural-fortran/test/test_maxpool2d_layer.f90:77

One element of the backtrace is odd: /scratch/alainm/view/neural-fortran/src/nf/nf_layer_submodule.f90:0, since there is no code at that line.

We are using HDF5 1.14.1, and the underlying gfortran is 12.2.0. Apart from that, our ifx is slightly older...

aminiussi commented 8 months ago

$ ifx --version
ifx (IFX) 2023.2.0 20230721
Copyright (C) 1985-2023 Intel Corporation. All rights reserved.

Is that a parallel build and, if so, which MPI is used?

Thanks

aminiussi commented 8 months ago

I did a debug build with -check all. The test fails with:

forrtl: severe (408): fort: (3): Subscript #3 of the array MAXLOC_X has value 0 which is less than the lower bound of 1

In coarray image 4
Image              PC                Routine            Line        Source
test_maxpool2d_la  000000000042BD1A  backward                  106  nf_maxpool2d_layer_submodule.f90
test_maxpool2d_la  0000000000417470  backward_3d                87  nf_layer_submodule.f90
test_maxpool2d_la  000000000040E789  test_maxpool2d_la          77  test_maxpool2d_layer.f90
test_maxpool2d_la  000000000040B39D  Unknown               Unknown  Unknown
libc-2.17.so       00007FFFF3C84555  __libc_start_main     Unknown  Unknown
test_maxpool2d_la  000000000040B2CB  Unknown               Unknown  Unknown

milancurcic commented 8 months ago

Thank you, @aminiussi, this is very helpful and may be related to #145. It's possible that this is a bug that other compilers (and non-debug build modes) fail to catch while still producing incorrect results. I'll look deeper into this.
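
For readers unfamiliar with this class of error, here is a minimal sketch of it: a subscript that reaches 0 on an array whose lower bound is 1. This is not code from neural-fortran; the program, the array name idx, and the variable k are hypothetical and exist only for illustration. The explicit test spells out what runtime bounds checking (for example ifx -check all or gfortran -fcheck=bounds) does automatically. Without such checking, the out-of-bounds access is undefined behavior and may silently read unrelated memory or crash, which would be consistent with a release build failing on a memory error while a checked debug build reports the subscript.

```fortran
program zero_subscript_sketch
  ! Hypothetical illustration only; this is not code from neural-fortran.
  implicit none
  integer :: idx(2, 2, 3)  ! the default lower bound of every dimension is 1
  integer :: k

  idx = 1
  k = 0  ! stands in for an index that was computed incorrectly

  ! Explicit version of the test that runtime bounds checking performs
  ! before each array access.
  if (k < lbound(idx, dim=3) .or. k > ubound(idx, dim=3)) then
    print '(a, i0, a, i0)', 'subscript 3 is ', k, &
      ', lower bound is ', lbound(idx, dim=3)
  else
    print *, idx(1, 1, k)
  end if
end program zero_subscript_sketch
```

The guard keeps the sketch safe to run with any compiler. Removing it and accessing idx(1, 1, k) directly would reproduce the undefined access that a bounds-checked build is meant to flag.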

Is that a parallel build and, if so, which MPI is used?

I haven't built in parallel with the Intel compilers. The MPI on my machine is the Intel MPI that comes bundled with the oneAPI suite, but I don't think I configured it properly, and I haven't had time to dedicate to a parallel Intel build.