payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

ACCESS-OM2 crashes reading atmosphere/input.nml #274

Open aekiss opened 4 years ago

aekiss commented 4 years ago

On rare occasions ACCESS-OM2 crashes with

forrtl: severe (24): end-of-file during read, unit -129, file /scratch/x77/aek156/access-om2/work/01deg_jra55v140_iaf/atmosphere/input.nml
Image              PC                Routine            Line        Source
fms_ACCESS-OM_08c  0000000002EC7F4B  Unknown               Unknown  Unknown
fms_ACCESS-OM_08c  0000000002F0571E  Unknown               Unknown  Unknown
fms_ACCESS-OM_08c  000000000040FBA7  MAIN__.V                  183  ocean_solo.F90
fms_ACCESS-OM_08c  000000000040F922  Unknown               Unknown  Unknown
libc-2.28.so       0000149968F1D873  __libc_start_main     Unknown  Unknown
fms_ACCESS-OM_08c  000000000040F82E  Unknown               Unknown  Unknown

For example, see /scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf/*_logs/*7187716*

I think @penguian mentioned he had the same issue last week.

It's non-reproducible - sweeping and resubmitting fixes the problem, so I've modified resub.sh to include atmosphere/input.nml.

This is a weird issue - MOM is looking for input.nml in atmosphere, not ocean. input.nml is not present in atmosphere in the control directory, but work/atmosphere/input.nml exists (and is empty) in a crashed run:

$ ls -l work/atmosphere/
total 152
-rw-r--r-- 1 aek156 x77    348 Jun 17 23:46 atm.nml
-rw-r--r-- 1 aek156 x77   2189 Jun 17 23:46 forcing.json
drwxr-s--- 2 aek156 x77 131072 Jun 17 23:46 INPUT
-rw-r----- 1 aek156 x77      0 Jun 17 23:47 input.nml
drwxr-s--- 2 aek156 x77  16384 Jun 17 23:46 log
lrwxrwxrwx 1 aek156 x77     51 Jun 17 23:46 yatm_a6e5d87.exe -> /g/data/ik11/inputs/access-om2/bin/yatm_a6e5d87.exe

whereas in a normal run work/atmosphere/input.nml doesn't exist:

$ ls -l work/atmosphere/
total 276
-rw-r--r-- 1 aek156 x77    348 Jun 18 07:21 atm.nml
-rw-r----- 1 aek156 x77     65 Jun 18 07:22 debug.root.02
-rw-r--r-- 1 aek156 x77   2189 Jun 18 07:21 forcing.json
drwxr-s--- 2 aek156 x77 131072 Jun 18 07:21 INPUT
drwxr-s--- 2 aek156 x77  16384 Jun 18 07:22 log
-rw-r----- 1 aek156 x77 119840 Jun 18 07:22 nout.000000
lrwxrwxrwx 1 aek156 x77     51 Jun 18 07:21 yatm_a6e5d87.exe -> /g/data/ik11/inputs/access-om2/bin/yatm_a6e5d87.exe
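Since the crashed run differs from the normal one by a zero-byte input.nml in a submodel directory, that signature could be checked automatically before resubmitting. The following is only a minimal sketch, not part of payu or resub.sh: the function name `find_empty_namelists` is hypothetical, and it assumes the submodel directories (atmosphere, ocean, ...) sit directly under the work directory as in the listings above.

```python
from pathlib import Path


def find_empty_namelists(work_dir, name="input.nml"):
    """Return zero-byte files called `name` found one level below work_dir.

    Hypothetical helper: assumes each submodel has its own directory
    directly under work_dir (e.g. work/atmosphere, work/ocean).
    """
    work = Path(work_dir)
    return sorted(
        p for p in work.glob(f"*/{name}")
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    # Report any suspicious empty namelists in ./work
    for path in find_empty_namelists("work"):
        print(f"empty namelist: {path}")
```

An empty result does not prove the run was healthy; it only rules out this particular symptom.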

Also, I'm not sure if it's relevant, but /scratch/x77/aek156/access-om2/archive/01deg_jra55v140_iaf/pbs_logs/01deg_jra55_iaf.e7187716 contains:

[gadi-cpu-clx-0675.gadi.nci.org.au:03917] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0675.gadi.nci.org.au:03917] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0675.gadi.nci.org.au:03917] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0675.gadi.nci.org.au:03917] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1700
[gadi-cpu-clx-0652.gadi.nci.org.au:56339] PMIX ERROR: UNREACHABLE in file ../../../../../../../../../../opal/mca/pmix/pmix3x/pmix/src/mca/ptl/tcp/ptl_tcp_component.c at line 1744
[gadi-cpu-clx-0040.gadi.nci.org.au:19442] [[32047,0],41] ORTE_ERROR_LOG: Not found in file ../../../../orte/mca/grpcomm/base/grpcomm_base_stubs.c at line 354
[gadi-cpu-clx-0040.gadi.nci.org.au:19442] [[32047,0],41] ORTE_ERROR_LOG: Not found in file ../../../../orte/mca/grpcomm/base/grpcomm_base_stubs.c at line 278
[gadi-cpu-clx-0040.gadi.nci.org.au:19442] [[32047,0],41] ORTE_ERROR_LOG: Not found in file ../../../../../orte/mca/grpcomm/direct/grpcomm_direct.c at line 187

aidanheerdegen commented 4 years ago

Weird. That is here:

https://github.com/mom-ocean/MOM5/blob/master/src/accessom_coupler/ocean_solo.F90#L182-L183

which suggests that the ocean model thinks its working directory is work/atmosphere. I wonder how that can happen. Unfortunately we don't capture the payu run command line, which would help confirm that nothing odd happened there. Extremely unlikely, but worth ruling out.
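As a rough illustration of what capturing the command line might look like, here is a sketch that records the launch command and the resolved working directory as one JSON line. It is not payu code: `launch_record`, `append_record` and the log file name are hypothetical, and the argv in the usage example is made up.

```python
import json
import os
import shlex
import sys
from datetime import datetime, timezone


def launch_record(argv=None, cwd=None):
    """Build a dict describing a launch: timestamp, resolved cwd, command."""
    argv = list(sys.argv if argv is None else argv)
    cwd = os.getcwd() if cwd is None else cwd
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "cwd": os.path.realpath(cwd),      # resolve symlinks and relative paths
        "command": shlex.join(argv),       # shell-quoted, safe to copy and paste
    }


def append_record(log_path, record):
    """Append the record to log_path as a single JSON line."""
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    # Illustrative command only; not taken from an actual payu run.
    rec = launch_record(["mpirun", "-wdir", "work/ocean", "model.exe"])
    append_record("launch_log.jsonl", rec)
```

Comparing such records between a crashed and a normal run would show whether the working directory passed to the ocean executable ever differed.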

aekiss commented 4 years ago

It's also odd that input.nml actually exists in atmosphere (albeit as an empty file).