payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Incorrect Restart File Names for MOM6 in payu 1.1 #430

Closed ezhilsabareesh8 closed 7 months ago

ezhilsabareesh8 commented 8 months ago

In payu 1.1, the restart file names for MOM6 are incorrect. MOM6 looks for files with the wrong names, which produces warnings such as:

WARNING from PE 0: MOM_restart: Unable to find restart file : ./GMOM_JRA.mom6.r.1900-01-02-00000_1.nc.nc
WARNING from PE 0: MOM_restart: Unable to find restart file : ./GMOM_JRA.mom6.r.1900-01-02-00000_2.nc.nc
WARNING from PE 0: MOM_restart: Unable to find restart file : ./GMOM_JRA.mom6.r.1900-01-02-00000_3.nc.nc
WARNING from PE 0: MOM_restart: Unable to find restart file : ./GMOM_JRA.mom6.r.1900-01-02-00000_4.nc.nc

As seen in the warning messages, the file extension .nc.nc is incorrect and seems to be duplicated, resulting in MOM6 being unable to locate the required restart files.

minghangli-uni commented 8 months ago

@ezhilsabareesh8 I did a quick check using payu 1.1 but couldn't reproduce your error. Starting fresh with a clean clone of ryf might resolve the issue.

Below is what I received in my access-om3.out:

NOTE from PE 0: MOM_restart: MOM run restarted using : ./access-om3.mom6.r.1900-02-01-00000.nc

aekiss commented 8 months ago

@minghangli-uni hits this bug only for 0.25 deg, not 1 deg: https://github.com/COSIMA/access-om3/issues/101#issuecomment-2019281472

Could it be a MOM6 configuration problem in 0.25 deg? Here's a comparison between 1deg and 0.25deg: https://github.com/COSIMA/MOM6-CICE6/compare/1deg_jra55do_ryf...025deg_jra55do_ryf_iss101

aekiss commented 8 months ago

Could the RESTART_CONTROL difference be relevant? https://github.com/COSIMA/MOM6-CICE6/compare/1deg_jra55do_ryf...025deg_jra55do_ryf_iss101#diff-bf0915852240640bb6bc6b27a0d786446acb8f242710b1757994086c2e8b91ba

aidanheerdegen commented 8 months ago

Sounds like something to add to #421, if it needs to always be set to a particular value.

anton-seaice commented 8 months ago

In this configuration, MOM is producing 5 restart files:

$ cat rpointer.ocn 
access-om3.mom6.r.1900-02-01-00000.nc
access-om3.mom6.r.1900-02-01-00000_1.nc
access-om3.mom6.r.1900-02-01-00000_2.nc
access-om3.mom6.r.1900-02-01-00000_3.nc
access-om3.mom6.r.1900-02-01-00000_4.nc

They are formatted 64-bit offset and have size 3.6GB. I think the maximum size for netcdf 64-bit-offset is 3.6GB, which might be why there are 5 files. (It looks like FMS configs produce multiple restart files too, just they are labelled differently).
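The format of a restart file can be checked without any netCDF tooling, because each on-disk netCDF variant begins with a distinct byte signature. The following is a minimal sketch, not anything payu or MOM6 ships: `netcdf_kind` is a hypothetical helper name, and it assumes the signature sits at byte 0 of the file (HDF5 files with a user block would be reported as unknown).

```python
# Leading bytes that identify the classic-family netCDF variants.
_MAGIC = {
    b"CDF\x01": "classic",
    b"CDF\x02": "64-bit offset",
    b"CDF\x05": "64-bit data (CDF-5)",
}
# netCDF-4 files are HDF5 files, which start with this 8-byte signature.
_HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def netcdf_kind(path):
    """Return a label for the netCDF variant of `path`, or 'unknown'."""
    with open(path, "rb") as f:
        header = f.read(8)
    if header == _HDF5_SIGNATURE:
        return "netCDF-4 (HDF5)"
    return _MAGIC.get(header[:4], "unknown")
```

Calling `netcdf_kind("restart000/access-om3.mom6.r.1900-02-01-00000.nc")` on the files listed above would be one way to confirm that they really are 64-bit offset.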

However, payu (I guess) is not moving the files correctly after a run:

$ ls restart000/
access-om3.cice.r.1900-02-01-00000.nc  access-om3.datm.r.1900-02-01-00000.nc  access-om3.mom6.r.1900-02-01-00000.nc  rpointer.cpl  rpointer.ocn
access-om3.cpl.r.1900-02-01-00000.nc   access-om3.drof.r.1900-02-01-00000.nc  rpointer.atm                           rpointer.ice  rpointer.rof
$ ls output000/access-om3.mom6.*
output000/access-om3.mom6.h.native_1900_01.nc  output000/access-om3.mom6.h.static.nc     output000/access-om3.mom6.r.1900-02-01-00000_1.nc  output000/access-om3.mom6.r.1900-02-01-00000_3.nc
output000/access-om3.mom6.h.sfc_1900_01.nc     output000/access-om3.mom6.h.z_1900_01.nc  output000/access-om3.mom6.r.1900-02-01-00000_2.nc  output000/access-om3.mom6.r.1900-02-01-00000_4.nc

Note how restart files 1 ... 4 are in the output folder, not the restart folder.

Is it possible to configure MOM6 to use netcdf4? If not, I guess a payu update is needed?

P.S. I tested this: the model starts from the restart if I manually move the four extra restart files (_1 to _4) into the restart000 directory and then run the model.
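The manual workaround can be scripted. This is a rough sketch under stated assumptions, not a payu feature: `move_split_restarts` and the `SPLIT_RESTART` pattern are hypothetical names, and the pattern assumes the extra files always look like `<case>.mom6.r.<date>_<N>.nc`, as in the listing above.

```python
import re
import shutil
from pathlib import Path

# Assumed naming of the extra MOM6 restart files: <case>.mom6.r.<date>_<N>.nc
# History files (".mom6.h.") and the unsuffixed restart file do not match.
SPLIT_RESTART = re.compile(r"^.+\.mom6\.r\..+_\d+\.nc$")


def move_split_restarts(output_dir, restart_dir):
    """Move stray split restart files from output_dir into restart_dir.

    Returns the names of the files that were moved, in sorted order.
    """
    output_dir, restart_dir = Path(output_dir), Path(restart_dir)
    moved = []
    for path in sorted(output_dir.iterdir()):
        if path.is_file() and SPLIT_RESTART.match(path.name):
            shutil.move(str(path), str(restart_dir / path.name))
            moved.append(path.name)
    return moved
```

For the run shown above this would be called as `move_split_restarts("output000", "restart000")` from the archive directory before resubmitting.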

anton-seaice commented 8 months ago

I guess this line should allow multiple lines in the pointer file and iterate over them:

https://github.com/payu-org/payu/blob/421431b52c24dd57fa0ad023d46b739667133af7/payu/models/cesm_cmeps.py#L220
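Iterating over the pointer file might look something like the sketch below. It is an illustration only and does not reproduce the code in `cesm_cmeps.py`: `read_pointer_file` and `missing_restarts` are hypothetical helpers, and the sketch assumes the pointer file holds one restart filename per line, as in the `rpointer.ocn` shown earlier.

```python
from pathlib import Path


def read_pointer_file(pointer_path):
    """Return every non-empty line of a pointer file as a restart filename."""
    with open(pointer_path) as f:
        return [line.strip() for line in f if line.strip()]


def missing_restarts(pointer_path, restart_dir):
    """List the files named in the pointer file that are absent from restart_dir."""
    restart_dir = Path(restart_dir)
    return [
        name
        for name in read_pointer_file(pointer_path)
        if not (restart_dir / name).exists()
    ]
```

A check like `missing_restarts("restart000/rpointer.ocn", "restart000")` would have flagged the four `_1` to `_4` files in the listing above, since only the first file named in the pointer was archived.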

dougiesquire commented 8 months ago

Whoops, I didn't know this happened and so didn't account for multiple restart files. I can fix this up.

anton-seaice commented 8 months ago

Thanks Dougie :)

minghangli-uni commented 8 months ago

> P.S. I tested this: the model starts from the restart if I manually move the four extra restart files (_1 to _4) into the restart000 directory and then run the model.

I can see that MOM can read the restart files after moving the extra ones to the restart dir:

NOTE from PE     0: MOM_restart: MOM run restarted using : ./GMOM_JRA.mom6.r.1900-01-02-00000.nc
NOTE from PE     0: MOM_restart: MOM run restarted using : ./GMOM_JRA.mom6.r.1900-01-02-00000_1.nc
NOTE from PE     0: MOM_restart: MOM run restarted using : ./GMOM_JRA.mom6.r.1900-01-02-00000_2.nc

But I received errors in access-om3.err:

get_stripe failed: 61 (No data available)
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832

@anton-seaice Can you please confirm you don't have such errors?

anton-seaice commented 8 months ago

Hi Minghang. That is a bug in openmpi which prevents it doing a parallel read of files referenced through symlinks. CICE is trying to do a parallel read of ./GMOM_JRA.cice.r.*

We put a patch in the MOM6-CICE6 config (https://github.com/COSIMA/MOM6-CICE6/pull/24) whilst waiting for the openmpi 4.1.7 release which will fix this.

You just need to check that the paths in setup_cice_restarts.sh are correct and your config.yaml is still calling it (https://github.com/COSIMA/MOM6-CICE6/blob/c2585c7ddcad8c56d44026835cfd62c2800b645f/config.yaml#L33)

minghangli-uni commented 8 months ago

Fixed by substituting access-om3 with GMOM_JRA in setup_cice_restarts.sh.

dougiesquire commented 8 months ago

> Fixed by substituting access-om3 with GMOM_JRA in setup_cice_restarts.sh.

@minghangli-uni it sounds like you may need to bring your configuration up to date with what's on GitHub.