Closed ezhilsabareesh8 closed 7 months ago
@ezhilsabareesh8 I did a quick check using payu 1.1 but couldn't reproduce your error. Starting fresh with a clean clone of ryf might resolve the issue.
Below is what I received in my access-om3.out:
NOTE from PE 0: MOM_restart: MOM run restarted using : ./access-om3.mom6.r.1900-02-01-00000.nc
@minghangli-uni hits this bug only for 0.25 deg, not 1 deg: https://github.com/COSIMA/access-om3/issues/101#issuecomment-2019281472
Could it be a MOM6 configuration problem in 0.25 deg? Here's a comparison between 1deg and 0.25deg: https://github.com/COSIMA/MOM6-CICE6/compare/1deg_jra55do_ryf...025deg_jra55do_ryf_iss101
Could the RESTART_CONTROL difference be relevant? https://github.com/COSIMA/MOM6-CICE6/compare/1deg_jra55do_ryf...025deg_jra55do_ryf_iss101#diff-bf0915852240640bb6bc6b27a0d786446acb8f242710b1757994086c2e8b91ba
Sounds like something to add to #421 if that is something you need to always be set to a particular value.
In this configuration, MOM is producing 5 restart files:
$ cat rpointer.ocn
access-om3.mom6.r.1900-02-01-00000.nc
access-om3.mom6.r.1900-02-01-00000_1.nc
access-om3.mom6.r.1900-02-01-00000_2.nc
access-om3.mom6.r.1900-02-01-00000_3.nc
access-om3.mom6.r.1900-02-01-00000_4.nc
They are formatted 64-bit offset and have size 3.6GB. I think the maximum size for netcdf 64-bit-offset is 3.6GB, which might be why there are 5 files. (It looks like FMS configs produce multiple restart files too, just they are labelled differently).
However, payu (I guess) is not moving the files correctly after a run:
$ ls restart000/
access-om3.cice.r.1900-02-01-00000.nc access-om3.datm.r.1900-02-01-00000.nc access-om3.mom6.r.1900-02-01-00000.nc rpointer.cpl rpointer.ocn
access-om3.cpl.r.1900-02-01-00000.nc access-om3.drof.r.1900-02-01-00000.nc rpointer.atm rpointer.ice rpointer.rof
$ ls output000/access-om3.mom6.*
output000/access-om3.mom6.h.native_1900_01.nc output000/access-om3.mom6.h.static.nc output000/access-om3.mom6.r.1900-02-01-00000_1.nc output000/access-om3.mom6.r.1900-02-01-00000_3.nc
output000/access-om3.mom6.h.sfc_1900_01.nc output000/access-om3.mom6.h.z_1900_01.nc output000/access-om3.mom6.r.1900-02-01-00000_2.nc output000/access-om3.mom6.r.1900-02-01-00000_4.nc
Note how restart files 1 ... 4 are in the output folder, not the restart folder.
Is it possible to configure MOM6 to use netcdf4? If not, I guess a payu update is needed?
p.s. I tested this, and the model starts from the restart if I manually move the four extra restart files (_1 to _4) into the restart000 directory and then run the model.
I guess this line should allow multiple lines in the pointer file and iterate over them:
Whoops, I didn't know this happened and so didn't account for multiple restart files. I can fix up
Thanks Dougie :)
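The idea discussed above, reading every line of the pointer file instead of only the first, could be sketched roughly as follows. This is a minimal illustration and not payu's actual code: the function names `read_pointer_file` and `move_restarts`, and the assumption that the pointer file sits next to the restart files in a single work directory, are hypothetical.

```python
import shutil
from pathlib import Path


def read_pointer_file(pointer_path):
    """Return every non-empty line of a pointer file as a restart filename."""
    with open(pointer_path) as f:
        return [line.strip() for line in f if line.strip()]


def move_restarts(work_dir, restart_dir, pointer_name="rpointer.ocn"):
    """Move every restart file listed in the pointer file into restart_dir.

    Hypothetical helper: raises if a listed file is missing, so a partial
    set of restart files is noticed rather than silently accepted.
    """
    work_dir, restart_dir = Path(work_dir), Path(restart_dir)
    restart_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in read_pointer_file(work_dir / pointer_name):
        src = work_dir / name
        if not src.exists():
            raise FileNotFoundError(
                f"{src} is listed in {pointer_name} but does not exist"
            )
        shutil.move(str(src), str(restart_dir / name))
        moved.append(name)
    return moved
```

With the `rpointer.ocn` shown earlier, a call such as `move_restarts("work", "restart000")` would move all five `access-om3.mom6.r.1900-02-01-00000*.nc` files, not only the first.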
I can see MOM can read the restart files after moving the extra ones to the restart dir:
NOTE from PE 0: MOM_restart: MOM run restarted using : ./GMOM_JRA.mom6.r.1900-01-02-00000.nc
NOTE from PE 0: MOM_restart: MOM run restarted using : ./GMOM_JRA.mom6.r.1900-01-02-00000_1.nc
NOTE from PE 0: MOM_restart: MOM run restarted using : ./GMOM_JRA.mom6.r.1900-01-02-00000_2.nc
But I received errors in access-om3.err:
get_stripe failed: 61 (No data available)
Abort with message NetCDF: Error initializing for parallel access in file /jobfs/98914803.gadi-pbs/mo1833/spack-stage/spack-stage-parallelio-2.5.10-hyj75i7d5yy5zbqc7jm6whlkduofib2k/spack-src/src/clib/pioc_support.c at line 2832
(the same message is repeated 10 times)
@anton-seaice Can you please confirm you don't have such errors?
Hi Minghang. That is a bug in openmpi which prevents it doing a parallel read of files referenced through symlinks. CICE is trying to do a parallel read of ./GMOM_JRA.cice.r.*
We put a patch in the MOM6-CICE6 config (https://github.com/COSIMA/MOM6-CICE6/pull/24) whilst waiting for the openmpi 4.1.7 release which will fix this.
You just need to check that the paths in setup_cice_restarts.sh are correct and your config.yaml is still calling it (https://github.com/COSIMA/MOM6-CICE6/blob/c2585c7ddcad8c56d44026835cfd62c2800b645f/config.yaml#L33)
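For illustration only, one generic way to sidestep a read failure on symlinked files is to replace the link with a real copy of its target before the model starts. The sketch below is not the contents of setup_cice_restarts.sh, which is not shown in this thread; the function name `materialise_symlink` is hypothetical.

```python
import os
import shutil


def materialise_symlink(path):
    """Replace a symlink with a regular-file copy of its target.

    Returns True if a symlink was replaced, False if path was not a symlink.
    """
    if not os.path.islink(path):
        return False
    target = os.path.realpath(path)
    # Check the target first so a broken link is not deleted by mistake.
    if not os.path.isfile(target):
        raise FileNotFoundError(f"{path} points to missing file {target}")
    os.unlink(path)
    shutil.copy2(target, path)
    return True
```

Copying costs disk space and time for large restart files, so this only makes sense as a temporary workaround until the underlying library is fixed.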
Fixed by substituting access-om3 with GMOM_JRA in setup_cice_restarts.sh.
@minghangli-uni it sounds like you may need to get your configuration up to date with what's on GitHub
With the new payu version 1.1, the restart file names for MOM6 are incorrect. As a result, MOM6 looks for files with the wrong names, which leads to warnings such as:
As seen in the warning messages, the file extension .nc.nc is incorrect and seems to be duplicated, resulting in MOM6 being unable to locate the required restart files.
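A duplicated extension like this typically arises when a suffix is appended to a name that already carries it. The sketch below shows one defensive way to build and check such names; it does not claim to identify where the duplication happens in payu or MOM6, and both function names are hypothetical.

```python
def with_suffix_once(name, suffix=".nc"):
    """Append suffix only if name does not already end with it."""
    return name if name.endswith(suffix) else name + suffix


def has_duplicated_suffix(name, suffix=".nc"):
    """Report whether name ends with the suffix repeated twice."""
    return name.endswith(suffix + suffix)
```

A check like `has_duplicated_suffix` could be run over the names in a pointer file before launching the model to catch the problem early.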