payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0

Collating regional outputs for core counts exceeding 10,000 #309

Closed AndyHoggANU closed 2 years ago

AndyHoggANU commented 2 years ago

I have an ACCESS-OM2-01 simulation where I am trying to save some regional diagnostics. The simulation uses Andrew’s 10461 core count for MOM, meaning that the regional diagnostics routine (which writes out one netCDF tile per core) now has 6 digits in the filename after the .nc, like rregionocean-2d30m-vorticity_z-3-hourly-mean-ym_2012_01.nc.010431.

It seems that payu doesn't ask mppnccombine to collate these files, likely because of this: https://github.com/payu-org/payu/blob/9348acdf92ca18aae229fc06b0b716d4cd85e1aa/payu/models/fms.py#L65-L66

Is there a nice way to generalise this bit of code?

AndyHoggANU commented 2 years ago

PS. This is using the old mppnccombine, because we think mppnccombine-fast doesn't cope well with the regional outputs: it misses the coordinates of masked tiles.

aekiss commented 2 years ago

I thought the issue was that the output of mppnccombine-fast would have one chunk per core. But maybe we could re-chunk it to make it useable.

aekiss commented 2 years ago

I'd suggest changing that payu code to something like

 tile_fnames = [f for f in glob(os.path.join(dir, '*.nc.*')) 
                if f.split('.')[-1].isdigit() and f.split('.')[-2] == 'nc'] 

This would also need: from glob import glob
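For anyone wanting to try the suggestion outside payu, here is a minimal self-contained sketch. The function name find_tile_files and the demo file names are hypothetical; this is not the code that was merged into payu, just the filter above wrapped so it can be run against a scratch directory.

```python
import os
import tempfile
from glob import glob


def find_tile_files(directory):
    """Return sorted paths of uncollated tiles named like '<name>.nc.<digits>'."""
    tiles = []
    for path in glob(os.path.join(directory, '*.nc.*')):
        parts = os.path.basename(path).split('.')
        # Accept any number of digits after '.nc', so 4- and 6-digit
        # tile suffixes are both matched.
        if parts[-1].isdigit() and parts[-2] == 'nc':
            tiles.append(path)
    return sorted(tiles)


def demo():
    """Create a few empty files in a temp directory and list the tiles found."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in ('ocean.nc.0000', 'ocean.nc.010431',
                     'ocean.nc', 'ocean.nc.tmp'):
            open(os.path.join(tmp, name), 'w').close()
        return [os.path.basename(p) for p in find_tile_files(tmp)]
```

Calling demo() should pick up only the two files that end in digits after .nc, regardless of how many digits there are.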

aidanheerdegen commented 2 years ago

There are sorting issues as well once the zero-padding runs out. I decided to use pathlib because it is cleaner. Created some tests to check it is doing the right thing too.

aidanheerdegen commented 2 years ago

Clearly I need to read more carefully. Those suffixes are zero-padded.
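To illustrate the sorting point: a plain lexical sort only matches tile order while the suffixes are zero-padded to the same width. The sketch below uses made-up file names and a hypothetical helper, tile_number, to show the difference; sorting on the integer value of the suffix works either way.

```python
def tile_number(fname):
    """Integer tile index taken from the digits after the final dot."""
    return int(fname.rsplit('.', 1)[1])


# Unpadded suffixes: lexical order differs from numeric order.
unpadded = ['out.nc.9999', 'out.nc.10000', 'out.nc.2']
lexical = sorted(unpadded)
numeric = sorted(unpadded, key=tile_number)

# Zero-padded suffixes of equal width: both orderings agree.
padded = ['out.nc.010000', 'out.nc.009999', 'out.nc.000002']
padded_lexical = sorted(padded)
padded_numeric = sorted(padded, key=tile_number)
```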

aidanheerdegen commented 2 years ago

I have pushed a new tag (1.0.22), which should show up in conda/analysis3-unstable in 30-40 minutes, all things being equal.

aidanheerdegen commented 2 years ago

ping @AndyHoggANU

AndyHoggANU commented 2 years ago

Ping yourself ... I gave this a try. Am using conda/analysis3-unstable but I find:

Maybe it just hasn't gone through yet?

aidanheerdegen commented 2 years ago

Yeah, the conda update errored, still the previous version.

$ conda list payu
# packages in environment at /g/data3/hh5/public/apps/miniconda3/envs/analysis3-21.07:
#
# Name                    Version                   Build  Channel
payu                      1.0.21                     py_0    coecms

aidanheerdegen commented 2 years ago

Sorry, the conda install is broken. I tried a quick fix, but didn't work. Will have to wait until Monday I am afraid.

aidanheerdegen commented 2 years ago

Or you can load the conda/python3 environment then try pip install --user directly from GitHub, or clone payu and pip install . --user and then use ~/.local/bin/payu

aidanheerdegen commented 2 years ago

@AndyHoggANU Fixed the conda update, give it a crack

AndyHoggANU commented 2 years ago

Didn't realise it was Monday already. ;-)

Anyway, I tried this -- conda has indeed updated but I get this error:

[amh157@gadi-login-02 01deg_jra55v140_iaf_cycle3_HF]$ more 01deg_jra55_i_c.e27781306
Traceback (most recent call last):
  File "/g/data3/hh5/public/apps/miniconda3/envs/analysis3-21.07/bin/payu-collate", line 10, in <module>
    sys.exit(runscript())
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/payu/subcommands/collate_cmd.py", line 111, in runscript
    expt.collate()
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/payu/experiment.py", line 814, in collate
    model.collate()
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/payu/models/fms.py", line 143, in collate
    fnames = Fms.get_uncollated_files(self.output_path)
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/payu/models/fms.py", line 66, in get_uncollated_files
    tile_fnames = [f for f in Path(dir).iterdir()
  File "/g/data/hh5/public/apps/miniconda3/envs/analysis3-21.07/lib/python3.9/site-packages/payu/models/fms.py", line 68, in <listcomp>
    f.suffixes[1][1:].isdigit()]
IndexError: list index out of range

Not sure which list index it is referring to.
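On the question of which index: the only list index on that traceback line is f.suffixes[1] (the [1:] is a slice, which cannot raise IndexError). Path.suffixes has fewer than two entries for any file in the output directory with zero or one suffix, such as an already collated .nc file. The sketch below reproduces that and shows one way to guard against it; is_tile is a hypothetical helper written for illustration, not the fix that went into payu.

```python
from pathlib import PurePath


def is_tile(path):
    """True for names like '<name>.nc.<digits>' (hypothetical helper)."""
    suffixes = PurePath(path).suffixes
    # Check the length first: a name with fewer than two suffixes
    # has no second-to-last entry to inspect.
    return (len(suffixes) >= 2
            and suffixes[-2] == '.nc'
            and suffixes[-1][1:].isdigit())


# Reproduce the failure mode: a single-suffix file has one entry in .suffixes,
# so indexing position 1 raises IndexError.
single = PurePath('ocean.nc').suffixes
try:
    single[1]
    raised = False
except IndexError:
    raised = True
```

Indexing from the end ([-2] and [-1]) also keeps the check correct if a file name happens to contain extra dots before the .nc part.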

aidanheerdegen commented 2 years ago

@AndyHoggANU can you try again

AndyHoggANU commented 2 years ago

Yep, trying now -- will keep you posted.

AndyHoggANU commented 2 years ago

BTW, it appears to be working ... but very slowly. I think this is characteristic of combining from regional diagnostics, so will just let it run its course.

aidanheerdegen commented 2 years ago

It may be faster to use mppnccombine-fast with an option which will force it to recompress the data, which overcomes the chunking issue. e.g. -d 4