payu-org / payu

A workflow management tool for numerical models on the NCI computing systems
Apache License 2.0
19 stars 26 forks source link

Support JRA55 style forcing for MOM5 #68

Closed aidanheerdegen closed 4 years ago

aidanheerdegen commented 7 years ago

JRA55 forcing data sets for MOM5-SIS configuration are split into one file per year due to their large size. Changing the forcing file requires dynamically altering the entry in the data_table file.

Entries look like this:

"ATM" , "p_surf" , "SLP" , "./INPUT/slp.nc" , "bicubic" , 1.

Can we have a discussion about the best approach for this? I'll put in some suggestions below

aidanheerdegen commented 7 years ago

My preferred approach is to use jinja2 templates. Like so

"ATM" , "p_surf" , "SLP" , "./INPUT/slp.{{year}}.30Jun2016.nc" , "bicubic" , 1.

Where the JRA55 files are named like so:

slp.1958.30Jun2016.nc
slp.1959.30Jun2016.nc
slp.1960.30Jun2016.nc
slp.1961.30Jun2016.nc
slp.1962.30Jun2016.nc
slp.1963.30Jun2016.nc
slp.1964.30Jun2016.nc
slp.1965.30Jun2016.nc
slp.1966.30Jun2016.nc

The data_table file is best described as CSV, but other than that is unstructured. Using simple templating indicates to the user what will happen clearly.

I propose supporting a number of different variables:

year month day hour

Optional variables might be

season (DJF,MAM,JJA,SON)

We could add them as required.

Jinja2 has python style formatting ability

format(value, *args, **kwargs) Apply python string formatting on an object:

{{ "%s - %s"|format("Hello?", "Foo!") }} -> Hello? - Foo!

So there is no requirement to support pre-formatted variables.

Another reason for the jinja2 approach is that this can extended to other files and models. Make jinja2 processing a default step for all input files. If there are no jinja2 strings in the file, there is no change.

The MOM6 control file is not a namelist, and so would be amenable to this sort of approach for example.

In the case of MOM5, the model will need to grab the time/date information from the time_stamp.out to populate those variables.

Hopefully all models will populate these variables and make templating generally available.

nichannah commented 7 years ago

This looks good, I like the jinja2 approach.

However, what about supporting only a single variable, e.g. {{time_stamp}} which uses the ISO standard (https://en.wikipedia.org/wiki/ISO_8601). This could be used as a time point or a duration.

Also, as always, I have a preference for doing this stuff within the model. From a semantic POV doesn't it makes sense that every timestep the model takes be associated with a datetime? So when the model (re)starts it should know what date it's at and read in data corresponding to that. i.e. it should be able to read in data with the same date as the simulation. It half does this with restarts.

marshallward commented 7 years ago

If we are dynamically computing the varables to feed into jinja, can we not just dynamically work the symbolic link, and preserve the original (non-timestamped) filename?

marshallward commented 7 years ago

Also (just to repeat from my email) what about the case when a job requires data from two input files?

aidanheerdegen commented 7 years ago

I don't understand what you mean by two input files.

marshallward commented 7 years ago

If a job is longer than a year, then it will require inputs from two files.

aidanheerdegen commented 7 years ago

Oh right, gotcha. I think that is a limit the user has to deal with. Their run cannot span a longer time than their forcing. Stiff bikkies as they say.

aidanheerdegen commented 7 years ago

The argument for using jinja2 is that it can support pretty much anything a user wants to do, things that we haven't even thought of. As it is, doing the work internally in payu will only work for file names in data_table. What if they wanted to do the same thing for field_table. If we ran jinja2 on every file, a user could programatically generate their diag_table entries using jinja2 loops. If we exposed other variables like simulation length and time step the diag_table could dynamically update to suit.

I think there is a lot of potential to allow customisations that are really useful and powerful that would be a PITA to code into payu, and still not reach that level of flexibility. That is the reason I suggested this approach for this problem. Not because you can't do it another way, but it might be the way we want to do more in the future. Or maybe not.

nichannah commented 7 years ago

Jinja2 seems like a nice way to avoid needing to parse a multitude of different file formats. I think it is relevant to use it for something like time handling or other derived config. One thing I don't like is when you end up having another config file that specifies the values of template variables. i.e. we don't want to introduce any more config files.

To further my previous point about making this a model responsibility: it's a pet peeve of mine that the models do not do any quality control of incoming data. They do not even check the units of incoming variables. This results is so much wasted time and broken work. The time variable is just the same as any other here. The model should know how to read and write data with associated times and should check that these are consistent. There should be no way that the user can give Jan data to a model which is simulating Dec. If they must do this then they should change filenames and netcdf metadata.

marshallward commented 7 years ago

I'm struggling to find a case where jinja will add value. Dynamic filepaths can be sorted with symlinks, there is no need to generate filenames here. If we're just talking about moving values from, say, config.yaml into a target input file, then why not just edit the input file?

I can't pick out anything in, say field_table that would be dynamically determined. Maybe you could make output rates in diag_table to depend on runtime? Personally I'd rather just make it explicit. The one case where I could be convinced is the OASIS namcouple, though we already parse and patch those.

BTW this Jinja patching of config files is exactly what Cylc does, and what it often leads to is poorly thought out Jinja markup that creates multiple internal conflicts that have to be resolved by someone else. It also introduces another config file. I would prefer not to lose that control.

I'd also be strongly against marking up namelists or any other file with a well-defined structure and a means to manipulate within Python. Jinja2 has its own rules, which have only been designed to avoid HTML clashes, and those rules could clash with pre-existing format rules.

But I don't need to like every idea, and the worst it does is add a dependency. Given that we already depend on YAML, that's probably no big deal.

aidanheerdegen commented 7 years ago

Wow. Hard-arse (regarding Nic's time consistency), but I think I agree. With the proviso that perhaps you can turn off those checks with a config option rather than change (large) physical files. Caveat physicus.

aidanheerdegen commented 7 years ago

I can see your point Marshall. How would you signal to payu that the file name needs to altered? Use something like the jinja2 syntax?

"ATM" , "p_surf" , "SLP" , "./INPUT/slp.{{year}}.30Jun2016.nc" , "bicubic" , 1.

Or strftime style formats?

"ATM" , "p_surf" , "SLP" , "./INPUT/slp.%Y.30Jun2016.nc" , "bicubic" , 1.

nichannah commented 7 years ago

What I'm proposing addresses the comments above. The data_table could specify an input directory and a CMOR standard variable name (actually the input dir is not needed this is hard coded as INPUT in MOM). It should not need to specify files. At every forcing step the model constructs the input filename based on the variable name and the current ISO timestamp. It may be necessary to include time periods in the filename to support this. Once it reads the file in it can do a further check of the CF convention time variable within the netcdf.

Similarly I think that all model outputs (including restarts) should have an ISO timestamp in the name which matches what is in the netcdf. CICE does this and it seems nice.

marshallward commented 7 years ago

I guess I would do something like this:

I guess that would add model-specific code into the driver, and using Jinja could potentially avoid that.

I am just very burned out on this Jinja patching thing from Cylc, it gave me so many headaches, and I would not want to introduce this lightly. Part of me even thinks that if this is what we need then we should just switch to Cylc.

aidanheerdegen commented 7 years ago

What Nic is suggesting is less flexibility, not more. Which is fair enough, and in many respects is a good approach, and one informed by years of battling with bullshit input files etc. I would say though, that this places a larger up front burden on the person running the model to get their inputs into the required format. That may be effort that is well spent, but it is effort nonetheless.

In practical terms Nic's suggestion it is likely to mean people make a forest of symbolic links with the correct names to their original data.

The jinja2/cylc approach is way more flexible, but potentially wasting more time with poorly constructed config files. On the other hand, diag_table entries that are constructed programmatically by jinja2 mean adding a variable just means adding a name to a list in the file itself, rather than copying a line and altering it, which can itself be buggy.

nichannah commented 7 years ago

I feel like the only down-side to my proposal is that coding Python is fun and coding Fortran sucks. If our model was written in Python then it would be a no-brainer. I can't help thinking that we're introducing hacks just because we don't want to touch the Fortran.

Good point about the sym links Marshall. People probably would do this. But we can't stop people from doing crazy stuff. And in the long run all we have to hope that the modelling community starts to do stuff in a more consistent/standardised way that reduces the amount of hacks needed, i.e. using and checking metadata.

marshallward commented 7 years ago

Yes, I am definitely arguing for less flexibility. I also like that currently all input files are valid files, rather than proto-templates.

There are many pros and cons here, I don't want to pretend to know which will work out for the best. I am OK with you pursuing this idea. I'd only request that we use it sparingly for now.

Not to get sidetracked here, but I don't know why we dont just bundle all these JRA files into a single input. Is there a real performance loss here?

marshallward commented 7 years ago

Nic, I would say it's not just a fear of working in Fortran, but also because we are sometimes working with models over which we do not control the source code.

aidanheerdegen commented 7 years ago

I'm not sold on your approach Marshall. The information about what the "real" filename should be is not contained anywhere else currently. So it would need to be magically inferred or as you say, made into yet another config.yaml option.

There are potentially many file names to alter, here is an example data_table from the om3_core3 test:

"ATM",      "cfc_11_flux_pcair_atm", "CFC_11",       "INPUT/cfc.bc.nc", .false., 1.0e-12
"ATM",      "cfc_12_flux_pcair_atm", "CFC_12",       "INPUT/cfc.bc.nc", .false., 1.0e-12
# gridname, fieldname_code,                 fieldname_file, file_name,                              ongrid, factor
"ATM",      "c14o2_flux_abiotic_pcair_atm", "CO2",          "INPUT/ocmip2_siple_co2_atm_am2_bc-1-9999.nc", .true., 1.0e-06
"ATM",      "co2_flux_abiotic_pcair_atm",   "CO2",          "INPUT/ocmip2_siple_co2_atm_am2_bc-1-9999.nc", .true., 1.0e-06
"ATM" , "p_surf"        , "SLP" , "./INPUT/slp_.nc"                     , .false. ,  1.0
"ATM" , "p_bot"         , "SLP" , "./INPUT/slp_.nc"                     , .false. ,  1.0
"ATM" , "t_bot"         , "T_10_MOD"    , "./INPUT/t_10_mod.clim.nc"     , .false. ,  1.0
"ATM" , "sphum_bot"     , "Q_10_MOD"    , "./INPUT/q_10_mod.clim.nc"     , .false. ,  1.0
"ATM" , "u_bot"         , "U_10_MOD"    , "./INPUT/u_10_mod.clim.nc"     , .false. ,  1.0
"ATM" , "v_bot"         , "V_10_MOD"    , "./INPUT/v_10_mod.clim.nc"     , .false. ,  1.0
"ATM" , "z_bot"         , ""            , ""                             , .false. , 10.0
"ATM" , "gust"          , ""            , ""                             , .false. ,  0.0
"ICE" , "lw_flux_dn"    , "LWDN_MOD"    , "./INPUT/ncar_rad_clim.nc"     , .false. ,  1.0
#"ICE" , "sw_flux_dn"    , "SWDN_MOD"    , "./INPUT/ncar_rad_clim.nc"     , .false. ,  1.0
"ICE" , "sw_flux_vis_dir_dn", "SWDN_MOD", "./INPUT/ncar_rad_clim.nc", .false. , 0.285
"ICE" , "sw_flux_vis_dif_dn", "SWDN_MOD", "./INPUT/ncar_rad_clim.nc", .false. , 0.285
"ICE" , "sw_flux_nir_dir_dn", "SWDN_MOD", "./INPUT/ncar_rad_clim.nc", .false. , 0.215
"ICE" , "sw_flux_nir_dif_dn", "SWDN_MOD", "./INPUT/ncar_rad_clim.nc", .false. , 0.215
"ICE" , "lprec"         , "RAIN"        , "./INPUT/ncar_precip_clim.nc"  , .false. ,  1.0
"ICE" , "fprec"         , "SNOW"        , "./INPUT/ncar_precip_clim.nc"  , .false. ,  1.0
"ICE" , "runoff"        , "RUNOFF"      , "./INPUT/RUNOFF.nc"            , .true.  ,  1.0
#"ICE" , "runoff"       , ""            , ""                             , .true.  ,  0.0
"ICE" , "calving"       , ""            , ""                             , .true.  ,  0.0
"ICE" , "dhdt"          , ""            , ""                             , .true.  , 40.0
"ICE" , "dedt"          , ""            , ""                             , .true.  ,  2.0e-6
"ICE" , "drdt"          , ""            , ""                             , .true.  , 10.0
"ICE", "sic_obs"        , "SIC"         , "./INPUT/sst_ice_clim.nc"      , .false. ,  0.0
"ICE", "sit_obs"        , "SIT"         , "./INPUT/sst_ice_clim.nc"      , .false. ,  0.0
"ICE", "sst_obs"        , "SST"         , "./INPUT/sst_ice_clim.nc"      , .false. ,  1.0

We could have a dict like structure in the config.yaml with mappings, but eeuuuuwwww.

marshallward commented 7 years ago

Could we not parse data_table to get the list of times files, input.nml and *.res to get the upcoming year, and then read the JRA directory to get the true filename?

Or are you saying that it would work but it's a bad approach?

aidanheerdegen commented 7 years ago

Bundling the files is a possibility, but there are three downsides: they are bloody enormous, 131G per field, and it would mean have two copies, as others would still like them in their raw state, and every time they are updated their would have to be manual intervention to add he latest year to the aggregated dataset.

Ideally netCDF4 would support ncML and we could use a virtual dataset to aggregate them, but it doesn't.

marshallward commented 7 years ago

Well, we would not read in the entire field...

marshallward commented 7 years ago

but your other points are valid, I was not aware that JRA would be updated.

aidanheerdegen commented 7 years ago

Could we not parse data_table to get the list of times, input.nml and *.res to get the upcoming year, and then read the JRA directory to get the true filename?

I don't understand. How do we get the list of times from data_table?

nichannah commented 7 years ago

A similar approach to what I'm suggesting (which I think is the path we'll go down with MATM) is that we build support for JRA55 into MOM5.

We just move this issue from payu to the mom project.

marshallward commented 7 years ago

It was a typo, sorry, I've fixed it

aidanheerdegen commented 7 years ago

A similar approach to what I'm suggesting (which I think is the path we'll go down with MATM) is that we build support for JRA55 into MOM5.

Really? Not sure I'm a fan TBH.

ScottWales commented 7 years ago

If templating is an issue, maybe supply the input files as a glob - INPUT/slp.nc: $JRA55/slp.*.nc. Payu knows what the model date is, it can search through the listed files, find the one appropriate for this run, then set up a symlink to the input directory.

This avoids being too specific on the input data set, and errors can be raised if no file matches the dates of the current run.

nichannah commented 7 years ago

Really? Not sure I'm a fan TBH.

The reason we're having this conversation is that people haven't adopted standards for forcing dataset publication. But this is happening now because it's the only efficient way for models, analyses, data etc. to interoperate.

aidanheerdegen commented 7 years ago

The reason we're having this conversation is that people haven't adopted standards for forcing dataset publication. But this is happening now because it's the only efficient way for models, analyses, data etc. to interoperate.

Good point. So rather than add code to support JRA55, add code to support standards compliant forcing datasets? I'd support that.

marshallward commented 7 years ago

@ScottWales I think you'd still have to consider that a template, since MOM would have no idea what to do with the glob. Maybe it is a bit safer though, since the context is very clearly a path.

@aidanheerdegen I am not necessarily opposted to the idea of using Jinja. I am just not a fan and would prefer to continue manually working with particular config files. But if you want to do it, then I'll support you.

aidanheerdegen commented 7 years ago

@nicjhan So, first step, define a standards compliant forcing dataset?

nichannah commented 7 years ago

@aidanheerdegen yes. Let's use JRA55 as a starting place. :)

aidanheerdegen commented 7 years ago

If templating is an issue, maybe supply the input files as a glob - INPUT/slp.nc: $JRA55/slp.*.nc. Payu knows what the model date is, it can search through the listed files, find the one appropriate for this run, then set up a symlink to the input directory.

This is an interesting idea. Finding the correct file is maybe not quite as easy as it seems. On first blush loading them all up with an MFdataset call might be ok, but the issues with handling non-standard out of range dates in netCDF4 python are well known.

aidanheerdegen commented 7 years ago

Let's use JRA55 as a starting place.

Except that JRA55 is not using CMOR compatible variable names, slp should be psl. Or are you saying start with JRA55 and make it conform to CMOR?

aidanheerdegen commented 7 years ago

Another option is to add the time string parsing that MOM currently supports for diag_table to data_table. It is rudimentary, and is specified here:

https://github.com/mom-ocean/MOM5/blob/43986e20236cc74b83cd5ee09ebd798f4da40824/src/shared/diag_manager/diag_table.F90#L49

The function that implements it is here:

https://github.com/mom-ocean/MOM5/blob/43986e20236cc74b83cd5ee09ebd798f4da40824/src/shared/diag_manager/diag_util.F90#L1972

marshallward commented 7 years ago

I like this idea. We can also bounce it back to GFDL.

aidanheerdegen commented 7 years ago

I have opened a MOM issue for this

https://github.com/mom-ocean/MOM5/issues/170

aekiss commented 5 years ago

Can we close this issue? Has libaccessom2 addressed all these concerns for ACCESS-OM2? (and is ACCESS-OM2 this the only model setup in which JRA55 is used?)