My preferred approach is to use jinja2 templates, like so:
"ATM" , "p_surf" , "SLP" , "./INPUT/slp.{{year}}.30Jun2016.nc" , "bicubic" , 1.
Where the JRA55 files are named like so:
slp.1958.30Jun2016.nc
slp.1959.30Jun2016.nc
slp.1960.30Jun2016.nc
slp.1961.30Jun2016.nc
slp.1962.30Jun2016.nc
slp.1963.30Jun2016.nc
slp.1964.30Jun2016.nc
slp.1965.30Jun2016.nc
slp.1966.30Jun2016.nc
The data_table file is best described as CSV, but other than that it is unstructured. Simple templating indicates clearly to the user what will happen.
I propose supporting a number of different variables: year, month, day, hour.
Optional variables might be: season (DJF, MAM, JJA, SON).
We could add them as required.
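As a sketch of how this could look (nothing here is payu's actual interface; the variable names are just the proposal above):

```python
# Sketch only: render a data_table entry containing jinja2 markup with
# the proposed variables (year, month, day, hour) supplied by payu.
import jinja2

entry = ('"ATM" , "p_surf" , "SLP" , '
         '"./INPUT/slp.{{year}}.30Jun2016.nc" , "bicubic" , 1.')
print(jinja2.Template(entry).render(year=1958, month=6, day=30, hour=0))
# "ATM" , "p_surf" , "SLP" , "./INPUT/slp.1958.30Jun2016.nc" , "bicubic" , 1.
```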
Jinja2 has Python-style formatting ability via its format filter: format(value, *args, **kwargs) applies Python string formatting to an object:
{{ "%s - %s"|format("Hello?", "Foo!") }}
-> Hello? - Foo!
So there is no requirement to support pre-formatted variables.
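For example, a zero-padded month needs nothing beyond the standard filter (the filename pattern here is invented for illustration):

```python
# Sketch: zero-pad a month inside the template using only the built-in
# format filter, so payu need not supply pre-formatted variables.
import jinja2

tmpl = jinja2.Template('./INPUT/precip.{{ "%04d%02d"|format(year, month) }}.nc')
print(tmpl.render(year=1958, month=3))  # ./INPUT/precip.195803.nc
```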
Another reason for the jinja2 approach is that it can be extended to other files and models: make jinja2 processing a default step for all input files. If there are no jinja2 strings in a file, there is no change.
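A minimal sketch of such a default pass, assuming a payu-side helper (the function name is hypothetical):

```python
# Sketch (hypothetical helper): render every input file through jinja2
# in place. A file with no jinja2 markup renders to itself unchanged.
import jinja2

def render_in_place(path, **context):
    with open(path) as f:
        text = f.read()
    with open(path, 'w') as f:
        f.write(jinja2.Template(text).render(**context))
```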
The MOM6 control file, for example, is not a namelist, and so would be amenable to this sort of approach.
In the case of MOM5, the model will need to grab the time/date information from the time_stamp.out file to populate those variables.
Hopefully all models will populate these variables and make templating generally available.
This looks good, I like the jinja2 approach.
However, what about supporting only a single variable, e.g. {{time_stamp}} which uses the ISO standard (https://en.wikipedia.org/wiki/ISO_8601). This could be used as a time point or a duration.
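For illustration (a sketch only, not an agreed interface), the value could come straight from Python's datetime:

```python
# Sketch: a single ISO 8601 time_stamp variable. A duration would also
# have an ISO 8601 form (e.g. 'P1Y'), though datetime does not emit one.
import datetime

time_stamp = datetime.datetime(1958, 1, 1).isoformat()  # '1958-01-01T00:00:00'
```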
Also, as always, I have a preference for doing this stuff within the model. From a semantic POV, doesn't it make sense that every timestep the model takes be associated with a datetime? So when the model (re)starts it should know what date it's at and read in data corresponding to that, i.e. it should be able to read in data with the same date as the simulation. It half does this with restarts.
If we are dynamically computing the variables to feed into jinja, can we not just dynamically create the symbolic link, and preserve the original (non-timestamped) filename?
Also (just to repeat from my email) what about the case when a job requires data from two input files?
I don't understand what you mean by two input files.
If a job is longer than a year, then it will require inputs from two files.
Oh right, gotcha. I think that is a limit the user has to deal with. Their run cannot span a longer time than their forcing. Stiff bikkies as they say.
The argument for using jinja2 is that it can support pretty much anything a user wants to do, things that we haven't even thought of. As it is, doing the work internally in payu will only work for file names in data_table. What if they wanted to do the same thing for field_table? If we ran jinja2 on every file, a user could programmatically generate their diag_table entries using jinja2 loops. If we exposed other variables, like simulation length and time step, the diag_table could dynamically update to suit.
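To illustrate the diag_table point (the field lines below follow the FMS diag_table layout, but the field names and file are invented):

```python
# Sketch: generate diag_table field lines with a jinja2 loop. Adding a
# variable then just means appending a name to the 'fields' list.
import jinja2

tmpl = jinja2.Template(
    '{% for f in fields %}'
    '"ocean_model","{{f}}","{{f}}","ocean_month","all",.true.,"none",2\n'
    '{% endfor %}'
)
print(tmpl.render(fields=['temp', 'salt', 'u', 'v']))
```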
I think there is a lot of potential to allow customisations that are really useful and powerful that would be a PITA to code into payu, and still not reach that level of flexibility. That is the reason I suggested this approach for this problem. Not because you can't do it another way, but it might be the way we want to do more in the future. Or maybe not.
Jinja2 seems like a nice way to avoid needing to parse a multitude of different file formats. I think it is relevant to use it for something like time handling or other derived config. One thing I don't like is when you end up having another config file that specifies the values of template variables. i.e. we don't want to introduce any more config files.
To further my previous point about making this a model responsibility: it's a pet peeve of mine that the models do not do any quality control of incoming data. They do not even check the units of incoming variables. This results in so much wasted time and broken work. The time variable is just the same as any other here. The model should know how to read and write data with associated times and should check that these are consistent. There should be no way that the user can give Jan data to a model which is simulating Dec. If they must do this then they should change the filenames and netcdf metadata.
I'm struggling to find a case where jinja will add value. Dynamic filepaths can be sorted with symlinks, there is no need to generate filenames here. If we're just talking about moving values from, say, config.yaml into a target input file, then why not just edit the input file?
I can't pick out anything in, say, field_table that would be dynamically determined. Maybe you could make output rates in diag_table depend on runtime? Personally I'd rather just make it explicit. The one case where I could be convinced is the OASIS namcouple, though we already parse and patch those.
BTW this Jinja patching of config files is exactly what Cylc does, and what it often leads to is poorly thought out Jinja markup that creates multiple internal conflicts that have to be resolved by someone else. It also introduces another config file. I would prefer not to lose that control.
I'd also be strongly against marking up namelists or any other file with a well-defined structure and a means to manipulate within Python. Jinja2 has its own rules, which have only been designed to avoid HTML clashes, and those rules could clash with pre-existing format rules.
But I don't need to like every idea, and the worst it does is add a dependency. Given that we already depend on YAML, that's probably no big deal.
Wow. Hard-arse (regarding Nic's time consistency), but I think I agree. With the proviso that perhaps you can turn off those checks with a config option rather than change (large) physical files. Caveat physicus.
I can see your point Marshall. How would you signal to payu that the file name needs to be altered? Use something like the jinja2 syntax?
"ATM" , "p_surf" , "SLP" , "./INPUT/slp.{{year}}.30Jun2016.nc" , "bicubic" , 1.
Or strftime-style formats?
"ATM" , "p_surf" , "SLP" , "./INPUT/slp.%Y.30Jun2016.nc" , "bicubic" , 1.
What I'm proposing addresses the comments above. The data_table could specify an input directory and a CMOR standard variable name (actually the input dir is not needed; it is hard-coded as INPUT in MOM). It should not need to specify files. At every forcing step the model constructs the input filename based on the variable name and the current ISO timestamp. It may be necessary to include time periods in the filename to support this. Once it reads the file in, it can do a further check of the CF convention time variable within the netcdf.
Similarly I think that all model outputs (including restarts) should have an ISO timestamp in the name which matches what is in the netcdf. CICE does this and it seems nice.
I guess I would do something like this: keep the entry as ./INPUT/slp.nc, get the current year from config.yaml (or parsed from input.nml and *.res files?) and then create a symlink from the matching timestamped file to slp.nc.
I guess that would add model-specific code into the driver, and using Jinja could potentially avoid that.
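A minimal sketch of that symlink step, assuming the year has already been determined somehow (the paths and the derivation are hypothetical):

```python
# Sketch: point the generic input name at this run's timestamped file.
import os

year = 1958  # assumption: derived from config.yaml or input.nml / *.res
target = 'slp.{}.30Jun2016.nc'.format(year)  # lives in ./INPUT
link = './INPUT/slp.nc'
if os.path.lexists(link):
    os.remove(link)
os.symlink(target, link)  # relative link, resolved within ./INPUT
```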
I am just very burned out on this Jinja patching thing from Cylc, it gave me so many headaches, and I would not want to introduce this lightly. Part of me even thinks that if this is what we need then we should just switch to Cylc.
What Nic is suggesting is less flexibility, not more. Which is fair enough, and in many respects is a good approach, and one informed by years of battling with bullshit input files etc. I would say, though, that this places a larger up-front burden on the person running the model to get their inputs into the required format. That may be effort that is well spent, but it is effort nonetheless.
In practical terms, Nic's suggestion is likely to mean people make a forest of symbolic links with the correct names pointing to their original data.
The jinja2/cylc approach is way more flexible, but potentially wastes more time on poorly constructed config files. On the other hand, with diag_table entries constructed programmatically by jinja2, adding a variable just means adding a name to a list in the file itself, rather than copying and altering a line, which can itself be buggy.
I feel like the only down-side to my proposal is that coding Python is fun and coding Fortran sucks. If our model was written in Python then it would be a no-brainer. I can't help thinking that we're introducing hacks just because we don't want to touch the Fortran.
Good point about the symlinks Marshall. People probably would do this. But we can't stop people from doing crazy stuff. And in the long run all we can do is hope that the modelling community starts to do things in a more consistent/standardised way that reduces the number of hacks needed, i.e. using and checking metadata.
Yes, I am definitely arguing for less flexibility. I also like that currently all input files are valid files, rather than proto-templates.
There are many pros and cons here, I don't want to pretend to know which will work out for the best. I am OK with you pursuing this idea. I'd only request that we use it sparingly for now.
Not to get sidetracked here, but I don't know why we don't just bundle all these JRA files into a single input. Is there a real performance loss here?
Nic, I would say it's not just a fear of working in Fortran, but also because we are sometimes working with models over which we do not control the source code.
I'm not sold on your approach Marshall. The information about what the "real" filename should be is not contained anywhere else currently. So it would need to be magically inferred or as you say, made into yet another config.yaml option.
There are potentially many file names to alter; here is an example data_table from the om3_core3 test:
"ATM", "cfc_11_flux_pcair_atm", "CFC_11", "INPUT/cfc.bc.nc", .false., 1.0e-12
"ATM", "cfc_12_flux_pcair_atm", "CFC_12", "INPUT/cfc.bc.nc", .false., 1.0e-12
# gridname, fieldname_code, fieldname_file, file_name, ongrid, factor
"ATM", "c14o2_flux_abiotic_pcair_atm", "CO2", "INPUT/ocmip2_siple_co2_atm_am2_bc-1-9999.nc", .true., 1.0e-06
"ATM", "co2_flux_abiotic_pcair_atm", "CO2", "INPUT/ocmip2_siple_co2_atm_am2_bc-1-9999.nc", .true., 1.0e-06
"ATM" , "p_surf" , "SLP" , "./INPUT/slp_.nc" , .false. , 1.0
"ATM" , "p_bot" , "SLP" , "./INPUT/slp_.nc" , .false. , 1.0
"ATM" , "t_bot" , "T_10_MOD" , "./INPUT/t_10_mod.clim.nc" , .false. , 1.0
"ATM" , "sphum_bot" , "Q_10_MOD" , "./INPUT/q_10_mod.clim.nc" , .false. , 1.0
"ATM" , "u_bot" , "U_10_MOD" , "./INPUT/u_10_mod.clim.nc" , .false. , 1.0
"ATM" , "v_bot" , "V_10_MOD" , "./INPUT/v_10_mod.clim.nc" , .false. , 1.0
"ATM" , "z_bot" , "" , "" , .false. , 10.0
"ATM" , "gust" , "" , "" , .false. , 0.0
"ICE" , "lw_flux_dn" , "LWDN_MOD" , "./INPUT/ncar_rad_clim.nc" , .false. , 1.0
#"ICE" , "sw_flux_dn" , "SWDN_MOD" , "./INPUT/ncar_rad_clim.nc" , .false. , 1.0
"ICE" , "sw_flux_vis_dir_dn", "SWDN_MOD", "./INPUT/ncar_rad_clim.nc", .false. , 0.285
"ICE" , "sw_flux_vis_dif_dn", "SWDN_MOD", "./INPUT/ncar_rad_clim.nc", .false. , 0.285
"ICE" , "sw_flux_nir_dir_dn", "SWDN_MOD", "./INPUT/ncar_rad_clim.nc", .false. , 0.215
"ICE" , "sw_flux_nir_dif_dn", "SWDN_MOD", "./INPUT/ncar_rad_clim.nc", .false. , 0.215
"ICE" , "lprec" , "RAIN" , "./INPUT/ncar_precip_clim.nc" , .false. , 1.0
"ICE" , "fprec" , "SNOW" , "./INPUT/ncar_precip_clim.nc" , .false. , 1.0
"ICE" , "runoff" , "RUNOFF" , "./INPUT/RUNOFF.nc" , .true. , 1.0
#"ICE" , "runoff" , "" , "" , .true. , 0.0
"ICE" , "calving" , "" , "" , .true. , 0.0
"ICE" , "dhdt" , "" , "" , .true. , 40.0
"ICE" , "dedt" , "" , "" , .true. , 2.0e-6
"ICE" , "drdt" , "" , "" , .true. , 10.0
"ICE", "sic_obs" , "SIC" , "./INPUT/sst_ice_clim.nc" , .false. , 0.0
"ICE", "sit_obs" , "SIT" , "./INPUT/sst_ice_clim.nc" , .false. , 0.0
"ICE", "sst_obs" , "SST" , "./INPUT/sst_ice_clim.nc" , .false. , 1.0
We could have a dict-like structure in config.yaml with the mappings, but eeuuuuwwww.
Could we not parse data_table to get the list of files, input.nml and *.res to get the upcoming year, and then read the JRA directory to get the true filename?
Or are you saying that it would work but it's a bad approach?
Bundling the files is a possibility, but there are three downsides: they are bloody enormous, 131G per field; it would mean having two copies, as others would still like them in their raw state; and every time they are updated there would have to be manual intervention to add the latest year to the aggregated dataset.
Ideally netCDF4 would support ncML and we could use a virtual dataset to aggregate them, but it doesn't.
Well, we would not read in the entire field... but your other points are valid; I was not aware that JRA would be updated.
Could we not parse data_table to get the list of times, input.nml and *.res to get the upcoming year, and then read the JRA directory to get the true filename?
I don't understand. How do we get the list of times from data_table?
A similar approach to what I'm suggesting (which I think is the path we'll go down with MATM) is that we build support for JRA55 into MOM5.
We just move this issue from payu to the mom project.
It was a typo, sorry, I've fixed it
A similar approach to what I'm suggesting (which I think is the path we'll go down with MATM) is that we build support for JRA55 into MOM5.
Really? Not sure I'm a fan TBH.
If templating is an issue, maybe supply the input files as a glob: INPUT/slp.nc: $JRA55/slp.*.nc. Payu knows what the model date is; it can search through the listed files, find the one appropriate for this run, then set up a symlink to the input directory. This avoids being too specific about the input data set, and errors can be raised if no file matches the dates of the current run.
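A sketch of how payu might resolve such a glob (the helper and the year-matching rule are assumptions based on the JRA55 filenames above):

```python
# Sketch: resolve a glob from config.yaml (e.g. INPUT/slp.nc: $JRA55/slp.*.nc)
# against the current run year and symlink the match into place.
import glob
import os
import re

def link_forcing(pattern, link, year):
    matches = [f for f in glob.glob(os.path.expandvars(pattern))
               if re.search(r'\.{}\.'.format(year), os.path.basename(f))]
    if not matches:
        raise RuntimeError('no forcing file matches year {}'.format(year))
    if os.path.lexists(link):
        os.remove(link)
    os.symlink(matches[0], link)

link_forcing('$JRA55/slp.*.nc', 'INPUT/slp.nc', 1958)
```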
Really? Not sure I'm a fan TBH.
The reason we're having this conversation is that people haven't adopted standards for forcing dataset publication. But this is happening now because it's the only efficient way for models, analyses, data etc. to interoperate.
The reason we're having this conversation is that people haven't adopted standards for forcing dataset publication. But this is happening now because it's the only efficient way for models, analyses, data etc. to interoperate.
Good point. So rather than add code to support JRA55, add code to support standards compliant forcing datasets? I'd support that.
@ScottWales I think you'd still have to consider that a template, since MOM would have no idea what to do with the glob. Maybe it is a bit safer though, since the context is very clearly a path.
@aidanheerdegen I am not necessarily opposed to the idea of using Jinja. I am just not a fan and would prefer to continue manually working with particular config files. But if you want to do it, then I'll support you.
@nicjhan So, first step, define a standards compliant forcing dataset?
@aidanheerdegen yes. Let's use JRA55 as a starting place. :)
If templating is an issue, maybe supply the input files as a glob - INPUT/slp.nc: $JRA55/slp.*.nc. Payu knows what the model date is, it can search through the listed files, find the one appropriate for this run, then set up a symlink to the input directory.
This is an interesting idea. Finding the correct file is maybe not quite as easy as it seems. At first blush loading them all up with an MFDataset call might be OK, but the issues with handling non-standard, out-of-range dates in netCDF4 python are well known.
Let's use JRA55 as a starting place.
Except that JRA55 is not using CMOR-compatible variable names: slp should be psl. Or are you saying start with JRA55 and make it conform to CMOR?
Another option is to add the time string parsing that MOM currently supports for diag_table to data_table. It is rudimentary, and is specified here:
The function that implements it is here:
I like this idea. We can also bounce it back to GFDL.
I have opened a MOM issue for this
Can we close this issue? Has libaccessom2 addressed all these concerns for ACCESS-OM2? (And is ACCESS-OM2 the only model setup in which JRA55 is used?)
JRA55 forcing data sets for MOM5-SIS configuration are split into one file per year due to their large size. Changing the forcing file requires dynamically altering the entry in the data_table file.
Entries look like this:
"ATM" , "p_surf" , "SLP" , "./INPUT/slp.nc" , "bicubic" , 1.
Can we have a discussion about the best approach for this? I'll put in some suggestions below