Closed: joeloskarsson closed this 1 month ago
Suggestion for how the projection and regional extent can be specified:

```yaml
projections:
  lambert_conformal_conic:
    proj_class: LambertConformal
    proj_params:
      central_longitude: 15.0
      central_latitude: 63.3
      standard_parallels: [63.3, 63.3]
  mercator:
    proj_class: Mercator
    proj_params:
      central_longitude: 0.0
      min_latitude: -80.0
      max_latitude: 84.0
  stereographic:
    proj_class: Stereographic
    proj_params:
      central_longitude: 0.0
      central_latitude: 90.0
      true_scale_latitude: 60.0
  rotated_pole:
    proj_class: RotatedPole
    proj_params:
      pole_longitude: 10.0
      pole_latitude: -43.0
  robinson:
    proj_class: Robinson
    proj_params:
      central_longitude: 0.0
  plate_carree:
    proj_class: PlateCarree
    proj_params:
      central_longitude: 0.0

limits:
  x_limits: [-6.82, 4.8]
  y_limits: [-4.42, 3.36]
```
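Since each `proj_class` is meant to name a `cartopy.crs` class, such a spec entry could be instantiated with `getattr`. A minimal sketch (the helper `build_projection` and its `crs_module` parameter are hypothetical, not part of neural-lam; in practice `crs_module` would be `cartopy.crs`):

```python
def build_projection(spec, crs_module):
    """Instantiate a projection object from one spec entry (sketch).

    With crs_module = cartopy.crs, a spec with
    proj_class: LambertConformal resolves to ccrs.LambertConformal,
    constructed with the keyword arguments from proj_params.
    """
    proj_class = getattr(crs_module, spec["proj_class"])
    return proj_class(**spec["proj_params"])
```

A call would then look like `build_projection(config["projections"]["lambert_conformal_conic"], cartopy.crs)`.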
Hi all, thanks for adding me. Indeed, there is something that I didn't find in this PR (as it got big and scattered across many issues it may be there but I missed it) and that was causing me some trouble: the nature of the data (reforecast vs. reanalysis) and the variables that go with it (maximum lead time for forecasts only, time step, number of time steps in a single file...).
Specifically, the number of time steps in a file is currently hard-coded (N_t' = 65 in the `WeatherDataset` class) and I had to change it to read my data. I can give you more details on how I worked around it, but a new issue would probably be more appropriate.
What do you think?
Hi @ThomasRieutord, that is a good point. We did not have reforecast support on the roadmap, but I agree that it should be. Yes, there are quite a few Issues and PRs open right now. I have a suggestion: Since the MEPS data is actually a reforecast, you could implement your solution for handling such reforecast data in https://github.com/mllam/neural-lam/pull/31? We are getting rid of everything hardcoded, so I think our goals are quite aligned :) What do you think?
PS: Do you also require the `sample_length` flag for model training based on reforecasts? I feel like this flag makes the model much harder to understand and would vote to remove that option. Thoughts?
I agree that having a Dataset class also for training on (re-)forecasts would be nice. But at the same time I do not mind changing everything to assume (re-)analysis data for now. Then we can add back such handling of forecast data later.
The current handling of `sample_length` is indeed convoluted and makes things hard to understand. This can likely be cleaned up by a nicer implementation. There still needs to be 1) a way to set `pred_length` and extract a sub-forecast to train on as a sample, and 2) a check that you are not trying to unroll the model for more steps than there are time steps in the forecast.
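Those two pieces could be sketched as follows (the function name `extract_sample` and its arguments are illustrative only, not neural-lam's actual API):

```python
def extract_sample(forecast, pred_length, n_init_states=2, start=0):
    """Slice a sub-forecast to train on: n_init_states initial states
    followed by pred_length target states (sketch).

    Raises if the requested unroll would run past the end of the forecast,
    which is the check mentioned above.
    """
    sample_length = n_init_states + pred_length
    if start + sample_length > len(forecast):
        raise ValueError(
            f"Cannot unroll {pred_length} steps from position {start}: "
            f"only {len(forecast)} time steps in the forecast"
        )
    return forecast[start : start + sample_length]
```

Here `forecast` is any sequence of time steps; with real data it would be an array indexed along the time dimension.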
Regarding the boundary data configuration: how do we specify the grid for the host model if it is on a global grid? Most ERA5 versions are on a regular lat-lon grid (although the native grid is reduced Gaussian), while the default grid for the (A)IFS HRES (a probable choice in an operational setting) is, I guess, a (reduced) Gaussian grid (https://confluence.ecmwf.int/display/UDOC/Gaussian+grids).
Start by only supporting (assuming) regular lat-lon? However, this causes the grid node distribution to be quite uneven towards the poles (probably quite noticeable for the MEPS domain). This will then have an effect on the g2m graph, where some mesh nodes will have contributions from many more grid points than others? @joeloskarsson maybe already solved this for the global model?
Does splitting the "state" into surface, atmospheric and levels add value (e.g. what if not all variables are at all levels)? What about just having a single list with state variables defined by names from the ECMWF or CF name tables: https://codes.ecmwf.int/grib/param-db/ https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html
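As an illustration of the single-list idea (variable names taken from the CF standard name table; encoding the pressure level in the name is just one possible convention, not something decided in this thread):

```yaml
state:
  - air_pressure_at_mean_sea_level   # surface variable
  - air_temperature_850              # level encoded in the name
  - geopotential_500
```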
Hi @sadamov and @joeloskarsson, I think #31 is in good shape and addresses a slightly different problem, so I would rather have it done and remove the hard-coded 65 time steps in another issue.
Just to let you know how I did so far.
On the one hand, the MEPS example is an ensemble forecast archive (so format = reforecast). For example, the file `nwp_2022040100_mbr000` contains a 65-h hourly forecast from member 0 starting from the analysis at 2022-04-01 00Z.
On the other hand, MERA is a deterministic reanalysis with 3 h between each analysis. To stick as closely as possible to the provided MEPS example, the file `nwp_2022040100_mbr000` contains 21 3-hourly analyses starting at 2022-04-01 00Z. Then it can be read by the `WeatherDataset` class with the parameters `subsample_step = 1, control_only = True`. But the hard-coded N_t' = 65 is still a problem (it should be 21). So I included a variable `n_time_steps_per_file` that I load from a yaml file (see this commit).
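For reference, once the per-file count comes from the config instead of a constant, the number of states left after subsampling can be computed from it. A small sketch (the key `n_time_steps_per_file` follows the comment above; the helper name is hypothetical):

```python
def n_usable_states(n_time_steps_per_file, subsample_step):
    """Number of states available after subsampling the raw time steps (sketch).

    Ceiling division: a subsample_step of 3 over 65 hourly states keeps
    states 0, 3, ..., 63, i.e. 22 states.
    """
    return (n_time_steps_per_file + subsample_step - 1) // subsample_step
```

With MEPS this gives `n_usable_states(65, 3) == 22`, and with the MERA setup above `n_usable_states(21, 1) == 21`.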
In my opinion, the current `sample_length` flag (= `pred_length + 2` in the `WeatherDataset` class) is an independent parameter defining the maximum lead time of the forecast the trained model will make, which can be different from the maximum lead time of the reforecast it uses for training.
I will raise a specific issue about `n_time_steps_per_file` and make a pull request once #31 has gone through, because it can probably be solved with an additional key in the yaml file. Are you OK with that?
I agree with everything you describe Thomas, but given what we are moving towards with the data loading I am not sure if it is necessary to fix these kinds of constants in the current Dataset class. If we look at what @sadamov has started working on in terms of a zarr-loading Dataset (https://github.com/mllam/neural-lam/blob/feature_dataset_yaml/neural_lam/weather_dataset.py) all of this is handled very differently, so the problems you describe should not be present. I'd much rather try to get this zarr-based (re)analysis Dataset class in place and then use this as a basis for an improved (re)forecast Dataset class.
Can we close this now that @sadamov has finished #31? :rocket: Some of the discussion here is relevant to the zarr-based dataset discussions, but maybe we should make a new issue for that?
Sorry, I had forgotten about https://github.com/mllam/neural-lam/issues/24. Maybe the zarr-dataset discussion can continue there?
This issue was closed and superseded by #24, where the work on zarr archives for boundaries, normalization statistics and more can continue.
This supersedes #2 and #3.
Motivation
It is currently very hard to work with neural-lam on different regions due to everything related to data and the forecast region being specified hard-coded in `constants.py`. It would be much better to specify this in a config file that you can then point to. Yaml seems like a suitable format for this.
Proposition
The main training/eval script takes a flag `--spec` that should be given a path to a yaml file. This yaml file specifies all the data that goes into the model and information about the region the model should be working with. Current options in `constants.py` that relate to what to plot should all be turned into flags.
The yaml file should be read in and turned into a single object that contains all useful information and can be passed around in the code (since this is needed almost everywhere). Having this as an object means that it can also compute things not directly in the yaml file, such as units of variables that can be retrieved from loaded xarray data.
Design
Here is an idea for how the yaml file could be laid out with examples:
Start of file:
Some comments to keep track of what this specification is about. We don't enforce any content in there.
Forecast area data configuration
This describes the data configuration for the actual limited area that you are training on. Explicitly, the "inner region", not the "boundary". What is specified is what zarrs to load state, forcing and static grid features from, and which variables in these to use for each.
The spatial dimensions can be given either as a single `grid` dimension or as the names of `x` and `y` dimensions. If `x` and `y` are given, we will flatten ourselves in the data loading code. If only `grid` is specified, a `grid_shape` entry has to be given (see below).
Boundary data configuration
The boundary data configuration follows exactly the same structure as the forecast area one, with two differences:
- No `state` entry is allowed, as we do not forecast the boundary nodes atm.
Grid shape
If the zarrs already contain flattened grid dimensions, we need knowledge of the original 2d spatial shape in order to be able to plot data. For such cases this can be given by an optional `grid_shape` entry:
Subset splits
The train/val/test split is defined based on timestamps:
Used by the dataset class to `.sel` the different subsets.
Forecast area projection
In order to be able to plot data in the forecasting area we need to know what projection the area is defined in. By plotting in this projection we end up with a flat rectangular area where the data sits. This should be specified as a reference to a cartopy.crs object.
Normalization zarr
We also need information about statistics of the variables, boundary and forcing for normalization (mean and std). Additionally, we need the inverse variances used in the loss computation. As we compute and save these in a pre-processing script, we can enforce a specific format, so let's put all of them in their own zarr. Then we only need to specify a path here to that zarr to load it from.
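To make the role of these statistics concrete, here is a pure-Python sketch of how the stored mean/std and inverse variances would enter training (illustrative only, not the actual implementation, which operates on tensors):

```python
def normalize(x, mean, std):
    """Standardize one value using stats from the normalization zarr."""
    return (x - mean) / std

def weighted_sq_error(pred, target, inv_var):
    """Squared error averaged over variables, each weighted by its
    inverse variance as used in the loss computation (sketch)."""
    return sum(
        w * (p - t) ** 2 for p, t, w in zip(pred, target, inv_var)
    ) / len(pred)
```

The pre-processing script would compute `mean`, `std` and `inv_var` per variable and write them to the zarr that this config entry points at.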