mllam / neural-lam

Neural Weather Prediction for Limited Area Modeling

Replace constants.py with data + region specification from yaml-file #23

Closed: joeloskarsson closed this issue 1 month ago

joeloskarsson commented 1 month ago

This supersedes #2 and #3.

Motivation

It is currently very hard to work with neural-lam on different regions, because everything related to the data and the forecast region is hard-coded in constants.py. It would be much better to specify this in a config file that you can then point to. Yaml seems like a suitable format for this.

Proposition

The main training/eval script takes a flag --spec that should be given a path to a yaml-file. This yaml file specifies all the data that goes into the model and information about the region the model should be working with.

Current options in constants.py that relate to what to plot should all be turned into flags.

The yaml file should be read in and turned into a single object that contains all useful information and can be passed around in the code (since this is needed almost everywhere). Having this as an object means that it can also compute things not directly in the yaml file, such as units of variables that can be retrieved from loaded xarray data.
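As a rough sketch of what that could look like (the Config class and its methods here are placeholders, not existing neural-lam code), the yaml file could be wrapped in a small object that also derives extra information from the loaded data:

import xarray as xr
import yaml


class Config:
    """Wraps the parsed yaml spec and derives extra information from it."""

    def __init__(self, values):
        self.values = values  # nested dict parsed from the yaml file

    @classmethod
    def from_file(cls, path):
        with open(path, encoding="utf-8") as f:
            return cls(yaml.safe_load(f))

    def __getitem__(self, key):
        return self.values[key]

    def variable_units(self, zarr_path, var_names):
        # Example of derived information: read units from the zarr
        # attributes instead of repeating them in the yaml file
        ds = xr.open_zarr(zarr_path)
        return {name: ds[name].attrs.get("units") for name in var_names}


# In the training/eval script, after parsing the --spec flag:
# config = Config.from_file(args.spec)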

Design

Here is an idea for how the yaml file could be laid out with examples:

Start of file:

# Data config for MEPS dataset
---

The file starts with some comments to keep track of what this specification is about. We don't enforce any particular content there.

Forecast area data configuration

This describes the data configuration for the actual limited area that you are training on, i.e. explicitly the "inner region", not the "boundary". It specifies which zarrs to load state, forcing and static grid features from, and which variables in these to use for each.

forecast_area:
  zarrs: # Zarrs containing the fields to load
    fields: # Names on this level are arbitrary
      path: data/meps_example/fields.zarr # Path to zarr
      dims: # Name of dimensions in zarr, to be used for indexing
        time: time
        level: level
        grid: grid_node # Either give "grid" (flattened) dimension or "x" and "y"
    forcing: 
      path: data/meps_example/forcing.zarr
      dims:
        time: time
        level: level
        x: x_coord 
        y: y_coord
  state: # Variables forecasted by the model
    surface: # Single-field variables
      -  2t
      - 10u
      - ...
    atmospheric: # Variables with vertical levels
      - z
      - u
      - v
      - ...
    levels: # Levels to use for atmospheric variables 
      - 700
      - 850
      - ...
    units:
      z: m2/s2
      2t: K
      ...
  forcing: # Forcing variables, dynamic inputs to the model
      surface:
        - land_sea_mask # Dynamic in MEPS
        - ...
      atmospheric:
        - ... # Nothing for MEPS, but we allow it
      levels:
        - ...
  static: # Static inputs
      surface:
        - topography 
        - ...
      atmospheric:
        - ... # Nothing for MEPS, but we allow it
      levels:
        - ...
  lat_lon_names: # Name of variables/coordinates in zarrs specifying latitude and longitude of grid cells
    lat: latitude
    lon: longitude
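
Just to illustrate how the dims mapping and variable lists could be consumed (load_state is a hypothetical helper, not existing code, and assumes the config has been parsed into a nested dict as above):

import xarray as xr


def load_state(forecast_area):
    # forecast_area: the "forecast_area" section of the parsed yaml
    zarr_cfg = forecast_area["zarrs"]["fields"]
    ds = xr.open_zarr(zarr_cfg["path"])

    state_cfg = forecast_area["state"]
    surface = ds[state_cfg["surface"]]  # single-field variables
    # Atmospheric variables restricted to the configured levels,
    # indexed using the level dimension name given in the yaml
    level_dim = zarr_cfg["dims"]["level"]
    atmospheric = ds[state_cfg["atmospheric"]].sel({level_dim: state_cfg["levels"]})
    return surface, atmospheric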

Boundary data configuration

The boundary data configuration follows exactly the same structure as the forecast_area, with two differences:

  1. No state entry is allowed, as we do not forecast the boundary nodes atm.
  2. There is a mask entry, specifying which grid cells of the boundary to include. The boundary has its own list of zarrs, to avoid variable name clashes with the forecast area zarrs. Note that we enforce no spatial structure of the boundary w.r.t. the forecast area; the boundary nodes can be placed anywhere.

boundary:
  zarrs:
  ...
  mask: boundary_mask # Name of variable containing boolean mask, true for grid nodes to be used in boundary.
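
A sketch of how the mask could be applied when loading boundary data (assuming a flattened grid dimension and a zarr entry named fields as in the forecast_area example; none of these names are fixed):

import numpy as np
import xarray as xr

boundary_cfg = config["boundary"]
zarr_cfg = boundary_cfg["zarrs"]["fields"]
ds = xr.open_zarr(zarr_cfg["path"])

# Boolean mask over the grid dimension, True for nodes to include
mask = ds[boundary_cfg["mask"]]
grid_dim = zarr_cfg["dims"]["grid"]
keep_idx = np.flatnonzero(mask.values)  # positions of True entries
boundary_ds = ds.isel({grid_dim: keep_idx})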

Grid shape

If the zarrs already contain flattened grid dimensions, we need to know the original 2D spatial shape in order to plot the data. For such cases this can be given by an optional grid_shape entry:

grid_shape:
   x: 238
   y: 268
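
For example (a sketch; whether the flattened order is y-major or x-major is an assumption that would have to be fixed by convention):

import numpy as np

def to_2d(field_flat, grid_shape):
    # field_flat: 1D numpy array over the flattened grid dimension
    # grid_shape: the grid_shape entry from the yaml file
    return np.reshape(field_flat, (grid_shape["y"], grid_shape["x"]))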

Subset splits

The train/val/test split is defined based on timestamps:

splits:
  train:
    start: 2021-04-01T00
    end: 2022-05-31T23
  val:
    start: 2022-06-01T00
    end: 2022-06-30T23
  test:
    start: 2022-07-01T00
    end: 2023-03-31T23

These timestamps are used by the dataset class to .sel the different subsets.
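
For example (a sketch; yaml will typically hand these timestamps over as strings, which xarray accepts for time slicing):

# Select the training subset from an already opened xarray dataset,
# using the time dimension name from the zarr config (here just "time")
train_cfg = config["splits"]["train"]
train_ds = ds.sel(time=slice(train_cfg["start"], train_cfg["end"]))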

Forecast area projection

In order to be able to plot data in the forecasting area we need to know what projection the area is defined in. By plotting in this projection we end up with a flat rectangular area where the data sits. This should be specified as a reference to a cartopy.crs object.

projection: LambertConformal # Name of class in cartopy.crs
kwargs: # Parsed and used directly as kwargs to class above
    central_longitude: 15.0
    central_latitude: 63.3
    standard_parallels:
       - 63.3
       - 63.3
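
A sketch of how this could be turned into an actual projection object, looking the class up by name in cartopy.crs:

import cartopy.crs as ccrs

proj_class = getattr(ccrs, config["projection"])  # e.g. ccrs.LambertConformal
projection = proj_class(**config["kwargs"])

# The projection can then be passed when creating plot axes, e.g.
# ax = plt.axes(projection=projection)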

Normalization zarr

We also need information about the statistics (mean and std) of the state, boundary and forcing variables for normalization. Additionally we need the inverse variances used in the loss computation. As we compute and save these in a pre-processing script we can enforce a specific format, so let's put all of them in their own zarr. Then we only need to specify here a path to that zarr to load it from.

normalization_zarr: data/meps_example/norm.zarr
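
For example (a sketch; the variable names inside the normalization zarr, like "state_mean" and "state_std" here, would be fixed by the pre-processing script and are only placeholders):

import xarray as xr

norm_ds = xr.open_zarr(config["normalization_zarr"])
# state_da: a DataArray of state variables to normalize
state_normalized = (state_da - norm_ds["state_mean"]) / norm_ds["state_std"]
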
sadamov commented 1 month ago

Suggestion how the projection and regional extent can be specified:

projections:

  lambert_conformal_conic:
    proj_class: LambertConformal
    proj_params:
      central_longitude: 15.0
      central_latitude: 63.3
      standard_parallels: [63.3, 63.3]

  mercator:
    proj_class: Mercator
    proj_params:
      central_longitude: 0.0
      min_latitude: -80.0
      max_latitude: 84.0

  stereographic:
    proj_class: Stereographic
    proj_params:
      central_longitude: 0.0
      central_latitude: 90.0
      true_scale_latitude: 60.0

  rotated_pole:
    proj_class: RotatedPole
    proj_params:
      pole_longitude: 10.0
      pole_latitude: -43.0

  robinson:
    proj_class: Robinson
    proj_params:
      central_longitude: 0.0

  plate_carree:
    proj_class: PlateCarree
    proj_params:
      central_longitude: 0.0

limits:
  x_limits: [-6.82, 4.8]
  y_limits: [-4.42, 3.36]
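
A sketch of how such an entry could be consumed when plotting (make_axes is a hypothetical helper; the limits are assumed to be given in projection coordinates):

import cartopy.crs as ccrs
import matplotlib.pyplot as plt

def make_axes(proj_cfg, limits):
    # proj_cfg: one entry under "projections", e.g. lambert_conformal_conic
    proj_class = getattr(ccrs, proj_cfg["proj_class"])
    projection = proj_class(**proj_cfg["proj_params"])
    ax = plt.axes(projection=projection)
    # set_extent takes [x_min, x_max, y_min, y_max] in the given crs
    ax.set_extent(limits["x_limits"] + limits["y_limits"], crs=projection)
    return ax
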
ThomasRieutord commented 1 month ago

Hi all, thanks for adding me. Indeed, there is something that I didn't find in this PR (as it got big and scattered across many issues it may be there, but I missed it) and that was causing me some trouble: the nature of the data (reforecast vs reanalysis) and the variables that go with it (maximum lead time (forecast only), time step, number of time steps in a single file, ...).

Specifically, the number of time steps in a file is currently hard-coded (N_t' = 65 in the WeatherDataset class) and I had to change it to read my data. I can give you more details on how I worked around it, but a new issue would probably be more appropriate.

What do you think?

sadamov commented 1 month ago

Hi @ThomasRieutord, that is a good point. We did not have reforecast support on the roadmap, but I agree that it should be. Yes, there are quite a few Issues and PRs open right now. I have a suggestion: Since the MEPS data is actually a reforecast, you could implement your solution for handling such reforecast data in https://github.com/mllam/neural-lam/pull/31? We are getting rid of everything hardcoded, so I think our goals are quite aligned :) What do you think?

PS: Do you also require the sample_length flag for model training based on reforecasts? I feel like this flag makes the model much harder to understand and would vote to remove that option. Thoughts?

joeloskarsson commented 1 month ago

I agree that having a Dataset class also for training on (re-)forecasts would be nice. But at the same time I do not mind changing everything to assume (re-)analysis data for now. Then we can add back such handling of forecast data again later.

The current handling of sample_length is indeed convoluted and makes things hard to understand. This can likely be cleaned up by a nicer implementation. There still needs to be 1) a way to set pred_length and extract a sub-forecast to train on as a sample, and 2) a check that you are not trying to unroll the model for more steps than there are time steps in the forecast.
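
As a sketch of the check in 2) (variable names are illustrative; the +2 accounts for the two initial states, given that sample_length = pred_length + 2 today):

# n_time_steps: number of time steps available in the (re)forecast sample
assert pred_length + 2 <= n_time_steps, (
    f"Cannot unroll model for {pred_length} steps, "
    f"sample only contains {n_time_steps} time steps"
)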

TomasLandelius commented 1 month ago

Regarding the boundary data configuration: how do we specify the grid for the host model if it is on a global grid? Most ERA5 versions are on a regular lat-lon grid (although the native grid is reduced Gaussian), while the default grid for the (A)IFS HRES (a probable choice in an operational setting) is, I guess, a (reduced) Gaussian grid (https://confluence.ecmwf.int/display/UDOC/Gaussian+grids).

Start by only supporting (assuming) regular lat-lon? However, this causes the grid node distribution to be quite uneven towards the poles (probably quite noticeable for the MEPS domain). This will then have an effect on the g2m, where some nodes will get contributions from many more grid points than others? @joeloskarsson, maybe you have already solved this for the global model?

TomasLandelius commented 1 month ago

Does splitting the "state" into surface, atmospheric and levels add value (e.g. what if not all variables are available at all levels)? What about just having a single list of state variables, defined by names from the ECMWF or CF name tables: https://codes.ecmwf.int/grib/param-db/ https://cfconventions.org/Data/cf-standard-names/current/build/cf-standard-name-table.html

ThomasRieutord commented 1 month ago

Hi @sadamov and @joeloskarsson, I think #31 is in good shape and answers a slightly different problem, so I would rather have it done and remove the hard-coded 65 time steps in another issue.

Just to let you know how far I have got. On the one hand, the MEPS example is an ensemble forecast archive (so format = reforecast). For example, the file nwp_2022040100_mbr000 contains a 65 h hourly forecast from member 0, starting from the analysis at 2022-04-01 00Z. On the other hand, MERA is a deterministic reanalysis with 3 h between each analysis. To stick as close as possible to the provided MEPS example, the file nwp_2022040100_mbr000 contains 21 3-hourly analyses starting at 2022-04-01 00Z. Then it can be read by the WeatherDataset class with the parameters subsample_step = 1, control_only = True. But the hard-coded N_t' = 65 is still a problem (it should be 21). So I included a variable n_time_steps_per_file that I load from a yaml file (see this commit).

In my opinion, the current sample_length flag (= pred_length + 2 in the WeatherDataset class) is an independent parameter defining the maximum lead time of the forecast the trained model will do, which can be different from the maximum lead time of the reforecast it uses for training.

I will raise a specific issue about the n_time_steps_per_file and make a pull request once #31 has gone through, because it can probably be solved with an additional key in the yaml file. Are you OK with that?

joeloskarsson commented 1 month ago

I agree with everything you describe Thomas, but given what we are moving towards with the data loading I am not sure if it is necessary to fix these kinds of constants in the current Dataset class. If we look at what @sadamov has started working on in terms of a zarr-loading Dataset (https://github.com/mllam/neural-lam/blob/feature_dataset_yaml/neural_lam/weather_dataset.py) all of this is handled very differently, so the problems you describe should not be present. I'd much rather try to get this zarr-based (re)analysis Dataset class in place and then use this as a basis for an improved (re)forecast Dataset class.

leifdenby commented 1 month ago

Can we close this now that @sadamov has finished #31? :rocket: Some of the discussion here is relevant to the zarr-based dataset discussions, but maybe we should make a new issue for that?

leifdenby commented 1 month ago

Sorry, I had forgotten about https://github.com/mllam/neural-lam/issues/24. Maybe the zarr-dataset discussion can continue there?

sadamov commented 1 month ago

This issue was closed and superseded by #24, where the work on zarr archives for boundaries, normalization statistics and more can continue.