Creation of Static and Forcing Fields

sadamov commented 1 month ago

Summary

For international collaboration a selection of state variables, as well as static and forcing features were selected (see here for reference: Sheet). While the variables are provided by the user in a zarr-archive (usually NWP-model output), the additional forcing and static features are often not available. I suggest to provide the user with the option to generate these fields using a new script create_forcings.py. The script works for state and boundary domains alike, simple visualizations were provided to make this issue easier to understand. Please feel free to improve on any part of the script and help writing PRs :rocket:

New fields

datetime forcings,
land-sea masks,
boundary masks,
top-of-atmosphere (TOA) radiation forcings.

Questions / Issues:

Does such a script belong in the neuralLAM repo or in a pre-processing step?
Should it be one or multiple scripts?
I have a script version for COSMO and MEPS data - should become domain agnostic
See below for questions specific to each field
Currently the fields are stored in one .zarr archive. It probably makes sense to specify the zarr archive where each field should append to.

Function Details and Rationale

`calculate_datetime_forcing(ds, args)`

This function calculates the datetime forcing for the neural LAM model by generating sinusoidal representations of the hour of the day and the time of the year. These features help the model understand diurnal and seasonal variations. Previously these were part of the PyTorch Dataset, provided in __get_item__; it certainly makes more sense to calculate these in advance, should be more efficient and allows for cleaner Dataset code.

`generate_land_sea_mask(xy_state, tempdir, projection, high_res_factor=10)`

This function generates a land-sea mask for the neural LAM model. The shapefile for the land-sea mask comes from natural-earth at 50m resolution. Percentage of gridcell covered by land is calculated by rasterization of high-res shapefile. This generates static fields, for some regions this mask will vary in time as parts of the sea freeze.

`create_boundary_mask(lonlat_state, lonlat_boundary, boundary_thickness, overlap=False)`

This function creates a boundary mask for the neural LAM model. The boundary mask defines the edges of the model's domain and helps in handling boundary conditions, which are essential for limited area modeling. It allows for overlap (Flatbread) or non-overlapping boundaries (Donut). The borders of the interior state domain are identified using anemoi-dataset functionalitites for lat-lon matching of boundary and state. The thickness of the boundary is applied using binary_dilution, which is a terrible approach -> should be calculated in meters in projection of state.

`generate_toa_radiation_forcing(ds, lonlat)`

This function pre-computes all static features related to the grid nodes for top-of-atmosphere (TOA) radiation. This functionality is provided by the graphcast repository. It is however quite slow for large domains and again depends on lat lon. Maybe there are better options.

Usage

The example script is based on MEPS (since we all have that data) and ERA5, directly from the google weatherbench cloud storage. To run it, store the script on the first level of the NeuralLAM repo, with meps_example in the data-folder:

pip install gcsfs 
pip install rasterio
pip install git+https://github.com/google-deepmind/graphcast.git
python create_forcings.py

sadamov commented 1 month ago

@leifdenby As we discussed in the meeting: the boundary mask certainly has high importance as it is not a simple static feature but potentially used in loss-calculation and graph-design. As you can see from the plots, I only got so far as to create a very basic working example for MEPS+ERA5 as a basis for discussion and future PRs. It would be great to get your boundary masking expertise here :handshake:

joeloskarsson commented 1 month ago

Is a land-sea mask or a land-water mask (including lakes) most relevant? I have seen that for the MEPS area the large Swedish lakes impact some surface variables, so it would perhaps be nice to be able to inform the model of lakes also. Although, we could definitely provide a utility for land-sea mask and then if one wants something more specific (like way I describe) you use have to get that from somewhere else and put it into your zarr separately.
I am generally for including these types of utility scripts in neural-lam, as they are highly useful (even crucial) for working with the codebase. I think it would be good to group all such scripts in some way though, perhaps as a utility sub-module or similar.

TomasLandelius commented 1 month ago

For many static (climatological) surface parameters I guess ECOCLIMAP-II would work: https://opensource.umr-cnrm.fr/projects/ecoclimap/wiki (1 km resolution) When it comes to more dynamical things like sea and lake ice I guess one would get that from satellite based products. Leaving this for later now.

TomasLandelius commented 1 month ago

I have something to do

create_boundary_mask(lonlat_state, lonlat_boundary, crs_state, boundary_min_distance, boundary_max_distance, overlap=False)

producing a result like this for the MEPS domain with ERA5 (latlon projection) on the boundary. Figure_1

TomasLandelius commented 1 month ago

Regarding TOA irradiance. What about something like this:

def toa_solirr(lat, lon, day, hr_utc):
    dec = -23.45 * np.cos(np.pi/180*360/365*(day-81)) * np.pi/180
    hr_lst = hr_utc + lon/15
    hr_angle = 15 * (hr_lst - 12)
    return np.sin(lat*np.pi/180) * np.sin(dec) + np.cos(lat*np.pi/180) * np.cos(dec) * np.cos(hr_angle*np.pi/180)

Is that also too slow?

%timeit toa_solirr(lat_era, lon_era, 0, 0) 76.7 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

joeloskarsson commented 4 weeks ago

I have something to do

create_boundary_mask(lonlat_state, lonlat_boundary, crs_state, boundary_min_distance, boundary_max_distance, overlap=False)

producing a result like this for the MEPS domain with ERA5 (latlon projection) on the boundary.

This looks great I think! How general is this (e.g. how hard to use it for another domain, another projection)? Can you share the full code for this? What is the unit for the distances now? meters?

Regarding TOA irradiance. What about something like this:
def toa_solirr(lat, lon, day, hr_utc):
    dec = -23.45 * np.cos(np.pi/180*360/365*(day-81)) * np.pi/180
    hr_lst = hr_utc + lon/15
    hr_angle = 15 * (hr_lst - 12)
    return np.sin(lat*np.pi/180) * np.sin(dec) + np.cos(lat*np.pi/180) * np.cos(dec) * np.cos(hr_angle*np.pi/180)
Is that also too slow?

%timeit toa_solirr(lat_era, lon_era, 0, 0) 76.7 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Since it is entirely parallelizable that should be quite fast. I tested a parallel version with broadcasting:

import numpy as np
import xarray as xr
import time

zarr_path = "data/era5_latlon/ll.zarr"

ll_xds = xr.open_zarr(zarr_path)

lon_1d = ll_xds.longitude.to_numpy() # (num_lon,)
lat_1d = ll_xds.latitude.to_numpy() # (num_lat,)

# dim order is (lat, lon, day, hr)
def toa_solirr(lat, lon, day, hr_utc):
    lat = lat[:, np.newaxis, np.newaxis, np.newaxis]
    lon = lon[np.newaxis, :, np.newaxis, np.newaxis]
    day = day[np.newaxis, np.newaxis, :, np.newaxis]
    hr_utc = hr_utc[np.newaxis, np.newaxis, np.newaxis, :]

    dec = -23.45 * np.cos(np.pi / 180 * 360 / 365 * (day - 81)) * np.pi / 180
    hr_lst = hr_utc + lon / 15
    hr_angle = 15 * (hr_lst - 12)
    return np.sin(lat * np.pi / 180) * np.sin(dec) + np.cos(
        lat * np.pi / 180
    ) * np.cos(dec) * np.cos(hr_angle * np.pi / 180)

day = np.arange(7)
hour = np.arange(24)

start_time = time.time()
toa = toa_solirr(lat_1d, lon_1d, day, hour)
end_time = time.time()
print(f"Took: {end_time-start_time} s")

(quick test code, but does the trick) That runs in around 1 second for a week, so around a minute per year (on my laptop, latlons from ERA5 0.25 deg.), which seems acceptable. I have not looked into the math of this, but what is the difference between this and the GraphCast TOA implementation (https://github.com/google-deepmind/graphcast/blob/main/graphcast/solar_radiation.py)? A benefit of that one is that it uses Xarray. I quickly run out of memory doing this in numpy. I don't think it's a big problem if this takes a bit of time, since you should not have to re-compute it very often. Another thought (given how paralellizable this is), is to try to do it on GPU? Could almost just replace the np in Tomas example with torch. But then there will be a lot of data to shuffle back to the CPU and save to disk.

TomasLandelius commented 4 weeks ago

The difference to GC implementation is that they focus on having something very similar to what goes into ERA5 while mine opts for as simple as possible. Maybe too simple? It does for example not care about the Sun-Earth distance that causes ca 7 % variation during the year. Still, I think other things contribute more to the error than imperfections in this forcing. However, I've not checked the output from my version with more precise ones - should be done.

TomasLandelius commented 3 weeks ago

Here is an update with a correction to the declination. It now outputs irradiance [Wm**-2] so that it can be checked more easily. Also input is given by DataArray coordinates.

def toa_solirr(dt, lon, lat):
    # same dim order as in xda
    lon = lon.values[np.newaxis, :, np.newaxis]
    lat = lat.values[np.newaxis, np.newaxis, :]
    day = dt.dayofyear.values[:, np.newaxis, np.newaxis]
    hr_utc = dt.hour.values[:, np.newaxis, np.newaxis]
    dec = np.pi/180 * 23.45 * np.sin(2*np.pi * (284+day) / 365) 
    hr_lst = hr_utc + lon / 15
    hr_angle = 15 * (hr_lst - 12)
    cos_sza = np.sin(lat * np.pi / 180) * np.sin(dec) + np.cos(
        lat * np.pi / 180
    ) * np.cos(dec) * np.cos(hr_angle * np.pi / 180)
    return np.fmax(0, 1366 * cos_sza) 

toa = toa_solirr(xda['time'].dt, xda['longitude'], xda['latitude'])

I tried to get it to work as a _ufunc like this but to no avail. Maybe someone familiar with Dask can make it work?

toa = xr.apply_ufunc(
    toa_solirr,
    xda['time'].dt, 
    xda['longitude'],
    xda['latitude'],
    input_core_dims=[['time'], ['latitude'], ['longitude']],
    output_core_dims=[[]],
    dask='parallelized'
)

sadamov commented 3 weeks ago

The datetime forcings are now implemented in #54 without the need for a seconds_in_year constant since that information is now directly retrieved from the xarray Dataset.

mllam / neural-lam