Open sadamov opened 1 month ago
@leifdenby As we discussed in the meeting: the boundary mask certainly has high importance as it is not a simple static feature but potentially used in loss-calculation and graph-design. As you can see from the plots, I only got so far as to create a very basic working example for MEPS+ERA5 as a basis for discussion and future PRs. It would be great to get your boundary masking expertise here :handshake:
For many static (climatological) surface parameters I guess ECOCLIMAP-II would work: https://opensource.umr-cnrm.fr/projects/ecoclimap/wiki (1 km resolution) When it comes to more dynamical things like sea and lake ice I guess one would get that from satellite based products. Leaving this for later now.
I have something to do
create_boundary_mask(lonlat_state, lonlat_boundary, crs_state, boundary_min_distance, boundary_max_distance, overlap=False)
producing a result like this for the MEPS domain with ERA5 (latlon projection) on the boundary.
Regarding TOA irradiance. What about something like this:
def toa_solirr(lat, lon, day, hr_utc):
dec = -23.45 * np.cos(np.pi/180*360/365*(day-81)) * np.pi/180
hr_lst = hr_utc + lon/15
hr_angle = 15 * (hr_lst - 12)
return np.sin(lat*np.pi/180) * np.sin(dec) + np.cos(lat*np.pi/180) * np.cos(dec) * np.cos(hr_angle*np.pi/180)
Is that also too slow?
%timeit toa_solirr(lat_era, lon_era, 0, 0) 76.7 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have something to do
create_boundary_mask(lonlat_state, lonlat_boundary, crs_state, boundary_min_distance, boundary_max_distance, overlap=False)
producing a result like this for the MEPS domain with ERA5 (latlon projection) on the boundary.
This looks great I think! How general is this (e.g. how hard to use it for another domain, another projection)? Can you share the full code for this? What is the unit for the distances now? meters?
Regarding TOA irradiance. What about something like this:
def toa_solirr(lat, lon, day, hr_utc): dec = -23.45 * np.cos(np.pi/180*360/365*(day-81)) * np.pi/180 hr_lst = hr_utc + lon/15 hr_angle = 15 * (hr_lst - 12) return np.sin(lat*np.pi/180) * np.sin(dec) + np.cos(lat*np.pi/180) * np.cos(dec) * np.cos(hr_angle*np.pi/180)
Is that also too slow?
%timeit toa_solirr(lat_era, lon_era, 0, 0) 76.7 ms ± 314 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Since it is entirely parallelizable that should be quite fast. I tested a parallel version with broadcasting:
import numpy as np
import xarray as xr
import time
zarr_path = "data/era5_latlon/ll.zarr"
ll_xds = xr.open_zarr(zarr_path)
lon_1d = ll_xds.longitude.to_numpy() # (num_lon,)
lat_1d = ll_xds.latitude.to_numpy() # (num_lat,)
# dim order is (lat, lon, day, hr)
def toa_solirr(lat, lon, day, hr_utc):
lat = lat[:, np.newaxis, np.newaxis, np.newaxis]
lon = lon[np.newaxis, :, np.newaxis, np.newaxis]
day = day[np.newaxis, np.newaxis, :, np.newaxis]
hr_utc = hr_utc[np.newaxis, np.newaxis, np.newaxis, :]
dec = -23.45 * np.cos(np.pi / 180 * 360 / 365 * (day - 81)) * np.pi / 180
hr_lst = hr_utc + lon / 15
hr_angle = 15 * (hr_lst - 12)
return np.sin(lat * np.pi / 180) * np.sin(dec) + np.cos(
lat * np.pi / 180
) * np.cos(dec) * np.cos(hr_angle * np.pi / 180)
day = np.arange(7)
hour = np.arange(24)
start_time = time.time()
toa = toa_solirr(lat_1d, lon_1d, day, hour)
end_time = time.time()
print(f"Took: {end_time-start_time} s")
(quick test code, but does the trick)
That runs in around 1 second for a week, so around a minute per year (on my laptop, latlons from ERA5 0.25 deg.), which seems acceptable.
I have not looked into the math of this, but what is the difference between this and the GraphCast TOA implementation (https://github.com/google-deepmind/graphcast/blob/main/graphcast/solar_radiation.py)? A benefit of that one is that it uses Xarray. I quickly run out of memory doing this in numpy.
I don't think it's a big problem if this takes a bit of time, since you should not have to re-compute it very often. Another thought (given how paralellizable this is), is to try to do it on GPU? Could almost just replace the np
in Tomas example with torch
. But then there will be a lot of data to shuffle back to the CPU and save to disk.
The difference to GC implementation is that they focus on having something very similar to what goes into ERA5 while mine opts for as simple as possible. Maybe too simple? It does for example not care about the Sun-Earth distance that causes ca 7 % variation during the year. Still, I think other things contribute more to the error than imperfections in this forcing. However, I've not checked the output from my version with more precise ones - should be done.
Here is an update with a correction to the declination. It now outputs irradiance [Wm**-2] so that it can be checked more easily. Also input is given by DataArray coordinates.
def toa_solirr(dt, lon, lat):
# same dim order as in xda
lon = lon.values[np.newaxis, :, np.newaxis]
lat = lat.values[np.newaxis, np.newaxis, :]
day = dt.dayofyear.values[:, np.newaxis, np.newaxis]
hr_utc = dt.hour.values[:, np.newaxis, np.newaxis]
dec = np.pi/180 * 23.45 * np.sin(2*np.pi * (284+day) / 365)
hr_lst = hr_utc + lon / 15
hr_angle = 15 * (hr_lst - 12)
cos_sza = np.sin(lat * np.pi / 180) * np.sin(dec) + np.cos(
lat * np.pi / 180
) * np.cos(dec) * np.cos(hr_angle * np.pi / 180)
return np.fmax(0, 1366 * cos_sza)
toa = toa_solirr(xda['time'].dt, xda['longitude'], xda['latitude'])
I tried to get it to work as a _ufunc like this but to no avail. Maybe someone familiar with Dask can make it work?
toa = xr.apply_ufunc(
toa_solirr,
xda['time'].dt,
xda['longitude'],
xda['latitude'],
input_core_dims=[['time'], ['latitude'], ['longitude']],
output_core_dims=[[]],
dask='parallelized'
)
The datetime forcings are now implemented in #54 without the need for a seconds_in_year
constant since that information is now directly retrieved from the xarray Dataset.
Summary
For international collaboration a selection of state variables, as well as static and forcing features were selected (see here for reference: Sheet). While the variables are provided by the user in a zarr-archive (usually NWP-model output), the additional forcing and static features are often not available. I suggest to provide the user with the option to generate these fields using a new script create_forcings.py. The script works for state and boundary domains alike, simple visualizations were provided to make this issue easier to understand. Please feel free to improve on any part of the script and help writing PRs :rocket:
New fields
Questions / Issues:
Function Details and Rationale
calculate_datetime_forcing(ds, args)
This function calculates the datetime forcing for the neural LAM model by generating sinusoidal representations of the hour of the day and the time of the year. These features help the model understand diurnal and seasonal variations. Previously these were part of the PyTorch Dataset, provided in
__get_item__
; it certainly makes more sense to calculate these in advance, should be more efficient and allows for cleaner Dataset code.generate_land_sea_mask(xy_state, tempdir, projection, high_res_factor=10)
This function generates a land-sea mask for the neural LAM model. The shapefile for the land-sea mask comes from natural-earth at 50m resolution. Percentage of gridcell covered by land is calculated by rasterization of high-res shapefile. This generates static fields, for some regions this mask will vary in time as parts of the sea freeze.![image](https://github.com/mllam/neural-lam/assets/45732287/6ac99fe7-6312-4825-9ba1-f073e246242f)
create_boundary_mask(lonlat_state, lonlat_boundary, boundary_thickness, overlap=False)
This function creates a boundary mask for the neural LAM model. The boundary mask defines the edges of the model's domain and helps in handling boundary conditions, which are essential for limited area modeling. It allows for overlap (Flatbread) or non-overlapping boundaries (Donut). The borders of the interior state domain are identified using
anemoi-dataset
functionalitites for lat-lon matching of boundary and state. The thickness of the boundary is applied using binary_dilution, which is a terrible approach -> should be calculated in meters in projection of state.generate_toa_radiation_forcing(ds, lonlat)
This function pre-computes all static features related to the grid nodes for top-of-atmosphere (TOA) radiation. This functionality is provided by the graphcast repository. It is however quite slow for large domains and again depends on lat lon. Maybe there are better options.
Usage
The example script is based on MEPS (since we all have that data) and ERA5, directly from the google weatherbench cloud storage. To run it, store the script on the first level of the NeuralLAM repo, with
meps_example
in the data-folder: