openclimatefix / nwp-consumer

Microservice for consuming NWP data.

Move OSGB transformation hack out of nwp-consumer and into ocf-datapipes #26

Open devsjc opened 1 year ago

devsjc commented 1 year ago

In order to quickly deploy the consumer and have it work with the soon-to-be-retired PVNet1, which expects the "x" and "y" coordinates to be (poorly labelled) OSGB coordinates, a "quick fix" was put in place in the consumer to approximate those values. (See inputs/metoffice/client.py.) This is not an optimal end position, however: OCF are moving away from OSGB and towards lat/long anyway, not to mention the ambiguity that bare "x" and "y" coordinates add to the dataset.

Instead, this logic to convert from lat/long to OSGB should be implemented in ocf-datapipes, and the version of that library used in PVNet1/2 can then be bumped accordingly.

As such, this issue mostly requires code changes in ocf-datapipes and the PVNet1/2 repos. The net change in this repo should be exactly the inverse of that which was added with the implementation of the "quick fix" in https://github.com/openclimatefix/nwp-consumer/pull/27, i.e. removing all that was added there and nothing else.
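
For context, the conversion itself is a standard coordinate transformation. A minimal sketch of the kind of logic that would move into ocf-datapipes (illustrative only, not the actual client.py implementation):

import numpy as np
import pyproj

# WGS84 lat/lon (EPSG:4326) -> OSGB / British National Grid (EPSG:27700)
latlon_to_osgb = pyproj.Transformer.from_crs(
    crs_from=4326,
    crs_to=27700,
    always_xy=True,  # take (lon, lat) in, return (x, y) out
)

# Example with small 2D coordinate arrays
lons = np.array([[-2.5, -2.0], [-2.5, -2.0]])
lats = np.array([[53.0, 53.0], [53.5, 53.5]])
x_osgb, y_osgb = latlon_to_osgb.transform(lons, lats)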

peterdudfield commented 1 year ago

@dfulu is working on this I think in datapipes

peterdudfield commented 1 year ago

@devsjc you might find @dfulu has now sorted this, but not totally sure

dfulu commented 1 year ago

It may already be fixed without me intending to do so. But what would the coordinates look like without #27?

peterdudfield commented 1 year ago

Does the current PVNet version use lat/lon or OSGB?

dfulu commented 1 year ago

The PVNet model doesn't use either. It just uses the NWP image. But the pvnet datapipe inside ocf_datapipes is currently using OSGB.

I think there are some issues to sort out before we could drop the OSGB coordinates. At least we need to change the data loading function: https://github.com/openclimatefix/ocf_datapipes/blob/main/ocf_datapipes/load/nwp/providers/ukv.py

But also I was looking at the current production NWP data.

<xarray.Dataset>
Dimensions:    (variable: 5, init_time: 1, step: 43, y: 639, x: 455)
Coordinates:
  * init_time  (init_time) datetime64[ns] 2023-10-26T09:00:00
    latitude   (y, x) float64 ...
    longitude  (y, x) float64 ...
  * step       (step) timedelta64[ns] 00:00:00 01:00:00 ... 1 days 18:00:00
  * variable   (variable) object 'dswrf' 'lcc' 'sde' 't' 'wdir10'
  * x          (x) int64 -202708 -200701 -198695 ... 700567 702566 704564
  * y          (y) int64 1262937 1260928 1258919 ... -13442 -15450 -17458
Data variables:
    UKV        (variable, init_time, step, y, x) float32 ...

So the latitude and longitude are each 2-dimensional?

When I select ds_nwp.latitude.isel(y=0) I get

<xarray.DataArray 'latitude' (x: 455)>
array([60.788934, 60.791687, 60.79443 , ..., 61.058338, 61.05675 , 61.055152])
Coordinates:
    latitude   (x) float64 ...
    longitude  (x) float64 ...
  * x          (x) int64 -202708 -200701 -198695 ... 700567 702566 704564
    y          int64 1262937
Attributes:
    long_name:      latitude
    standard_name:  latitude
    units:          degrees_north

Which means the latitude varies a little bit depending on x-OSGB.

We currently don't have support in ocf_datapipes for two-dimensional coordinates. We always assume the coordinates are independent and form a regular grid. So it would actually be a decent chunk of work to switch to using lat-lons.
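
A quick way to check this assumption on the dataset above (a sketch; ds_nwp is the production dataset shown earlier):

# If lat/lon formed a regular, separable grid, latitude would depend on y only.
# This measures how much latitude varies along x at each y row:
lat_spread = ds_nwp.latitude.max(dim="x") - ds_nwp.latitude.min(dim="x")
print(float(lat_spread.max()))  # ~0 for a regular grid; nonzero here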

dfulu commented 1 year ago

It looks like the OSGB coordinates which we calculate might also each be two-dimensional, but we just ignore that?

https://github.com/openclimatefix/nwp-consumer/blob/main/src/nwp_consumer/internal/inputs/metoffice/client.py#L281-L290

peterdudfield commented 1 year ago

Yea, the lat/lons are a 2-dimensional grid. So the rough order of work would be

  1. Upgrade ocf_datapipes so it can take 2-D lat/lons
  2. Change PVNet to use the new ocf_datapipes and use lat/lon
  3. Check national_xg doesn't use OSGB
  4. Remove OSGB from nwp-consumer?

peterdudfield commented 1 year ago

Or perhaps just 1. Move the OSGB calculation into ocf_datapipes

dfulu commented 1 year ago

I made a plot of the OSGB coordinates we calculate in client.py#L281-L290 vs what they would be using the true 2D OSGB coordinates.

In the figure below, the red points are our current grid points, and the blue points are the points using 2D OSGB. There are some lines showing the offset between the two.

This means that in production we are accidentally offsetting our windows by 3-4 pixels. This is likely something we want to fix.

NB: this figure needs a download and zoom in to see the gridpoints and offsets properly

[figure: prod_current_vs_actual — current grid points (red) vs true 2D OSGB grid points (blue), with lines showing the offsets]
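
For reference, the offset can be computed roughly like this (a sketch; variable names are illustrative):

import numpy as np
import pyproj

latlon_to_osgb = pyproj.Transformer.from_crs(4326, 27700, always_xy=True)

# True OSGB positions of every grid point, from the 2D lat/lon coordinates
x_osgb_2d, y_osgb_2d = latlon_to_osgb.transform(
    ds_nwp.longitude.values, ds_nwp.latitude.values
)

# The 1D approximation currently stored on the dataset, broadcast to 2D
x_approx_2d, y_approx_2d = np.meshgrid(ds_nwp.x.values, ds_nwp.y.values)

# Maximum offset (in metres) between assumed and actual pixel positions
print(np.abs(x_osgb_2d - x_approx_2d).max(), np.abs(y_osgb_2d - y_approx_2d).max())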

peterdudfield commented 1 year ago

Nice, thanks for doing this. How do we go about fixing it?

dfulu commented 1 year ago

@jacobbieker had suggested we might regrid to the same grid used in our training data. That might be the easiest. It should also work if we can get datapipes working with 2D spatial grids and select directly from lat-lon, but our input images might be slightly distorted compared to the training data. So I don't know which would be best.

dfulu commented 1 year ago

Here's a closer up view of the offset

[screenshot: close-up view of the offset between current and true grid points]

peterdudfield commented 1 year ago

Can you do this re-gridding in ocf_datapipes?

dfulu commented 1 year ago

We'd have to write it in; datapipes doesn't have any regridding functionality yet. However, could it not make sense to do it within the consumer? It's likely that every model that uses the UKV data in production will have been trained on the UKV CEDA archive data, so trained using that grid. So each model would need to perform this regridding independently. I can't really think of any uses for the UKV data on the grid it comes in on.

peterdudfield commented 1 year ago

Sorry, maybe I'm confused: what is the regridding from and to?

dfulu commented 1 year ago

Ahh right, so @jacobbieker was suggesting, and I think I would agree, that we should regrid the data to the same grid as the training data. Currently, the production data is on a different grid from the training data. It's not just how we access the coordinates of the data: the data is centred on different spatial points. This also means that we have a slight spatial distortion between the training and production data.

In the figure below, the blue points are the actual (2D) grid of the production data. In black is the grid of the training data. I'm not sure how much the distortion will affect our results now (probably quite little), but if we moved to bigger image patch sizes it could.

[figure: prod_actual_vs_train — production grid (blue) vs training grid (black)]

peterdudfield commented 1 year ago

And what grid is the training data on? What grid is the live data on? Are they both lat/lons? Perhaps the grid doesn't have a name and it's just what the data comes in on.

I like the idea that the nwp-consumer is pretty clean and doesn't do that kind of stuff, or perhaps there's an option to turn this regridding on in the nwp-consumer.

Would we have to save the grid of the training data in a file, load this in the nwp-consumer, and then re-grid?

devsjc commented 1 year ago

So far as I'm aware, the live (Met Office) data arrives in lat/lon, and it is regridded to OSGB x/y since the training (CEDA) data is in OSGB.

dfulu commented 1 year ago

The training data is on a regularly-spaced OSGB coordinate grid, whereas the production data isn't on a regularly spaced OSGB or lat-lon grid. I suspect the production data is actually on some grid of its own.

I like the idea that the nwp-consumer is pretty clean and doesn't do that kind of stuff, or perhaps there's an option to turn this regridding on in the nwp-consumer.

Yeh that's fair enough. I guess another option is just to have some utility function which could be used in any current and future model apps.

I'm not sure it makes sense to do the regridding in datapipes. For regridding, we need to actually load the underlying data, not just the coordinates, in order to do the interpolation to the new grid. To find a value on the new grid via interpolation, surrounding spatial points must be loaded. I think this means that if we select an NWP window in datapipes, we will need to load the chunk containing the window of interest as well as all surrounding chunks. So if we try to do the regridding lazily, it's going to be a lot of repeated computation to do the regridding 317 times (once for each GSP), and it would add to the RAM footprint of each worker. And if we did it eagerly, regridding the whole dataset in RAM before slicing windows means each worker would also have a copy of the NWP in RAM; one copy of the dataset (data-national) is currently 240MB in RAM. So I think it makes more sense to do the regridding once, save it to disk, and then pull from that. We probably could put that regridding, disk caching, and reloading into datapipes, but maybe that's messy?

Would we have to save the grid of the training data in a file, load this in the nwp-consumer, and then re-grid?

Yeh I think so

dfulu commented 1 year ago

So far as I'm aware, the live (Met Office) data arrives in lat/lon, and it is regridded to OSGB x/y since the training (CEDA) data is in OSGB.

I think we're talking about different things when we say "regridding". I might be wrong, but I thought what we currently do is a translation of the coordinates from lat-lon to OSGB, so the underlying data values stay the same. "Regridding" the way I'm talking about it is to do an interpolation to different spatial points. Like in the figure above, we'd interpolate all the blue datapoints (the production grid) to the locations of the black ones (the training grid).
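
To make the distinction concrete (a sketch; x_osgb_approx and y_osgb_approx are hypothetical 1D arrays, and regridder is an interpolation object like the xesmf one further down):

# Coordinate translation: same data values, new labels on the axes
# (roughly what the current quick fix does).
ds_relabelled = ds_prod.assign_coords(
    x=("x", x_osgb_approx),
    y=("y", y_osgb_approx),
)

# Regridding: interpolate the data values onto new spatial points,
# so the underlying values change.
ds_regridded = regridder(ds_prod)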

jacobbieker commented 1 year ago

Yeah, I would agree with @dfulu on regridding the data once and resaving it to disk, or having it in the consumer, since one of the ideas of the consumer is that it keeps the prod and training data the same. According to this, the operational UKV is not on a regular grid at all: there is an inner 1.5km grid that slowly becomes less dense until it is a 4km grid at the edges, so that might be the cause of the mismatch.

peterdudfield commented 1 year ago

Ah ok, I think I agree. So perhaps the task is for the nwp-consumer to regrid the data to a regular OSGB grid; then it'll be the same as the training data.

devsjc commented 1 year ago

one of the ideas of the consumer is that it keeps the prod and training data the same

Indeed, and with this in mind, perhaps it makes sense to interpolate in the consumer. As long as we're happy with the zarrs it produces not being the data exactly as it came from the source! However, if we are interpolating, is OSGB the best thing to choose? What with us now consuming NWP data from India, Malta etc., perhaps we should set the standard now to lat/long for forward compatibility, and do a batch regrid of the training data?

peterdudfield commented 1 year ago

I think in general we should try to use lat/lon grids, as this is what other people do.

The Met Office is probably a special case, where the training data is on an OSGB grid, and therefore I think the nwp-consumer should also make it like this. Perhaps we can have an option to turn this on/off for the Met Office data. That means we could leave it on a lat/lon grid if we decide to use that in the future.

dfulu commented 1 year ago

As long as we're happy with the zarrs it produces not being the data exactly as it came from the source!

I think the training data must have been regridded at some point, so I think that's fine. Also, like Jacob said, for model performance the biggest concern should be that the training and production data are the same.

So I'd go towards making our production data look as close as possible to the training data rather than as close as possible to the source data.

devsjc commented 1 year ago

Do we save the Met Office data as we go into a rolling archive? If we're downloading it anyway for the production systems, we should push it to leonardo too; that way, going forward we'd have an archive of data exactly as it was in live, also in lat/long. Then we could do a batch interpolation of the old CEDA training data to match, and we'd be consistent going forward. I appreciate this would maybe mean retraining (although in live the data is lat/long anyway, so the model must be able to cope).

Seems to me that hinging our systems on legacy training data, which is incompatible both with the norm (as per Peter's comment) and with our ability to more easily expand into other parts of the globe, would artificially hold us back.

dfulu commented 1 year ago

Having a rolling archive would be great for tracking down any blips in the forecast!

peterdudfield commented 1 year ago

Yea, that could be a good task for Dagster to go and do:

  1. Try getting the raw data from S3, perhaps once a day
  2. If it's not there, then get it from CEDA

peterdudfield commented 1 year ago

Could we (just discussing this with @devsjc):

  1. Get live ocf_datapipes to load NWP using the lat/long 2-D grid (currently using calculated OSGB from nwp-consumer)
  2. Have training ocf_datapipes load using OSGB
  3. Add a lat/lon 2-D grid to the data from nwp-consumer, to make live and CEDA more similar (see the sketch below). If the data source has lat/lon already then we don't need to do this, e.g. ECMWF

When I say lat/lon 2-D grid I mean a lat/lon per point.

[screenshot: example of a 2-D lat/lon grid, one lat/lon pair per point]
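
Point 3 might look roughly like this in the consumer (a sketch; assumes a dataset ds with 1D OSGB x/y coordinates, as the CEDA data has):

import numpy as np
import pyproj

# OSGB (EPSG:27700) -> WGS84 lat/lon (EPSG:4326)
osgb_to_latlon = pyproj.Transformer.from_crs(27700, 4326, always_xy=True)

x_2d, y_2d = np.meshgrid(ds.x.values, ds.y.values)
lons_2d, lats_2d = osgb_to_latlon.transform(x_2d, y_2d)

# Attach a lat/lon pair to every grid point as 2D auxiliary coordinates
ds = ds.assign_coords(
    longitude=(("y", "x"), lons_2d),
    latitude=(("y", "x"), lats_2d),
)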

dfulu commented 1 year ago

I think we need to do the regridding step somewhere, not just transform the coords. Here's how we might do the regridding properly:

import numpy as np
import xarray as xr
import ocf_blosc2
import pyproj

# This is a commonly used regridding package for weather/climate data in xarray
import xesmf as xe

# 1. Load data
ds_train = xr.open_zarr("eg. UKV_intermediate_version_7.zarr")
ds_prod = xr.open_zarr("eg. latest.zarr")

# 2. Need to find the lat-lons of the training data for the regrid

OSGB2latlonTransformer = pyproj.Transformer.from_crs(
    crs_from=27700,
    crs_to=4326,
    always_xy=True,
)

x_osgb_2d_train, y_osgb_2d_train = np.meshgrid(ds_train.x.values, ds_train.y.values)
lons_2d_train, lats_2d_train = OSGB2latlonTransformer.transform(
    x_osgb_2d_train,
    y_osgb_2d_train,
)

# 3. These are the coords we are aiming for - we could save this out so we don't need to compute it every time
ds_target_coords = xr.Dataset(
    data_vars=dict(
        longitude=(["y", "x"], lons_2d_train),
        latitude=(["y", "x"], lats_2d_train),
    ),
    coords=dict(
        x=ds_train.x.values,
        y=ds_train.y.values,
    )
)

# 4. Eager loading probably better for this
ds_prod = ds_prod.compute()

# 5. Construct the regridding object - I think even this can be saved out
regridder = xe.Regridder(ds_prod, ds_target_coords, method="bilinear")

# 6. Run regridding
ds_prod_regridded = regridder(ds_prod).compute()

It takes about 10s on my GCP instance, but it's got a lot of cores.
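
As noted in step 3, the target grid could be persisted so it doesn't need recomputing on every run. A sketch (the file name is illustrative, not an agreed location):

# Save once
ds_target_coords.to_netcdf("ukv_training_grid.nc")

# Later, wherever the regridding runs
ds_target_coords = xr.open_dataset("ukv_training_grid.nc")
regridder = xe.Regridder(ds_prod, ds_target_coords, method="bilinear")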

dfulu commented 1 year ago

After regridding, the production NWP data (specifically temperature) looks like this:

[figure: regridded production NWP temperature field]

dfulu commented 1 year ago

In the image above, the white border is missing data, i.e. in production we have a smaller spatial area than we do in training. I didn't know this was the case actually, and it is something we'll need to think about if we want to use bigger NWP window sizes, especially around the Cornwall and East Anglia GSPs.

peterdudfield commented 1 year ago

In the image above, the white border is missing data, i.e. in production we have a smaller spatial area than we do in training. I didn't know this was the case actually, and it is something we'll need to think about if we want to use bigger NWP window sizes, especially around the Cornwall and East Anglia GSPs.

That's good to check. Perhaps you can run training with these bigger areas first, see the effect, and then we can think about what to do. It could be a case of using ECMWF a bit earlier, as we won't be restricted then.

dfulu commented 1 year ago

This shows all the patches we are currently using. So thankfully, at the minute we should be far enough from the boundaries that this doesn't matter. This is with a window size of 24 pixels. If we were to double to 48 (semi-planned at some point) I think we'd be at the cusp of needing to deal with this properly.

[figure: current NWP patches (window size 24) overlaid on the grid]

peterdudfield commented 1 year ago

Should we move this to a separate GitHub issue about making the NWP windows bigger?

dfulu commented 1 year ago

I think @devsjc and I have found the grid that the production data comes on. It is using a Lambert Azimuthal Equal Area projection, which I presume has been centred on the UK. Here's the link to the Met Office doc. So I think that means we will still want to reproject the production NWP data onto another grid to match the training data.
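
For reference, such a projection can be expressed with pyproj roughly as follows (a sketch; the lat_0/lon_0 values are assumptions for a UK-centred grid, not confirmed values from the Met Office doc):

import pyproj

# Hypothetical UK-centred Lambert Azimuthal Equal Area definition;
# lat_0/lon_0 here are illustrative guesses, not confirmed parameters.
laea = pyproj.CRS.from_proj4("+proj=laea +lat_0=54.9 +lon_0=-2.5 +ellps=WGS84")

# Transformer from the LAEA grid coordinates to WGS84 lat/lon
laea_to_latlon = pyproj.Transformer.from_crs(laea, 4326, always_xy=True)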

peterdudfield commented 1 year ago

@dfulu do you mind adding the code that you used on GCP, so @devsjc can try it very quickly?

The other option is the nwp-consumer saves the data how it is. Then PVNet does the regridding.

There's something quite nice about the nwp-consumer saving the data how it comes, and then if PVNet needs it in a certain way, it re-does it.

peterdudfield commented 1 year ago

So I think we found that the grid from the live consumer isn't quite the same as the CEDA grid. The picture from @dfulu above illustrates that well. It's a few pixels different. This means the ML model will see a slightly different image (note PVNet takes an image in, not a point).

So I think the options are

  1. re-grid the data in nwp-consumer
  2. re-grid the data in ocf-datapipes
  3. re-grid all the training data to the same as nwp-consumer

I think I'm tempted to go with option 2. Thoughts?

dfulu commented 1 year ago

  1. In regards to the consumer being simple and just saving the data however it comes: I see the appeal of the idea, but I also like @jacobbieker's point that the consumer should produce data which looks the same as our raw training data. Also, the form of the UKV data we receive could change in the future. What we get now is a Lambert Azimuthal Equal Area projection, but that's because of our standing order for the data. @devsjc and I were looking at the order, and found that it is possible to request the same data on a regularly spaced lat-lon grid. The spatial area we download could also change based on our data order. There is also nothing special about the Lambert Azimuthal Equal Area projection: the data has already been interpolated onto this grid, so we aren't receiving raw model outputs. We are also reformatting from grib2 to zarr. So with all those arbitrary choices already made, it might be a simpler guide for the consumer to produce data which is as close as possible to our training data.

  2. I think we would need to save the regridded data to disk and then load from it. Since datapipes is more focused on preparing batches I feel like this doesn't quite fit.

  3. This seems like a lot of computation and means we'd have another copy of the NWP data floating around. It is also dependent on our choices in that standing order of data from the Met Office.

  4. Another option could just be to put the regridding inside the app. This might be the simplest option for now. It means we don't have to tack the regridding and saving to disk onto datapipes, and could keep the consumer simpler. The app is also entirely production focused anyway, so it doesn't seem too messy to have this step in it. However, it means that we'd need to add the same code to national_xg etc. if we want them all to regrid.

I think my preference would be 1 or 4
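
If we did go with 4, the app-side step might be a small utility along these lines (a sketch; the function name and file path are hypothetical, reusing the xesmf approach from above):

import xarray as xr
import xesmf as xe

def regrid_to_training_grid(ds_prod: xr.Dataset, target_grid_path: str) -> xr.Dataset:
    """Interpolate production NWP data onto the saved training grid.

    target_grid_path points at a persisted copy of the training grid
    coordinates (see the earlier snippet); the location is illustrative.
    """
    ds_target = xr.open_dataset(target_grid_path)
    regridder = xe.Regridder(ds_prod, ds_target, method="bilinear")
    return regridder(ds_prod.compute())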

peterdudfield commented 1 year ago

Thanks @dfulu for this. Maybe we should try it in 4 and see how it works; otherwise we could move it to 1.