openclimatefix / metoffice_ec2

Subset Met Office MOGREPS-UK and UKV on AWS EC2
MIT License

One xarray dataset per NWP field #17

Open tomwhite opened 4 years ago

tomwhite commented 4 years ago

Currently a new xarray dataset (stored as Zarr) is created for every new update. Since xarray doesn't support reading multiple Zarr files at once, it would be better to append updates to a single xarray dataset along the time dimension. This would also allow us to use larger chunk sizes (generally better).

Then we would have a single dataset for each NWP field: wind speed, wind direction, irradiance, etc.

I did a quick experiment and the code would look a bit like this:

import xarray as xr

# Each update would be a single chunk of around ~4MB.
# Bigger chunks might be better still; we could get them by grouping
# multiple time coordinates into one chunk.
chunking = {
    "projection_x_coordinate": 553,
    "projection_y_coordinate": 706,
    "realization": 3,
    "height": 4,
}

dataset1 = xr.open_zarr("data/mogreps/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T07.zarr.zip")
dataset1 = dataset1.expand_dims("time")  # promote the scalar "time" coordinate to a length-1 dimension
dataset1 = dataset1.chunk(chunking)

dataset2 = xr.open_zarr("data/mogreps/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T08.zarr.zip")
dataset2 = dataset2.expand_dims("time")  # promote the scalar "time" coordinate to a length-1 dimension
dataset2 = dataset2.chunk(chunking)

# Create a new file
dataset1.to_zarr("data/mogreps/combined.zarr", consolidated=True)
# Append new data to the existing file
dataset2.to_zarr("data/mogreps/combined.zarr", consolidated=True, append_dim="time")

Thoughts?

JackKelly commented 4 years ago

Definitely agree this is a good idea!

How would you like the data pipeline to look for implementing this on AWS given that the Met Office files arrive out-of-order?

Should we have two micro-services: one which just dumps Met Office data to S3 in whatever order it arrives; and a second service which concatenates those files (and deletes the originals)? We could save a few quid by storing the temporary files on the VM's local disk; but then we'd have to do some work to make that resilient to the VM failing. Or is there a more elegant solution? :)
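A minimal sketch of the ordering step such a concatenator service would need, assuming the filename convention shown earlier in the thread (`MOGREPS-UK__<field>__<init time>__<forecast time>.zarr.zip`, with the first timestamp being the init time); `times_from_key` is a hypothetical helper, not part of this repo:

```python
from datetime import datetime

# Hypothetical helper for the "concatenator" service: recover the two
# timestamps from a store name such as
# 'MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T07.zarr.zip',
# assuming the first is the init time and the second the forecast time.
def times_from_key(key: str) -> tuple:
    name = key.rsplit("/", 1)[-1]  # drop any S3 prefix
    parts = name.split("__")
    fmt = "%Y-%m-%dT%H"
    init_time = datetime.strptime(parts[2], fmt)
    forecast_time = datetime.strptime(parts[3].split(".")[0], fmt)
    return init_time, forecast_time

# Files may arrive out of order, so sort before appending along "time".
keys = [
    "data/mogreps/MOGREPS-UK__wind_from_direction__2020-03-15T16__2020-03-16T08.zarr.zip",
    "data/mogreps/MOGREPS-UK__wind_from_direction__2020-03-15T15__2020-03-16T07.zarr.zip",
]
ordered = sorted(keys, key=times_from_key)
```

Whatever dumps the raw files to S3 can then stay trivially simple, and all the ordering logic lives in the second service.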

For reference, @tv3141's notebook also supports the idea of using a single dataset per NWP field because it:

JackKelly commented 4 years ago

Sorry, one more quick thought: Should we have a single Zarr store per NWP field and per NWP run? e.g. we'd have a Zarr store for, say, irradiance for the NWP initialised at 2020-01-01T00; and another Zarr store for irradiance for the NWP initialised at 2020-01-01T01, etc. Does that sound right?

tomwhite commented 4 years ago

On the out-of-order problem, your solution sounds like a good one. Do you know what the maximum lag is? If it's long enough to impact the timeliness of predictions, then predictions could be run from the out-of-order individual files, and the ordered file could be created later for the purpose of training models.

Should we have a single Zarr store per NWP field and per NWP run? e.g. we'd have a Zarr store for, say, irradiance for the NWP initialised at 2020-01-01T00; and another Zarr store for irradiance for the NWP initialised at 2020-01-01T01, etc. Does that sound right?

No, I think there should be a single Zarr store, chunked along the time dimension. Then if you wanted a certain time range you would only have to load the relevant chunks. If you had one store per time, then you would have to open multiple stores to analyse a bigger time range, and the problem with this is that xarray doesn't have good support for it (afaict).
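The chunk arithmetic behind this point can be sketched as follows; `chunks_for_range` is a hypothetical illustration, assuming hourly time steps and a fixed chunk size along the time dimension:

```python
def chunks_for_range(start_index: int, stop_index: int, chunk_size: int) -> range:
    """Indices of the time chunks overlapping [start_index, stop_index] (inclusive).

    With a single store chunked along time, a reader only touches these
    chunks; with one store per time step it would open every store in the
    range instead.
    """
    return range(start_index // chunk_size, stop_index // chunk_size + 1)

# e.g. with 24 hourly steps per chunk, hours 30..80 touch only chunks 1, 2 and 3
touched = list(chunks_for_range(30, 80, 24))
```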

JackKelly commented 4 years ago

Do you know what the maximum lag is?

I'm afraid I don't know for sure! I would guess the lag is "pretty small" (tens of seconds?!?) But I'm not really sure!

Regarding multiple Zarr stores for different init times: one of the things that makes talking about NWPs confusing is that there are two time dimensions we care about: the initialisation time (the time the Met Office started computing the forecasts) and the forecast time (the time the forecast is about). In the case of MOGREPS, the Met Office run 3 ensemble members (aka 'realisations') concurrently every hour, and each run provides a forecast for the next 5 days.

To give a concrete example:

At 2020-01-01T00 the Met Office computed 3 ensemble members, each of which provides forecasts for the range [2020-01-01T00, 2020-01-05T23] at hourly intervals.

Then, an hour later, at 2020-01-01T01, the Met Office kicked off another 3 ensemble members, providing forecasts for the range [2020-01-01T01, 2020-01-06T00].

(Actually, to be pedantic, there are three time dimensions we care about: 1) init time, 2) forecast time, and 3) the time the NWPs become available to our code, which is at least 24 hours later while we're using the free NWPs on AWS!)
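The init-time/forecast-time picture above can be sketched like this (a hypothetical illustration; the 120 hourly steps come from the 5-day hourly range described in the example):

```python
from datetime import datetime, timedelta

def valid_times(init_time: datetime, n_steps: int = 120):
    """Forecast (valid) times for one NWP run: hourly steps starting at
    the init time, assuming a 5-day hourly forecast horizon."""
    return [init_time + timedelta(hours=h) for h in range(n_steps)]

# Run initialised at 2020-01-01T00 covers [2020-01-01T00, 2020-01-05T23];
# the run an hour later covers [2020-01-01T01, 2020-01-06T00].
run_00 = valid_times(datetime(2020, 1, 1, 0))
run_01 = valid_times(datetime(2020, 1, 1, 1))
```

Successive runs therefore overlap almost entirely, offset by one hour, which is why it matters which of the two time dimensions a store is organised along.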

I definitely agree that, for a given NWP field and a given init time, all forecast times should go into a single Zarr store. But should we have a separate Zarr store for each init time?

tomwhite commented 4 years ago

That makes a lot of sense, thanks for the clear explanation.

So yes, perhaps the simplest thing is to have a separate Zarr store for each init time. Larger consolidated datasets can be built later as needed.
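A store layout along those lines might look like the following; `store_key` and the `data/mogreps/` prefix are hypothetical, just to make the "one store per field per init time" idea concrete:

```python
from datetime import datetime

def store_key(field: str, init_time: datetime) -> str:
    """Hypothetical layout: one Zarr store per NWP field and init time."""
    return f"data/mogreps/{field}/{init_time:%Y-%m-%dT%H}.zarr"

key = store_key("irradiance", datetime(2020, 1, 1, 0))
# e.g. 'data/mogreps/irradiance/2020-01-01T00.zarr'
```

Larger consolidated datasets (one per field, appended along init time) could then be built from these stores as a later batch step.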