Consider pulling real data from cloud in tutorials with zarr

nci / scores

Metrics for the verification, evaluation and optimisation of forecasts, predictions or models.

https://scores.readthedocs.io/

Apache License 2.0


Consider pulling real data from cloud in tutorials with zarr #662

Open nicholasloveday opened 1 month ago

nicholasloveday commented 1 month ago

Some of the tutorials use real data. These don't work as well in Binder, because you need to download the data and save it to disk.

An alternative approach would be to pull the data from the cloud into memory.

E.g.,

import pandas as pd
import xarray as xr
hres = xr.open_zarr('gs://weatherbench2/datasets/hres/2016-2022-0012-1440x721.zarr')
hres["2m_temperature"].sel(time="2020-01-01T00:00:00", prediction_timedelta=pd.Timedelta("1 days")).plot()

takes 1.5 seconds to pull the ECMWF forecast down from the cloud and plot it.

The dependencies that would need to be added for the tutorials are zarr and gcsfs.

Reanalysis data and data-driven models can also be pulled down from the cloud (see https://weatherbench2.readthedocs.io/en/latest/data-guide.html).

You can get data on the same grid so it makes verification with scores super easy!

Something to discuss

nicholasloveday commented 1 month ago

Okay, it looks like establishing the initial connection to the Google storage is the slow part (~20s). After that, pulling data down is fast.
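A minimal sketch of how the two costs could be measured separately, assuming a small hypothetical helper called `timed` (not part of scores or xarray). The stand-in function `fake_open` exists only so the snippet runs offline; it is not a real data source.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(label: str, func: Callable[[], T]) -> Tuple[T, float]:
    """Run func once, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f} s")
    return result, elapsed


def fake_open() -> dict:
    """Hypothetical stand-in for opening a remote store (sleeps briefly)."""
    time.sleep(0.05)
    return {"2m_temperature": [1.0, 2.0, 3.0]}


# Time the "open" step and the "read" step separately
store, open_seconds = timed("open store", fake_open)
values, read_seconds = timed("read variable", lambda: store["2m_temperature"])
```

In a notebook, `fake_open` would be swapped for something like `lambda: xr.open_zarr(...)`, and the read step for a `.sel(...).compute()` call. Since `xr.open_zarr` is lazy, the first timing would cover the connection and metadata, and the second the actual data transfer.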

nicholasloveday commented 1 month ago

Here's an example of verifying GraphCast against ERA5 with scores:

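import pandas as pd
import xarray as xr
from scores.continuous import mse

# Forecast: GraphCast 2 m temperature, initialised 2020-01-01 00 UTC, 5-day lead time
graphcast = xr.open_zarr(
    "gs://weatherbench2/datasets/graphcast/2020/date_range_2019-11-16_2021-02-01_12_hours_derived.zarr"
)
fcst = graphcast["2m_temperature"].sel(
    time="2020-01-01T00:00:00", prediction_timedelta=pd.Timedelta("5 days")
)

# Reference: the 0-day lead time from the ERA5 dataset, taken at the forecast's
# valid time (initialisation time + 5 days) so both fields describe the same moment
era5 = xr.open_zarr("gs://weatherbench2/datasets/era5-forecasts/2020-1440x721.zarr")
obs = era5["2m_temperature"].sel(
    time="2020-01-06T00:00:00", prediction_timedelta=pd.Timedelta("0 days")
)

# Load both fields into memory
fcst = fcst.compute()
obs = obs.compute()

# Rename the ERA5 dimensions to match the forecast's dimension names
obs = obs.rename({"latitude": "lat", "longitude": "lon"})

# preserve_dims="all" keeps every dimension, giving the squared error at each grid point
result = mse(fcst, obs, preserve_dims="all")
result.plot(vmax=100)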

which also takes ~1.5 seconds once the initial connection to the cloud storage has been made.

tennlee commented 1 month ago

If that works well in binder, that's good to know. I'd like some testing to be done before we proceed, and it might be a few weeks before I could do this myself. I'd be happy to see a new notebook created on a branch which we can then develop and test. It might be nice to have an ML-focused notebook which goes into some new areas, possibly looking at evaluation more than the use of individual scores. Thanks very much for putting your example together.

nicholasloveday commented 1 month ago

Yes - I agree that we need to test this in binder. It worked well on my laptop, so it may be an improvement for people running it locally on their computer.

An ML-focused notebook sounds like a great idea. A scores + weatherbench2 tutorial would be quite nice.

tennlee commented 1 month ago

I've started a branch for this on my fork, based on your sample code. I won't be able to do this very quickly, so if you want to do something more quickly, feel free. It required the packages "zarr" and "gcsfs" to be installed. Zarr is a common format to want, so that's fine to add to the tutorial requirements, and I think there will be enough interest in these datasets to justify adding gcsfs to the tutorial requirements also. Another option would be to simply document those requirements in the notebook itself. But if the goal is to have this work nicely in binder, it's probably more reliable to add them to the requirements. If you'd like to, I'm happy to add you to my fork as well and you can push directly to the feature branch if you want to.
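If the requirements end up being documented in the notebook rather than added to the tutorial requirements, a first cell could check for them and print an install hint. This is only a sketch: the `missing_packages` helper and the wording of the messages are hypothetical, and it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util

# Packages needed to read the cloud-hosted zarr datasets
REQUIRED = ("zarr", "gcsfs")


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    # Tell the reader what to install instead of failing later with an ImportError
    print(
        "Missing packages: " + ", ".join(missing)
        + ". Try: pip install " + " ".join(missing)
    )
else:
    print("All cloud-data dependencies are available.")
```

This keeps the notebook self-explanatory for people running it locally, though it does not replace adding the packages to the requirements if the goal is for Binder to work without manual steps.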