Open nicholasloveday opened 1 month ago
Okay, it looks like establishing the initial connection to Google Cloud Storage is slow (~20s), but pulling the data down after that is fast.
Here's an example of verifying GraphCast against ERA5 with scores:
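import pandas as pd
import xarray as xr
from scores.continuous import mse

# Open the GraphCast forecasts lazily from cloud storage (needs zarr and gcsfs)
graphcast = xr.open_zarr(
    "gs://weatherbench2/datasets/graphcast/2020/date_range_2019-11-16_2021-02-01_12_hours_derived.zarr"
)
# 2m temperature forecast from the 2020-01-01 00:00 run at a 5-day lead time
fcst = graphcast["2m_temperature"].sel(
    time="2020-01-01T00:00:00", prediction_timedelta=pd.Timedelta("5 days")
)

# Open the ERA5 dataset and take the zero-lead-time field to use as observations
era5 = xr.open_zarr("gs://weatherbench2/datasets/era5-forecasts/2020-1440x721.zarr")
obs = era5["2m_temperature"].sel(
    time="2020-01-01T00:00:00", prediction_timedelta=pd.Timedelta("0 days")
)

# Pull both fields down from the cloud into memory
fcst = fcst.compute()
obs = obs.compute()

# Rename the ERA5 coordinates so they line up with the forecast's lat/lon
obs = obs.rename({"latitude": "lat", "longitude": "lon"})

# Squared error at every grid point (no dimensions are averaged out), then plot
result = mse(fcst, obs, preserve_dims="all")
result.plot(vmax=100)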
This also takes ~1.5 seconds once the initial connection to the cloud storage has been made.
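To check where the time goes, the connection step and the data pull can be timed separately. Below is a minimal sketch using only the standard library. `timed` and `time_phases` are hypothetical helper names (not part of scores or xarray), and the commented usage assumes the same store and selection as the example above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def time_phases(open_store, load_data):
    """Time the connection step and the data pull separately.

    open_store: hypothetical zero-argument callable that opens the remote store.
    load_data: hypothetical callable that takes the opened store and returns
        the data loaded into memory.
    """
    timings = {}
    with timed("connect", timings):
        store = open_store()
    with timed("load", timings):
        data = load_data(store)
    return data, timings


# Hypothetical usage (needs xarray, pandas, zarr, gcsfs and network access):
# data, timings = time_phases(
#     lambda: xr.open_zarr("gs://weatherbench2/datasets/era5-forecasts/2020-1440x721.zarr"),
#     lambda ds: ds["2m_temperature"].sel(
#         time="2020-01-01T00:00:00", prediction_timedelta=pd.Timedelta("0 days")
#     ).compute(),
# )
# print(timings)
```

Running this in binder as well as locally would show whether the slow part there is also the initial connection rather than the transfer.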
If that works well in binder, that's good to know. I'd like some testing to be done before we proceed, and it might be a few weeks before I could do this myself. I'd be happy to see a new notebook created on a branch which we can then develop and test. It might be nice to have an ML-focused notebook which goes into some new areas, possibly looking at evaluation more than the use of individual scores. Thanks very much for putting your example together.
Yes - I agree that we need to test this in binder. It worked well on my laptop, so it may be an improvement for people running it locally on their computer.
An ML-focused notebook sounds like a great idea. A scores + weatherbench2 tutorial would be quite nice.
I've started a branch for this on my fork, based on your sample code. I won't be able to do this very quickly, so if you want to do something more quickly, feel free. It required the packages "zarr" and "gcsfs" to be installed. Zarr is a common format to want, so that's fine to add to the tutorial requirements, and I think there will be enough interest in these datasets to justify adding gcsfs to the tutorial requirements also. Another option would be to simply document those requirements in the notebook itself. But if the goal is to have this work nicely in binder, it's probably more reliable to add them to the requirements. If you'd like to, I'm happy to add you to my fork as well and you can push directly to the feature branch if you want to.
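If the requirements end up being documented in the notebook rather than added to the tutorial requirements, a first cell could check for them and fail with a clear message. This is only a minimal sketch: `missing_packages` and `require_packages` are hypothetical helpers, not part of scores.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_packages(names):
    """Raise ImportError with an install hint if any package is missing."""
    missing = missing_packages(names)
    if missing:
        raise ImportError(
            "This notebook needs extra packages: "
            + ", ".join(missing)
            + ". Try: pip install "
            + " ".join(missing)
        )


# Hypothetical first notebook cell:
# require_packages(["zarr", "gcsfs"])
```

Adding the packages to the requirements is still the more reliable route for binder; a check like this only helps people running the notebook locally.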
Some of the tutorials use real data. These don't work as well in binder as you need to download the data and save it to disk.
An alternative approach would be to pull the data from the cloud into memory.
E.g., pulling the ECMWF forecast down from the cloud and plotting it takes 1.5 seconds.
The dependencies that would need to be added for the tutorials are zarr and gcsfs.
There are also reanalysis data and data-driven models that can be pulled down from the cloud (see https://weatherbench2.readthedocs.io/en/latest/data-guide.html).
You can get data on the same grid so it makes verification with scores super easy!
Something to discuss