pangeo-data / distributed-array-examples


Climatological anomalies #4

Open dcherian opened 1 year ago

dcherian commented 1 year ago

Calculate the anomaly with respect to the group mean, a very common operation.

This example uses ERA5 data.

from intake import open_catalog

cat = open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/atmosphere.yaml")

ds = cat['era5_hourly_reanalysis_single_levels_sa'].to_dask()
ds
# with flox, this mean calculation should be straightforward
import flox.xarray  # required

# ideally we would use the default method="cohorts", but that is not well optimized at the moment
mean = ds.groupby("time.dayofyear").mean(method="map-reduce")
# A user would typically write ds.groupby("time.dayofyear") - mean,
# but the explicit indexing below is what Xarray does under the hood,
# and it is clearer about what's happening.
# The chunking to 1 *should* make this work better below, but I wouldn't expect the average user to do it.
anomaly = ds - mean.sel(dayofyear=ds.time.dt.dayofyear)
anomaly
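
For readers without access to the ERA5 catalog, here is a minimal sketch of the same pattern on a small synthetic dataset. The variable name `t2m`, the dimension `x`, and the date range are placeholders chosen for illustration, not taken from the ERA5 store. It checks that the explicit indexing form and the groupby shorthand give the same anomaly.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for the ERA5 dataset (assumption: names and sizes are
# illustrative only): two non-leap years of daily values at 3 points.
time = pd.date_range("2001-01-01", "2002-12-31", freq="D")
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"t2m": (("time", "x"), rng.normal(size=(time.size, 3)))},
    coords={"time": time, "x": [0, 1, 2]},
)

# Group mean for each day of the year.
mean = ds.groupby("time.dayofyear").mean()

# Explicit form: pick the matching day-of-year mean for every time step,
# then subtract it from the original data.
anomaly = ds - mean.sel(dayofyear=ds.time.dt.dayofyear)

# Shorthand that most users would write.
anomaly_groupby = ds.groupby("time.dayofyear") - mean

# The two forms should agree.
same = np.allclose(
    anomaly["t2m"].transpose("time", "x").values,
    anomaly_groupby["t2m"].transpose("time", "x").values,
)
```

On a dask-backed dataset the same expressions build a lazy graph instead of computing eagerly; the synthetic example above does not exercise the chunking behavior discussed in this issue.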
mrocklin commented 1 year ago

Do you happen to have something like this living in AWS?

dcherian commented 1 year ago

https://registry.opendata.aws/ecmwf-era5/ But I can't tell what the Zarr chunking is. I don't have access to an AWS cluster anymore...

dcherian commented 1 year ago

I should also say, we can reduce complexity here by just working with a few variables, not all 19 or so.
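
A minimal sketch of what that could look like, assuming the subset is taken by indexing the Dataset with a list of names before the groupby. The variable names (`t2m`, `sp`, `u10`, `v10`) and the synthetic data are placeholders, not a claim about what the catalog entry contains.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic dataset with several variables (names are hypothetical).
time = pd.date_range("2001-01-01", periods=730, freq="D")
shape = (time.size, 3)
full = xr.Dataset(
    {name: (("time", "x"), np.zeros(shape)) for name in ["t2m", "sp", "u10", "v10"]},
    coords={"time": time},
)

# Keep only the variables of interest; indexing with a list returns a Dataset.
subset = full[["t2m", "sp"]]

# The anomaly calculation is unchanged, but now touches fewer arrays.
mean = subset.groupby("time.dayofyear").mean()
anomaly = subset.groupby("time.dayofyear") - mean
```

With dask-backed data, fewer variables means a proportionally smaller task graph, which makes the example easier to profile.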

mrocklin commented 1 year ago

I don't have access to an AWS cluster anymore...

Do you want access to an AWS cluster? We've gotten pretty good at providing those ...