pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev

Example on using `preprocess` with `mfdataset` #2313


dcherian commented 6 years ago

I wrote this little notebook today while trying to get some satellite data into a form that was nice to work with: https://gist.github.com/dcherian/66269bc2b36c2bc427897590d08472d7

I think it would make a useful example for the docs.

A few questions:

  1. Do you think it'd be a good addition to the examples?
  2. Is this the recommended way of adding meaningful coordinates, expanding dims, etc.? The main bit is this function:

    def preprocess(ds):
        # give the phony HDF5 dimensions meaningful coordinate values
        dsnew = ds.copy()
        dsnew['latitude'] = xr.DataArray(np.linspace(90, -90, 180),
                                         dims=['phony_dim_0'])
        dsnew['longitude'] = xr.DataArray(np.linspace(-180, 180, 360),
                                          dims=['phony_dim_1'])
        dsnew = (dsnew.rename({'l3m_data': 'sss',
                               'phony_dim_0': 'latitude',
                               'phony_dim_1': 'longitude'})
                 .set_coords(['latitude', 'longitude'])
                 .drop('palette'))

        # stamp each file with a single time value (coverage start + 3.5 days,
        # the midpoint of a 7-day window) and make it a dimension
        dsnew['time'] = (pd.to_datetime(dsnew.attrs['time_coverage_start'])
                         + np.timedelta64(3, 'D') + np.timedelta64(12, 'h'))
        dsnew = dsnew.expand_dims('time').set_coords('time')

        return dsnew

Also open to other feedback...
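
For reference, the function is then handed to open_mfdataset roughly as below; this is a sketch rather than the notebook verbatim, and the glob pattern is hypothetical:

import glob

import numpy as np
import pandas as pd
import xarray as xr

# hypothetical local copies of the satellite files
files = sorted(glob.glob('sss/*.h5'))
ds = xr.open_mfdataset(files, preprocess=preprocess,
                       combine='nested', concat_dim='time')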

fujiisoup commented 6 years ago

There is a related question on SO.

I think it is a good idea to add an example to our docs, so agreed on question 1. I have not yet examined question 2, but simple examples are generally nice.

raybellwaves commented 3 years ago

Edit: copied and pasted from a duplicate issue I opened. Closing that and moving the conversation here.

@jhamman's SO answer circa 2018 helped me this week https://stackoverflow.com/a/51714004/6046019

I wonder if it's worth providing an example (not sure where) of how to use preprocess with open_mfdataset?

Add an Examples entry to the docstring? (http://xarray.pydata.org/en/latest/generated/xarray.open_mfdataset.html / https://github.com/pydata/xarray/blob/5296ed18272a856d478fbbb3d3253205508d1c2d/xarray/backends/api.py#L895)

While not a small example (the remote files are large), this is how I used it:

import xarray as xr
import s3fs

def preprocess(ds):
    # each file holds a scalar time coordinate; promote it to a
    # dimension so the files can be concatenated along it
    return ds.expand_dims('time')

fs = s3fs.S3FileSystem(anon=True)
f1 = fs.open('s3://fmi-opendata-rcrhirlam-surface-grib/2021/02/03/00/numerical-hirlam74-forecast-MaximumWind-20210203T000000Z.grb2')
f2 = fs.open('s3://fmi-opendata-rcrhirlam-surface-grib/2021/02/03/06/numerical-hirlam74-forecast-MaximumWind-20210203T060000Z.grb2')

ds = xr.open_mfdataset([f1, f2], engine="cfgrib", preprocess=preprocess, parallel=True)

with one file looking like:

xr.open_dataset("LOCAL_numerical-hirlam74-forecast-MaximumWind-20210203T000000Z.grb2", engine="cfgrib")
<xarray.Dataset>
Dimensions:            (latitude: 947, longitude: 5294, step: 55)
Coordinates:
    time               datetime64[ns] ...
  * step               (step) timedelta64[ns] 01:00:00 ... 2 days 07:00:00
    heightAboveGround  int64 ...
  * latitude           (latitude) float64 25.65 25.72 25.78 ... 89.86 89.93 90.0
  * longitude          (longitude) float64 -180.0 -179.9 -179.9 ... 179.9 180.0
    valid_time         (step) datetime64[ns] ...
Data variables:
    fg10               (step, latitude, longitude) float32 ...
Attributes:
    GRIB_edition:            2
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2021-02-12T18:06:52 GRIB to CDM+CF via cfgrib-0....

A smaller example could be the following (WIP; note I was hoping ds would concatenate along t, but it doesn't do what I expect):

import numpy as np
import xarray as xr

f1 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f1")
f1 = f1.assign_coords(t=0)
f1.to_dataset().to_zarr("f1.zarr")  # What's the best way to store small files to reopen with open_mfdataset? CSV via xarray objects? Can you use open_mfdataset on pickled objects?

f2 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f2")
f2 = f2.assign_coords(t=1)
f2.to_dataset().to_zarr("f2.zarr")

# Concat along t
def preprocess(ds):
    return ds.expand_dims('t')
ds = xr.open_mfdataset(["f1.zarr", "f2.zarr"], engine="zarr", concat_dim="t", preprocess=preprocess)
>>> ds
<xarray.Dataset>
Dimensions:  (a: 2, t: 1)
Coordinates:
  * t        (t) int64 0
  * a        (a) int64 0 1
Data variables:
    f1       (t, a) int64 dask.array<chunksize=(1, 2), meta=np.ndarray>
    f2       (t, a) int64 dask.array<chunksize=(1, 2), meta=np.ndarray>
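
One thing worth checking: recent xarray only honors concat_dim together with combine="nested" (and raises an error otherwise), so presumably the call needs to be:

def preprocess(ds):
    return ds.expand_dims('t')

# combine="nested" concatenates the files in the order given, along concat_dim
ds = xr.open_mfdataset(["f1.zarr", "f2.zarr"], engine="zarr",
                       combine="nested", concat_dim="t", preprocess=preprocess)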

chuaxr commented 2 years ago

Seconding @dcherian's comment in #4901 on an example for .encoding['source']. Working off @raybellwaves' example, something like this would have been useful to me:

>>> import xarray as xr
>>> import numpy as np
>>> model1 = xr.DataArray(np.arange(2), coords=[np.arange(2)], name="f")
>>> model1.to_dataset().to_netcdf("model1.nc")
>>> model2 = xr.DataArray(np.arange(2), coords=[np.arange(2)], name="f")
>>> model2.to_dataset().to_netcdf("model2.nc")
>>> ds = xr.open_mfdataset(
...     ["model1.nc", "model2.nc"],
...     preprocess=lambda ds: ds.expand_dims(
...         {"model_name": [ds.encoding["source"].split("/")[-1].split(".")[0]]}
...     ),
... )
>>> ds
<xarray.Dataset>
Dimensions:     (dim_0: 2, model_name: 2)
Coordinates:
  * dim_0       (dim_0) int64 0 1
  * model_name  (model_name) object 'model1' 'model2'
Data variables:
    f           (model_name, dim_0) int64 dask.array<chunksize=(1, 2), meta=np.ndarray>
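
(For context: xarray records the path each file was opened from in ds.encoding["source"], which is what makes deriving a label such as model_name from the filename possible.)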

On that note, the example above seems to work with some slight changes:

>>> import numpy as np
>>> import xarray as xr
>>> 
>>> f1 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f1")
>>> f1 = f1.assign_coords(t='t0')
>>> f1.to_dataset().to_netcdf("f1.nc")
>>> 
>>> f2 = xr.DataArray(np.arange(2), coords=[np.arange(2)], dims=["a"], name="f2")
>>> f2 = f2.assign_coords(t='t1')
>>> f2.to_dataset().to_netcdf("f2.nc")
>>> 
>>> # Concat along t
>>> def preprocess(ds):
...     return ds.expand_dims("t")
... 
>>> 
>>> ds = xr.open_mfdataset(["f1.nc", "f2.nc"], concat_dim="t", preprocess=preprocess)
>>> ds
<xarray.Dataset>
Dimensions:  (a: 2, t: 2)
Coordinates:
  * t        (t) object 't0' 't1'
  * a        (a) int64 0 1
Data variables:
    f1       (t, a) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
    f2       (t, a) float64 dask.array<chunksize=(2, 2), meta=np.ndarray>
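
Note the float64 dtype in the result: f1 and f2 each exist in only one file, so the concatenation along t fills the missing slots with NaN, which upcasts the original int64 values.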

jibcar commented 2 years ago

Hello:

I have to find the maximum precipitation of each year (for example, 2007 and 2008; dataset links: 2007 and 2008). I have done this using the resample method (i.e. .resample(time='Y').max()) after concatenating along the time dimension.

Following along on SO, I am wondering if I can use preprocess to find the maximum (or minimum, or average) for each file first and then concatenate along the time dimension. I tried the following code and was not successful. Can someone help me with this?


import numpy as np
import xarray as xr

from dask.distributed import Client
client = Client()
client

def preprocess_func(ds):
    '''Get the maximum (or minimum or average) from each file, to be concatenated along time.'''
    return ds.precip.max('time')

# prec_2007 and prec_2008 are the paths to the two yearly files
prec_ds = xr.open_mfdataset([prec_2007, prec_2008],
                            chunks={"lat": 25, "lon": 25, "time": -1},
                            preprocess=preprocess_func,
                            concat_dim='time')

dcherian commented 2 years ago

I bet you need `ds.precip.max("time", keepdims=True)` so that the returned value retains the time dimension. Please ask usage questions at https://github.com/pydata/xarray/discussions next time.
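
A minimal sketch of that suggestion (assuming prec_2007 and prec_2008 are the paths to the yearly files, and converting the per-file result back to a Dataset so the files combine cleanly):

import xarray as xr

def preprocess_func(ds):
    # keepdims=True leaves a size-1 time dimension behind, so the
    # per-file maxima can still be concatenated along time
    return ds.precip.max('time', keepdims=True).to_dataset()

prec_ds = xr.open_mfdataset([prec_2007, prec_2008],
                            preprocess=preprocess_func,
                            combine='nested', concat_dim='time')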

husainridwan commented 1 year ago

I'd like to work on this @TomNicholas. Where do I start?