andersy005 opened this issue 1 year ago
I wanted to make a note that the timings and screenshots above were obtained while running the zarr-proxy via AWS Lambda.
The problem is that a single chunk shape header is being provided to the entire group.
I see two high-level ways of resolving this:
If we just never attempt to open groups, we don't have this problem. The sequence looks like this:
This is not compatible with how we tend to use Xarray, Zarr, and fsspec from Python: there we open the whole group, so we can't specialize the headers for different arrays. But it would work fine in plain Zarr, and it may be feasible from JavaScript land.
Is Xarray support required here?
We could scope the header to specify different chunks for different objects. Instead of
{"chunks": "10,10"}
what about
{
  "chunks": {
    "storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr/bed": "10,10"
  }
}
The steps to set up reading would be as follows. The first two are the same as above.
The tricky bit here is aligning the paths specified in the header with the paths specified in the URL. But this method should also work with Xarray.
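To make that alignment concrete, here is a hedged sketch (the helper name and parsing are illustrative, not the actual proxy code) of matching a fully-qualified header key against the store path and array name taken from the request URL:

```python
import json

def chunks_for_array(chunks_header: str, store_path: str, array_name: str):
    """Look up the requested chunk shape for one array.

    `chunks_header` is the JSON header value proposed above; keys are
    full store paths. Returns a tuple of ints, or None if no override
    was requested for this array.
    """
    mapping = json.loads(chunks_header).get("chunks", {})
    full_key = f"{store_path}/{array_name}"
    spec = mapping.get(full_key)
    if spec is None:
        return None
    return tuple(int(c) for c in spec.split(","))

header = json.dumps({
    "chunks": {
        "storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr/bed": "10,10"
    }
})
store = "storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr"
print(chunks_for_array(header, store, "bed"))  # (10, 10)
```

Arrays without an entry simply keep their original chunks, which is what makes the scoped form safe for whole-group opens.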
Thank you for chiming in, @rabernat. I've implemented a more complex chunks header in #7, and @katamartin and I are wondering whether we need the full path in the header key or whether the keys can be relative to the path:
{
  "chunks": {
    "bed": "10,10",
    "x": 5
  }
}
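One way to have it both ways (purely a sketch, not the actual #7 implementation): accept both forms, preferring a key relative to the store path and falling back to the full path:

```python
def resolve_chunks(mapping: dict, store_path: str, array_name: str):
    """Resolve a chunks override, accepting relative or full-path keys.

    Values may be comma-separated strings ("10,10") or bare ints (5),
    mirroring the example above. Returns a tuple of ints, or None.
    """
    spec = mapping.get(array_name)
    if spec is None:
        spec = mapping.get(f"{store_path}/{array_name}")
    if spec is None:
        return None
    return tuple(int(c) for c in str(spec).split(","))

mapping = {"bed": "10,10", "x": 5}
print(resolve_chunks(mapping, "ldeo-glaciology/bedmachine/bm.zarr", "bed"))  # (10, 10)
print(resolve_chunks(mapping, "ldeo-glaciology/bedmachine/bm.zarr", "x"))    # (5,)
```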
After tinkering with the new approach for specifying chunks headers in #7, I'm happy to report that everything seems to be working with both Xarray and Zarr. The key piece is that we now accept chunks headers along the `.zmetadata` route. When chunks are specified, we modify the `chunks` entry for the specified variables and override the compressor by setting it to `None` for all variables, since we are sending raw bytes.
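In spirit, the `.zmetadata` rewrite looks something like the following (a hedged sketch; the actual code in #7 may differ in names and details):

```python
import copy

def rewrite_zmetadata(zmetadata: dict, chunk_overrides: dict) -> dict:
    """Return consolidated metadata with chunk shapes overridden for the
    requested variables, and compression disabled for every array, since
    the proxy serves raw (uncompressed) bytes."""
    out = copy.deepcopy(zmetadata)
    for key, meta in out["metadata"].items():
        if not key.endswith("/.zarray"):
            continue  # only array metadata carries chunks/compressor
        var = key[: -len("/.zarray")]
        if var in chunk_overrides:
            meta["chunks"] = list(chunk_overrides[var])
        meta["compressor"] = None
    return out

zmeta = {"metadata": {
    "bed/.zarray": {"chunks": [3000, 3000], "compressor": {"id": "zlib"}},
    "bed/.zattrs": {"units": "meters"},
}}
new = rewrite_zmetadata(zmeta, {"bed": (10, 10)})
print(new["metadata"]["bed/.zarray"])
# {'chunks': [10, 10], 'compressor': None}
```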
In [5]: import xarray as xr, zarr
In [6]: chunks='bed=10,10,mask=20,20'
In [7]: url = 'http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr'
In [8]: store = zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": chunks}})
In [9]: ds = xr.open_dataset(store, engine='zarr', chunks={})
In [10]: ds
Out[10]:
<xarray.Dataset>
Dimensions: (y: 13333, x: 13333)
Coordinates:
* x (x) int32 -3333000 -3332500 -3332000 ... 3332000 3332500 3333000
* y (y) int32 3333000 3332500 3332000 ... -3332000 -3332500 -3333000
Data variables:
bed (y, x) float32 dask.array<chunksize=(10, 10), meta=np.ndarray>
errbed (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
firn (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
geoid (y, x) int16 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
mask (y, x) int8 dask.array<chunksize=(20, 20), meta=np.ndarray>
source (y, x) int8 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
surface (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
thickness (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
Attributes: (12/25)
Author: Mathieu Morlighem
Conventions: CF-1.7
Data_citation: Morlighem M. et al., (2019), Deep...
Notes: Data processed at the Department ...
Projection: Polar Stereographic South (71S,0E)
Title: BedMachine Antarctica
... ...
spacing: [500]
standard_parallel: [-71.0]
straight_vertical_longitude_from_pole: [0.0]
version: 05-Nov-2019 (v1.38)
xmin: [-3333000]
ymax: [3333000]
In [12]: ds.isel(x=range(2), y=range(2)).bed.compute()
Out[12]:
<xarray.DataArray 'bed' (y: 2, x: 2)>
array([[-5914.538 , -5919.3955],
[-5910.384 , -5915.8296]], dtype=float32)
Coordinates:
* x (x) int32 -3333000 -3332500
* y (y) int32 3333000 3332500
Attributes:
grid_mapping: mapping
long_name: bed topography
source: IBCSO and Mathieu Morlighem
standard_name: bedrock_altitude
units: meters
Is there any live demo I could peek at?
@rabernat yeah, you should be able to play around with this: https://756xnpgrdy6om3hgr5wxyxvnzm0ecwcg.lambda-url.us-west-2.on.aws
I guess i meant an actual map. 😉
Aha yeah the link for the map is https://ncview-js.staging.carbonplan.org/, but the app is definitely not stable 😅. We're currently troubleshooting the integration with the newly added validations.
@katamartin and I have been making progress on integrating the proxy into the data viewer. Our intention is to use the proxy for on-the-fly rechunking of datasets for visualization purposes. The results look promising and the performance is satisfactory (for small datasets and for datasets hosted in AWS S3), even without caching on the backend.
https://storage.googleapis.com/carbonplan-maps/ncview/demo/single_timestep/air_temperature.zarr
s3://carbonplan-data-viewer/demo/MURSST.zarr
(the original chunk size is roughly 1.21 GB). Retrieving data from stores hosted outside of S3 takes a long time (as expected). The following are timings for
gs://ldeo-glaciology/bedmachine/bm.zarr
(the original chunk size is roughly 35 MB). There's still more work to do to ensure seamless interoperability with existing Zarr clients. To illustrate, below is a code snippet that demonstrates how the proxy can be used via the zarr Python library.
If we attempt to access a variable whose dimensionality does not match the chunks specified in the HTTP headers, it causes issues or outright failure. For instance, in our store, `x` is 1-D, while the chunks we specified earlier are `10,10`, as defined in `zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": "10,10"}})`.
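The per-variable header format used in the session above (`bed=10,10,mask=20,20`) sidesteps this mismatch: only the named variables are rechunked, so a 1-D coordinate like `x` is left alone. Here is a hedged sketch of parsing that format (illustrative only, not the proxy's actual parser):

```python
def parse_chunks_header(header: str) -> dict:
    """Parse 'bed=10,10,mask=20,20' into {'bed': (10, 10), 'mask': (20, 20)}.

    Tokens containing '=' start a new variable; bare tokens extend the
    current variable's chunk shape.
    """
    shapes: dict = {}
    key = None
    for token in header.split(","):
        if "=" in token:
            key, first = token.split("=", 1)
            shapes[key] = [int(first)]
        else:
            shapes[key].append(int(token))
    return {k: tuple(v) for k, v in shapes.items()}

print(parse_chunks_header("bed=10,10,mask=20,20"))
# {'bed': (10, 10), 'mask': (20, 20)}
```

Variables absent from the header (like `x`) get no entry, so the proxy can fall back to the store's native chunking for them.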
It would be nice if there were a way to override the headers via fsspec.
I am also CC-ing some folks (@freeman-lab, @norlandrhagen, @jhamman, @rabernat) who might be interested in this, to keep them in the loop on our progress.