pangeo-data / zarr-proxy

A proxy for Zarr stores that allows for chunking overrides.
Apache License 2.0

Integrating the proxy into the data viewer - progress update and performance observations and other issues #6

Open andersy005 opened 1 year ago

andersy005 commented 1 year ago

@katamartin and I have been making progress integrating the proxy into the data viewer. Our intention is to use the proxy for on-the-fly rechunking of datasets for visualization. The results look promising, and performance is satisfactory (for small datasets and for datasets hosted in AWS S3) even without caching on the backend.

[Screenshots: data viewer timings, 2023-01-25 at 10:48 and 11:54]

There's still more work to do to ensure seamless interoperability with existing Zarr clients. To illustrate, the snippet below demonstrates how the proxy can be used via the zarr Python library.

In [21]: url = 'http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr'

In [22]: store = zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": "10,10"}})

In [23]: store['.zattrs']
Out[23]: b'{"Author":"Mathieu Morlighem","Conventions":"CF-1.7","Data_citation":"Morlighem M. et al., (2019), Deep glacial troughs and stabilizing ridges unveiled beneath the margins of the Antarctic ice sheet, Nature Geoscience (accepted)","Notes":"Data processed at the Department of Earth System Science, University of California, Irvine","Projection":"Polar Stereographic South (71S,0E)","Title":"BedMachine Antarctica","false_easting":[0.0],"false_northing":[0.0],"grid_mapping_name":"polar_stereographic","ice_density (kg m-3)":[917.0],"inverse_flattening":[298.2794050428205],"latitude_of_projection_origin":[-90.0],"license":"No restrictions on access or use","no_data":[-9999.0],"nx":[13333.0],"ny":[13333.0],"proj4":"+init=epsg:3031","sea_water_density (kg m-3)":[1027.0],"semi_major_axis":[6378273.0],"spacing":[500],"standard_parallel":[-71.0],"straight_vertical_longitude_from_pole":[0.0],"version":"05-Nov-2019 (v1.38)","xmin":[-3333000],"ymax":[3333000]}'
In [25]: arr = zarr.open(store, path='/bed')

In [27]: arr
Out[27]: <zarr.core.Array '/bed' (13333, 13333) float32>
In [28]: arr[:10, :10]
Out[28]: 
array([[-5914.538 , -5919.3955, -5924.865 , -5930.3765, -5935.8853,
        -5941.0205, -5945.997 , -5950.359 , -5954.3784, -5958.045 ],
       [-5910.384 , -5915.8296, -5921.3076, -5927.158 , -5932.7554,
        -5938.29  , -5943.1704, -5947.785 , -5951.881 , -5955.54  ],
       [-5906.422 , -5911.8516, -5917.63  , -5923.6133, -5929.573 ,
        -5935.029 , -5940.271 , -5944.9736, -5949.237 , -5952.898 ],
       [-5902.613 , -5908.093 , -5914.061 , -5920.044 , -5925.9707,
        -5931.7017, -5937.0083, -5941.9688, -5946.243 , -5950.265 ],
       [-5899.054 , -5904.7085, -5910.5   , -5916.532 , -5922.4585,
        -5928.2095, -5933.64  , -5938.608 , -5943.3335, -5947.362 ],
       [-5895.9683, -5901.283 , -5907.2   , -5913.2   , -5919.1235,
        -5924.6836, -5930.077 , -5935.3584, -5940.0796, -5944.544 ],
       [-5892.8423, -5898.332 , -5904.08  , -5910.0503, -5915.838 ,
        -5921.344 , -5926.583 , -5931.785 , -5936.9224, -5941.452 ],
       [-5890.067 , -5895.4604, -5901.1587, -5906.9365, -5912.6836,
        -5918.2617, -5923.3687, -5928.1724, -5933.3447, -5937.538 ],
       [-5887.37  , -5892.716 , -5898.2046, -5903.9224, -5909.691 ,
        -5915.144 , -5920.3755, -5925.193 , -5928.876 , -5933.021 ],
       [-5884.786 , -5890.015 , -5895.455 , -5900.958 , -5906.5366,
        -5912.1353, -5917.4043, -5921.5264, -5925.1343, -5928.5483]],
      dtype=float32)

If we attempt to access a variable whose dimensionality does not match the chunks specified in the HTTP headers, the request fails. For instance, in our store, x is 1-D, while the chunks we specified earlier are 10,10, as defined in zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": "10,10"}}).

In [29]: store['x/.zarray']
Out[29]: b'{"chunks":[10,10],"compressor":null,"dtype":"<i4","fill_value":null,"filters":[],"order":"C","shape":[13333],"zarr_format":2}'

In [30]: store['x/0']
---------------------------------------------------------------------------
ClientResponseError                       Traceback (most recent call last)
Cell In[30], line 1
----> 1 store['x/0']

ClientResponseError: 500, message='Internal Server Error', url=URL('http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr/x/0')
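
The failure boils down to a rank mismatch: the header asks for 2-D chunks, but x is 1-D. A hypothetical guard (a sketch only, not the proxy's actual code) shows the check the proxy could run before rewriting metadata, so it could return a 400 instead of surfacing a 500:

```python
def chunks_match_rank(shape, chunks):
    """Return True when the requested chunk spec has exactly one size
    per array dimension. Hypothetical helper, not zarr-proxy's code."""
    return len(chunks) == len(shape)

# bed is 2-D, so a "10,10" header is valid for it...
print(chunks_match_rank((13333, 13333), (10, 10)))  # True
# ...but x is 1-D, so the same header should be rejected up front.
print(chunks_match_rank((13333,), (10, 10)))  # False
```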

It would be nice if there were a way to override the headers per request via fsspec.

I am also CC-ing some folks (@freeman-lab, @norlandrhagen, @jhamman, @rabernat) who might be interested, to keep them in the loop on our progress.

andersy005 commented 1 year ago

I wanted to make a note that the timings and screenshots above were obtained while running the zarr-proxy via AWS Lambda.

rabernat commented 1 year ago

The problem is that a single chunk shape header is being provided to the entire group.

I see two high level ways of resolving this:

Only Proxy Arrays

If we just never attempt to open groups, we don't have this problem. The sequence looks like this:

  1. First open the consolidated metadata to discover all of the variables, shapes and chunks. (With no chunk header provided, the proxy should pass the chunks through unchanged from the underlying store.) This only needs to be done one time, when the data viewer session is being set up.
  2. Based on this information, the client decides what chunking it wants to receive from the proxy for each array.
  3. Now it's time to read the arrays. Construct a request for each array with the desired chunk header and open those paths directly.
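
In plain Zarr, the per-array flow above might be sketched like this (assumptions: the local proxy URL from earlier in the thread, and chunkings the client already chose in step 2):

```python
# Chunkings chosen by the client (step 2), one entry per array.
desired_chunks = {"bed": "10,10", "mask": "20,20", "x": "10"}

base = "http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr"

def open_array(name):
    """Step 3: a dedicated store per array, each carrying its own
    chunks header, opened at the array path directly (never the group)."""
    import zarr  # imported here so the sketch loads even without zarr installed
    store = zarr.storage.FSStore(
        base, client_kwargs={"headers": {"chunks": desired_chunks[name]}}
    )
    return zarr.open(store, path=f"/{name}")
```

Because each array gets its own store (and thus its own header), a 1-D coordinate like x can request "10" while bed requests "10,10".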

This is not compatible with how we tend to use Xarray, Zarr, and fsspec from Python: there we open the whole group and thus can't specialize the headers for different arrays. But it would work fine in plain Zarr, and it may be feasible from JavaScript land.

Is Xarray support required here?

Scope the header to specific arrays

We could scope the header to specify different chunks for different objects. Instead of

{"chunks": "10,10"}

what about

{
    "chunks":
    {
        "storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr/bed": "10,10"
    }
}

The steps to set up reading would be as follows. The first two are the same as above.

  1. First open the consolidated metadata to discover all of the variables, shapes and chunks. (With no chunk header provided, the proxy should pass the chunks through unchanged from the underlying store.) This only needs to be done one time, when the data viewer session is being set up.
  2. Based on this information, the client decides what chunking it wants to receive from the proxy for each array.
  3. Now the client constructs this more complex header, and re-opens the consolidated metadata with chunks specified for each array within the group.

The tricky bit here is aligning the paths specified in the header with the paths specified in the URL. But this method should also work with Xarray.
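
Building that scoped header from the consolidated metadata is plain dictionary work; a sketch (the array names and chunk choices here are illustrative, following the header shape proposed above):

```python
import json

store_path = "storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr"

# Chunkings chosen in step 2, keyed by array name within the group.
desired = {"bed": "10,10", "x": "10"}

# Step 3: scope each entry to its full path, matching the URL the
# client will request, then serialize for the HTTP header.
scoped = {f"{store_path}/{name}": spec for name, spec in desired.items()}
header_value = json.dumps({"chunks": scoped})
print(header_value)
```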

andersy005 commented 1 year ago

thank you for chiming in, @rabernat. I've implemented a more complex chunks header in #7, and @katamartin and I are wondering whether we need the full path in the header key, or whether the keys can be relative to the store path:

{
    "chunks":
    {
        "bed": "10,10",
        "x": "5"
    }
}

andersy005 commented 1 year ago

After tinkering with the new approach for specifying chunks headers in #7, I'm happy to report that everything seems to be working with both Xarray and Zarr. The key piece is that we now accept chunks headers on the .zmetadata route. When chunks are specified, we modify the 'chunks' for the specified variables and override the compressor by setting it to None for all variables, since we are sending raw bytes.

In [5]: import xarray as xr, zarr

In [6]: chunks='bed=10,10,mask=20,20'

In [7]: url = 'http://127.0.0.1:8000/storage.googleapis.com/ldeo-glaciology/bedmachine/bm.zarr'

In [8]: store = zarr.storage.FSStore(url, client_kwargs={'headers': {"chunks": chunks}})

In [9]: ds = xr.open_dataset(store, engine='zarr', chunks={})

In [10]: ds
Out[10]: 
<xarray.Dataset>
Dimensions:    (y: 13333, x: 13333)
Coordinates:
  * x          (x) int32 -3333000 -3332500 -3332000 ... 3332000 3332500 3333000
  * y          (y) int32 3333000 3332500 3332000 ... -3332000 -3332500 -3333000
Data variables:
    bed        (y, x) float32 dask.array<chunksize=(10, 10), meta=np.ndarray>
    errbed     (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    firn       (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    geoid      (y, x) int16 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    mask       (y, x) int8 dask.array<chunksize=(20, 20), meta=np.ndarray>
    source     (y, x) int8 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    surface    (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
    thickness  (y, x) float32 dask.array<chunksize=(3000, 3000), meta=np.ndarray>
Attributes: (12/25)
    Author:                                 Mathieu Morlighem
    Conventions:                            CF-1.7
    Data_citation:                          Morlighem M. et al., (2019), Deep...
    Notes:                                  Data processed at the Department ...
    Projection:                             Polar Stereographic South (71S,0E)
    Title:                                  BedMachine Antarctica
    ...                                     ...
    spacing:                                [500]
    standard_parallel:                      [-71.0]
    straight_vertical_longitude_from_pole:  [0.0]
    version:                                05-Nov-2019 (v1.38)
    xmin:                                   [-3333000]
    ymax:                                   [3333000]
In [12]: ds.isel(x=range(2), y=range(2)).bed.compute()
Out[12]: 
<xarray.DataArray 'bed' (y: 2, x: 2)>
array([[-5914.538 , -5919.3955],
       [-5910.384 , -5915.8296]], dtype=float32)
Coordinates:
  * x        (x) int32 -3333000 -3332500
  * y        (y) int32 3333000 3332500
Attributes:
    grid_mapping:   mapping
    long_name:      bed topography
    source:         IBCSO and Mathieu Morlighem
    standard_name:  bedrock_altitude
    units:          meters
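
As an aside, the flat header format used above ('bed=10,10,mask=20,20') is straightforward to parse. Here is a sketch of one way to do it (not necessarily how #7 implements it; it assumes each 'name=' token starts a new variable and the bare integers that follow belong to it):

```python
def parse_chunks_header(header):
    """Parse 'bed=10,10,mask=20,20' into {'bed': (10, 10), 'mask': (20, 20)}.
    Sketch only; assumes variable names never contain '=' or ','."""
    result = {}
    current = None
    for token in header.split(","):
        if "=" in token:
            # A new variable begins; the value after '=' is its first chunk size.
            name, first = token.split("=", 1)
            current = name
            result[current] = (int(first),)
        elif current is not None:
            # A bare integer extends the current variable's chunk tuple.
            result[current] = result[current] + (int(token),)
    return result

print(parse_chunks_header("bed=10,10,mask=20,20"))
# {'bed': (10, 10), 'mask': (20, 20)}
```
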
rabernat commented 1 year ago

Is there any live demo I could peek at?

katamartin commented 1 year ago

@rabernat yeah, you should be able to play around with this: https://756xnpgrdy6om3hgr5wxyxvnzm0ecwcg.lambda-url.us-west-2.on.aws

rabernat commented 1 year ago

I guess I meant an actual map. 😉

katamartin commented 1 year ago

Aha, yeah, the link for the map is https://ncview-js.staging.carbonplan.org/, but the app is definitely not stable 😅. We're currently troubleshooting the integration with the newly added validations.