visgl / loaders.gl

Loaders for big data visualization. Website:
https://loaders.gl

Zarr loader #1297

Open rabernat opened 3 years ago

rabernat commented 3 years ago

Hello and thanks for all of your work on this incredible open source package and ecosystem!

At our Pangeo Community meeting today, we discussed our wish to integrate cloud-based weather and climate data stored in the Zarr format with the deck.gl ecosystem. (I noticed this has been discussed before in #1140.) There is a Zarr reader in JavaScript (zarr.js), so maybe that makes this easier. I understand that #1140 probably has to be resolved to make this possible, but I thought I'd just open a dedicated issue to track the idea.

Tagging @kylebarron, @point9repeating and @manzt who may be interested.

manzt commented 3 years ago

Thanks for pinging me @rabernat. Hoping I can share lessons learned from zarr.js & Viv.

kylebarron commented 3 years ago

Hey @rabernat, happy to see you in this neck of the Github woods!

I think Zarr is a good fit for a new loader. I would expect it would be a thin wrapper around zarr.js (or maybe zarr-lite). Ideally we'll have a two-step loader so that the first step can instantiate the store and read the metadata (once) and the second step will load a single chunk, like const chunk = await z.getRawChunk('0.0.0').
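A rough sketch of that two-step flow on the zarr.js side (placeholder URL and variable path; the exact single-chunk accessor differs between zarr.js and zarr-lite, so treat the calls below as an assumption):

```js
import { openArray, HTTPStore } from 'zarr'; // zarr.js

// Step 1 (once): open the store and read the array metadata (.zarray / .zattrs).
const store = new HTTPStore('https://example.com/dataset.zarr'); // placeholder URL
const z = await openArray({ store, path: 'temperature', mode: 'r' }); // placeholder path
console.log(z.shape, z.chunks, z.dtype);

// Step 2 (per chunk): fetch and decode only the data currently needed.
// A zarr-lite-style getRawChunk('0.0.0') would grab exactly one stored chunk;
// with plain zarr.js a region read via getRaw() achieves much the same thing.
const { data, shape } = await z.getRaw([0, null, null]); // first time step
console.log(shape, data);
```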

Questions:

@ibgreen likely has some feedback, but he may be slow to respond the next few days.

point9repeating commented 3 years ago

Hello!

I spent a little time getting reacquainted with deck.gl / luma.gl (it's been a couple years since I've used these libraries) and was able to throw together a proof of concept rendering a geospatial zarr data set using deck.gl + zarr.js:

[screenshot: proof-of-concept rendering of a geospatial Zarr dataset in deck.gl]

This data set is the NCEP NAM forecast I had available locally [2m relative humidity with a simple red/blue color bar].
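Not the actual PoC code, but a rough illustration of how such a slice can be wired up with zarr.js + deck.gl (the URL, variable path, color ramp, and bounds below are all placeholders):

```js
import { Deck } from '@deck.gl/core';
import { BitmapLayer } from '@deck.gl/layers';
import { openArray, HTTPStore } from 'zarr';

// Read one 2D slice with zarr.js, map it through a simple red/blue ramp,
// and hand the resulting image to a BitmapLayer.
async function renderHumiditySlice() {
  const store = new HTTPStore('https://example.com/forecast.zarr'); // placeholder
  const z = await openArray({ store, path: 'relative_humidity_2m', mode: 'r' }); // placeholder
  const { data, shape } = await z.getRaw([0, null, null]); // first time step
  const [height, width] = shape;

  const rgba = new Uint8ClampedArray(width * height * 4);
  for (let i = 0; i < width * height; i++) {
    const t = data[i] / 100; // 0-100 %RH -> 0-1
    rgba[i * 4 + 0] = 255 * (1 - t); // red: dry
    rgba[i * 4 + 2] = 255 * t; // blue: humid
    rgba[i * 4 + 3] = 255;
  }

  new Deck({
    initialViewState: { longitude: -95, latitude: 38, zoom: 3 },
    controller: true,
    layers: [
      new BitmapLayer({
        id: 'zarr-slice',
        image: new ImageData(rgba, width, height),
        // Placeholder lon/lat extent; the NAM grid is actually Lambert conformal,
        // so a real version needs reprojection.
        bounds: [-130, 20, -60, 55]
      })
    ]
  });
}
```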

I thought it might be useful to throw up a version of this pointed at an open cloud-hosted data set to help us discuss any potential additions to luma.gl or deck.gl. Unfortunately, I'm struggling to find a data set that has a CORS-enabled http front-end. I tried MUR and HRRR.

It looks like the HRRR bucket is configured with static website hosting: http://hrrrzarr.s3-website-us-west-1.amazonaws.com/sfc/20210414/20210414_00z_fcst.zarr/surface/TMP/projection_y_coordinate/.zarray

but CORS has not been configured to allow requests from other origins.
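For reference, the kind of bucket CORS policy that unblocks these browser requests looks roughly like this (illustrative values only; the actual rules are up to the bucket owners):

```js
// Illustrative S3 CORS rules; an equivalent JSON document goes into the
// bucket's CORS configuration to allow cross-origin GETs from browsers.
const corsRules = [
  {
    AllowedMethods: ['GET', 'HEAD'],
    AllowedOrigins: ['*'], // or a list of specific visualization origins
    AllowedHeaders: ['*'],
    MaxAgeSeconds: 3000
  }
];
```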

rabernat commented 3 years ago

One issue (noted in https://github.com/zarr-developers/community/issues/37) is that much of the existing cloud-based Zarr data is optimized for backend analytics and consequently has chunk sizes of ~100 MB. This is probably much too big for interactive browser-based visualization.

What is an optimal chunk size for deck.gl? I'll try to prepare some Zarr data with much smaller chunks and set up an appropriate CORS policy.

kylebarron commented 3 years ago

This is probably much too big for interactive browser-based visualization.

One question is whether the compression algorithm applied to each block supports streaming decompression. For example, if you use gzip compression, you could write a streaming loader that emits an async generator of arrays along the third/last dimension. If the block's array size were (10, 10, 1000), then maybe the generator would emit arrays of (10, 10, 10). Then it would be possible to work with existing data with a large block size, as long as the application knows how to handle this streaming array data. (Though I'm not sure whether some common codecs like Blosc support streaming decompression, so this might be moot.)
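A minimal sketch of that idea, assuming a gzip-compressed chunk of float32 values at some URL (DecompressionStream is a recent browser API, and the codec caveat above still applies):

```js
// Sketch only: stream-decompress one large chunk and yield fixed-size runs of
// float32 values as they arrive. How those runs map back to array coordinates
// depends on the chunk's memory layout (C order for Zarr).
async function* streamChunkValues(url, valuesPerSlab) {
  const response = await fetch(url);
  const decompressed = response.body.pipeThrough(new DecompressionStream('gzip'));
  const reader = decompressed.getReader();

  const slabBytes = valuesPerSlab * Float32Array.BYTES_PER_ELEMENT;
  let pending = new Uint8Array(0);

  while (true) {
    const { value, done } = await reader.read();
    if (value) {
      const merged = new Uint8Array(pending.length + value.length);
      merged.set(pending);
      merged.set(value, pending.length);
      pending = merged;
    }
    // Emit as many complete slabs as are buffered so far.
    while (pending.length >= slabBytes) {
      yield new Float32Array(pending.slice(0, slabBytes).buffer);
      pending = pending.subarray(slabBytes);
    }
    if (done) break;
  }
}
```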

What is an optimal chunk size for deck.gl?

deck.gl's own requirements are set by the processing speed of the client and the amount of GPU memory it has. Handling at least a couple of 100 MB blocks at a time should be fine for deck.gl. I think the optimal chunk size is driven more by network download time: if you had a Zarr store on fast network-attached storage, 100 MB block sizes would be fine; for general internet access you'd probably want smaller block sizes.

Block size also matters for how many blocks you want to display at once. Is your preference, for example, to tile the entire screen or to show a smaller area over a longer time horizon? The MUR SST dataset on AWS, for instance, comes in two block sizes:

  1. time: 6443, lat: 100, lon: 100
  2. time: 5, lat: 1799, lon: 3600

You could envision preferring no. 2 at low zooms where you care more about seeing the entire globe and no. 1 at higher zooms where you display a single block at a time, but care more about the animation over time.
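To make that trade-off concrete, a client could switch between the two chunkings by zoom level; a tiny sketch (the store URLs are hypothetical):

```js
// Hypothetical: the same dataset published twice with different chunking.
const SPATIAL_STORE = 'https://example.com/mur-sst-spatial.zarr'; // time: 5, lat: 1799, lon: 3600
const TIMESERIES_STORE = 'https://example.com/mur-sst-series.zarr'; // time: 6443, lat: 100, lon: 100

// Low zoom: whole-globe snapshots; high zoom: long animations over a small area.
function storeForView(zoom) {
  return zoom < 4 ? SPATIAL_STORE : TIMESERIES_STORE;
}
```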

ibgreen commented 3 years ago

@point9repeating Your PoC looks very promising, and Zarr support in loaders.gl + deck.gl makes a lot of sense.

Are you willing to share the code so we can start digging in to the details of how this could be done in a general way?

Perhaps @kylebarron and myself could help you set up a quick proxy service to get around the CORS issue?

zflamig commented 3 years ago

Hi @point9repeating, the CORS configuration on the HRRR Zarr bucket has been adjusted, so please try it now.

4/29 update: MUR CORS now also supports this use case.

point9repeating commented 3 years ago

@zflamig I just saw this. Thank you so much!

FYI, it looks like mur-sst isn't set up for static website hosting: http://mur-sst.s3-website-us-west-2.amazonaws.com/zarr-v1/.zmetadata

point9repeating commented 3 years ago

And, it turns out the HRRR Zarr data is stored as half-float arrays (<f2), which aren't compatible with zarr.js because there is no native TypedArray in JavaScript that maps to half-floats (we only have Float32Array and Float64Array).

I made a quick attempt at updating zarr.js using an existing JavaScript implementation of Float16Array, but zarr.js is written in TypeScript and adding Float16Array wasn't trivial (it has a different base type).
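In the meantime, one workaround is to decode the raw chunk bytes manually before handing them to the rendering layer. A minimal (unoptimized) half-float decoder, not tied to zarr.js internals:

```js
// Decode IEEE 754 half-precision ("<f2") bytes into a Float32Array.
function halfToFloat32(bytes) {
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  const out = new Float32Array(bytes.byteLength / 2);
  for (let i = 0; i < out.length; i++) {
    const h = view.getUint16(i * 2, true); // little-endian per "<f2"
    const sign = h & 0x8000 ? -1 : 1;
    const exp = (h >> 10) & 0x1f;
    const frac = h & 0x03ff;
    if (exp === 0) {
      out[i] = sign * 2 ** -14 * (frac / 1024); // subnormal
    } else if (exp === 0x1f) {
      out[i] = frac ? NaN : sign * Infinity; // inf / NaN
    } else {
      out[i] = sign * 2 ** (exp - 15) * (1 + frac / 1024); // normal
    }
  }
  return out;
}
```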

zflamig commented 3 years ago

Do you need the website endpoint for this, @point9repeating? You should be able to just use https://mur-sst.s3.amazonaws.com/zarr-v1/.zmetadata and have it work the same, I would think.

point9repeating commented 3 years ago

That endpoint works great, @zflamig

I didn't realize you could enable CORS without enabling the static website hosting.

Wow. MUR is big. It looks like pulling the full global domain for a single time step will mean requesting 100 chunks that are ~40MB each.
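That estimate can be read straight from the consolidated metadata; a small sketch (the analysed_sst variable path is an assumption about this store's layout):

```js
// Count how many chunk requests one global time step needs.
const resp = await fetch('https://mur-sst.s3.amazonaws.com/zarr-v1/.zmetadata');
const { metadata } = await resp.json();
const { shape, chunks } = metadata['analysed_sst/.zarray'];

// Skip the time dimension (index 0) and count spatial chunks only.
const chunksPerTimeStep = shape
  .slice(1)
  .reduce((n, dim, i) => n * Math.ceil(dim / chunks[i + 1]), 1);
console.log(`One global time step = ${chunksPerTimeStep} chunk requests`);
```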

rabernat commented 3 years ago

It looks like pulling the full global domain for a single time step will mean requesting 100 chunks that are ~40MB each.

This is one reason why we would really like to explicitly support image pyramids in Zarr. (@joshmoore / @manzt and company do have some microscopy datasets that use image pyramids, but afaik there is no standard / convention.)

manzt commented 3 years ago

The microscopy community has started to unify around a standard/convention:

(think @joshmoore will be talking about this at Dask Summit?)

Some sample datasets from the Image Data Resource can be found here, all implementing the Zarr multiscales extension. Visualized in the browser using a combination of Zarr.js & deck.gl.

point9repeating commented 3 years ago

@manzt wow. this is so rad

kylebarron commented 3 years ago

In order to work with existing Zarr stores with a large block size, you could also take a more server-side approach where you write something like a rio-tiler adapter for Zarr, and then connect to a dynamic tiling server like Titiler. But there are clearly some drawbacks to that approach, and it isn't as scalable as directly fetching data from blocks on S3.

rabernat commented 3 years ago

In order to work with existing Zarr stores with a large block size, you could also take a more server-side approach

Big 👍 to this idea. Dynamic rechunking is definitely needed in the Zarr ecosystem. Simple server-side rechunking should be possible with xpublish.

For testing / demonstration, it would also be easy to create a static Zarr dataset that is optimally chunked for visualization (rather than analysis).

kylebarron commented 3 years ago

Dynamic rechunking is definitely needed in the Zarr ecosystem. Simple server-side rechunking should be possible with xpublish.

This is straying a bit from loaders.gl, but I wanted to add a couple notes here.

I think https://github.com/developmentseed/titiler is becoming a pretty popular project for serving geospatial raster assets on the fly, and think it could work well with Zarr too. The easiest way to set that up would be to make a new rio-tiler reader, like the COGReader class. Happy to discuss this more, maybe on an issue there?

You can imagine two Zarr adapters in titiler: one to read Zarr collections just like it reads GDAL datasets and another to expose an API with a "virtual" Zarr collection that's rechunked on demand. Then a ZarrLoader in loaders.gl could connect to that rechunked collection through the server.

About the rendering: it seems like most geospatial Zarr datasets are in a global WGS84 projection? deck.gl supports rendering lat/lon and Web Mercator tiles natively, but for data in any other projection, tiles would need to be reprojected at some stage in the process. Note that the TileLayer doesn't currently support non-Web-Mercator indexing. I'd love to give advice to anyone interested in making a PR for the TileLayer to support arbitrary indexing systems (see also https://github.com/visgl/deck.gl/pull/5504).

joshmoore commented 3 years ago

(think @joshmoore will be talking about this at Dask Summit?)

"Talking" is a bit much. I'll be annoying people with "multiscales" during the Life Sciences workshop but happy to discuss elsewhen, too. Short-short pitch, as @rabernat and @manzt know, I'd very much like more libraries to adopt the same strategy of defining multiscale/multiresolution images. It just makes life so much simpler. (Even just in microscopy we had N different formats which is where NGFF started unifying)

manzt commented 3 years ago

I absolutely love the idea of dynamic re-chunking, and it's something I've been experimenting with myself. It's also easy to imagine swapping compression on the server, e.g. from lossless to lossy.

It just makes life so much simpler.

+1 to this. Thinking of Zarr as an API rather than a file format, a static Zarr dataset in cloud storage is indistinguishable on the client from one created dynamically on the server. The multiscales extension for Zarr essentially describes the endpoints a tile server would be responsible for implementing and that the Zarr client will ask for. Changing chunk size, compression, etc. can all be expressed by changing the array metadata. If something like titiler adopted the multiscales extension, that would be very exciting.

IMO the ZarrLoader should be completely agnostic to the backend if possible. "Improving" a Zarr dataset for visualization can all be performed on the server and communicated in the Zarr metadata. By default, it would be nice if the loader recognized the multiscales extension, but I could also see accepting explicit URLs for the separate ZarrArrays in the pyramid as an option.

e.g. "https://my-multicales-dataset.zarr" vs ["https://my-multicales-dataset.zarr/0", "https://my-multicales-dataset.zarr/1", "https://my-multicales-dataset.zarr/2"]

kylebarron commented 3 years ago

swapping compression on the server, e.g. from lossless to lossy.

Aside: it could be interesting to test out LERC with some Zarr data. It seems like a good candidate for compressing data to bring to the browser when you have some defined precision. I only see one mention of Zarr + LERC, though.

If something like titiler adopted the multiscales extension, that would be very exciting.

Titiler doesn't currently expose a Zarr API, but a Zarr extension is something we could discuss on that repo.

IMO the ZarrLoader should be completely agnostic to the backend if possible

Agreed. Doesn't seem like a difficult requirement; don't see why a ZarrLoader would even need to know if the Zarr dataset is static or dynamic.

manzt commented 3 years ago

Agreed. Doesn't seem like a difficult requirement; don't see why a ZarrLoader would even need to know if the Zarr dataset is static or dynamic.

Totally agree. I've just noticed that xpublish and other tools may introduce additional REST API endpoints beyond the chunk/metadata keys, and I'd like to avoid relying on any custom endpoints in a loader implementation.

rabernat commented 3 years ago

Xpublish has extra endpoints for convenience, but clients don't have to use them. To the client, the data are indistinguishable from static Zarr files served over HTTP.

kylebarron commented 3 years ago

Quick mention here that we're discussing with @manzt creating an initial Zarr loader as part of #1441, building on what he has already done for the Viv project.

kylebarron commented 3 years ago

I was also just made aware (thanks @vincentsarago) that a GDAL driver for Zarr is progressing in https://github.com/OSGeo/gdal/pull/3896. We should keep tabs on that to make sure the ZarrLoader here can read that data seamlessly.