rabernat opened this issue 3 years ago
Thanks for pinging me @rabernat. Hoping I can share lessons learned from zarr.js & Viv.
Hey @rabernat, happy to see you in this neck of the Github woods!
I think Zarr is a good fit for a new loader. I would expect it to be a thin wrapper around zarr.js (or maybe zarr-lite). Ideally we'll have a two-step loader, so that the first step can instantiate the store and read the metadata (once) and the second step loads a single chunk, like `const chunk = await z.getRawChunk('0.0.0')`.
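A minimal sketch of what that two-step shape could look like with zarr.js (the store URL and array path are placeholders, and the exact chunk-fetching call may differ between zarr.js and zarr-lite):

```js
import { openArray, HTTPStore } from "zarr"; // zarr.js

// Step 1: instantiate the store and read the array metadata once.
const store = new HTTPStore("https://example.com/my-dataset.zarr"); // placeholder URL
const z = await openArray({ store, path: "temperature", mode: "r" }); // placeholder path

// Step 2: fetch individual chunks on demand as they come into view.
const chunk = await z.getRawChunk("0.0.0"); // typed array + shape for one chunk
console.log(chunk.shape, chunk.data);
```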
Questions:
- Could we get by with what `zarr-lite/core` provides, for a lower bundle size?

@ibgreen likely has some feedback, but he may be slow to respond the next few days.
Hello!
I spent a little time getting reacquainted with deck.gl / luma.gl (it's been a couple years since I've used these libraries) and was able to throw together a proof of concept rendering a geospatial zarr data set using deck.gl + zarr.js:
This data set is the NCEP NAM forecast I had available locally [2m relative humidity with a simple red/blue color bar].
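Not the actual PoC code, but a hedged sketch of its general shape, assuming a `[time, y, x]` float array; the store URL, variable name, value range, and geographic bounds are placeholders. zarr.js pulls one time step and deck.gl's `BitmapLayer` draws it through a simple red/blue ramp:

```js
import { openArray } from "zarr";
import { Deck } from "@deck.gl/core";
import { BitmapLayer } from "@deck.gl/layers";

// Map raw values onto a simple red (high) / blue (low) ramp.
function toImage(values, width, height, min, max) {
  const rgba = new Uint8ClampedArray(width * height * 4);
  for (let i = 0; i < width * height; i++) {
    const t = Math.min(1, Math.max(0, (values[i] - min) / (max - min)));
    rgba[4 * i + 0] = 255 * t;
    rgba[4 * i + 2] = 255 * (1 - t);
    rgba[4 * i + 3] = 255;
  }
  return new ImageData(rgba, width, height);
}

async function render() {
  // Placeholder store and variable; a [time, y, x] array is assumed.
  const z = await openArray({ store: "https://example.com/nam.zarr", path: "rh2m", mode: "r" });
  const { data, shape } = await z.getRaw([0, null, null]); // first time step
  const [height, width] = shape;

  new Deck({
    initialViewState: { longitude: -95, latitude: 40, zoom: 3 },
    controller: true,
    layers: [
      new BitmapLayer({
        id: "zarr-poc",
        image: toImage(data, width, height, 0, 100), // relative humidity in %
        // Placeholder [west, south, east, north]; depending on the dataset's
        // row order, the image may also need a vertical flip.
        bounds: [-130, 20, -60, 55]
      })
    ]
  });
}

render();
```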
I thought it might be useful to throw up a version of this pointed at an open cloud-hosted data set to help us discuss any potential additions to luma.gl or deck.gl. Unfortunately, I'm struggling to find a data set that has a CORS-enabled http front-end. I tried MUR and HRRR.
It looks like the HRRR bucket is configured with static website hosting (http://hrrrzarr.s3-website-us-west-1.amazonaws.com/sfc/20210414/20210414_00z_fcst.zarr/surface/TMP/projection_y_coordinate/.zarray), but CORS has not been configured to allow requests from other origins.
One issue (noted in https://github.com/zarr-developers/community/issues/37) is that much of the existing cloud-based Zarr data is optimized for backend analytics and consequently has chunk sizes of ~100 MB. This is probably much too big for interactive browser-based visualization.
What is an optimal chunk size for deck.gl? I'll try to prepare some Zarr data with much smaller chunks and set up an appropriate CORS policy.
> This is probably much too big for interactive browser-based visualization.
One question is whether the compression algorithm applied to each block supports streaming decompression. For example if you use gzip compression, you could write a streaming loader that emits an async generator of arrays along the third/last dimension. If the array size of the block were 10, 10, 1000, then maybe the generator would emit arrays of 10, 10, 10. Then it would be possible to work with existing data with a large block size as long as the application knows how to handle this streaming array data. (Though I'm not sure if some common codecs like blosc support streaming decompression, so this might be moot).
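A rough sketch of that idea, assuming gzip-compressed chunks, little-endian float32 data, and the browser's built-in `DecompressionStream`. One caveat: with C-ordered chunks the contiguous direction is the first axis, so this version emits slabs along that axis (e.g. `[1, 10, 1000]` pieces of a `[10, 10, 1000]` chunk) rather than the last one:

```js
// Stream-decompress one gzip-compressed chunk and yield fixed-size slabs as the
// bytes arrive, instead of waiting for the full (possibly ~100 MB) chunk.
// Assumes C-ordered, little-endian float32 data.
async function* streamChunk(url, shape = [10, 10, 1000]) {
  const [, d1, d2] = shape;
  const slabBytes = d1 * d2 * 4; // one "row" along the first axis, float32

  const response = await fetch(url);
  const reader = response.body
    .pipeThrough(new DecompressionStream("gzip"))
    .getReader();

  let buffered = new Uint8Array(0);
  while (true) {
    const { value, done } = await reader.read();
    if (value) {
      const merged = new Uint8Array(buffered.length + value.length);
      merged.set(buffered);
      merged.set(value, buffered.length);
      buffered = merged;
    }
    // Emit every complete slab decoded so far.
    while (buffered.length >= slabBytes) {
      yield new Float32Array(buffered.buffer, 0, d1 * d2);
      buffered = buffered.slice(slabBytes);
    }
    if (done) break;
  }
}

// Usage sketch:
// for await (const slab of streamChunk(chunkUrl)) { /* update the layer */ }
```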
> What is an optimal chunk size for deck.gl?
deck.gl's own requirements are set by the processing speed of the client and the amount of GPU memory it has; handling at least a couple of 100 MB blocks at a time should be fine for deck.gl. I think the optimal chunk size is driven more by network download time: if you had a Zarr store on fast network-attached storage, 100 MB block sizes would be fine, but for general internet access you'd probably want smaller block sizes.
Block size also matters for how many blocks you want to display at once. Is your preference for example to tile the entire screen or show a smaller area over a larger time horizon? For example the MUR SST dataset on AWS comes in block sizes:
You could envision preferring no. 2 at low zooms where you care more about seeing the entire globe and no. 1 at higher zooms where you display a single block at a time, but care more about the animation over time.
@point9repeating Your PoC looks very promising, and Zarr support in loaders.gl + deck.gl makes a lot of sense.
Are you willing to share the code so we can start digging into the details of how this could be done in a general way?
Perhaps @kylebarron and myself could help you set up a quick proxy service to get around the CORS issue?
Hi @point9repeating, the CORS configuration on the HRRR Zarr bucket has been adjusted, so please try it now.
4/29 update: MUR CORS now also supports this use case.
@zflamig I just saw this. Thank you so much!
FYI, it looks like mur-sst isn't set up for static website hosting: http://mur-sst.s3-website-us-west-2.amazonaws.com/zarr-v1/.zmetadata
And, it turns out the HRRR Zarr data is stored as half-float arrays (`<f2`), which isn't compatible with zarr.js because there isn't a native TypedArray in JavaScript that maps to half-floats (we only have Float32Array and Float64Array).
I made a quick attempt at updating zarr.js using this JavaScript implementation of a Float16Array, but zarr.js is written in TypeScript and it wasn't trivial to add Float16Array (different base type).
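One possible workaround that doesn't require changing zarr.js is to grab the raw chunk bytes and widen the half-floats to a Float32Array in user code. A sketch of the standard bit manipulation (little-endian data is assumed, as `<f2` implies):

```js
// Widen IEEE 754 half-precision values ("<f2") to float32.
// `buffer` is the raw, already-decompressed chunk bytes as an ArrayBuffer.
// Assumes the platform is little-endian, which is true for browsers in practice.
function float16ToFloat32Array(buffer) {
  const u16 = new Uint16Array(buffer);
  const out = new Float32Array(u16.length);
  for (let i = 0; i < u16.length; i++) {
    const h = u16[i];
    const sign = h & 0x8000 ? -1 : 1;
    const exp = (h & 0x7c00) >> 10;
    const frac = h & 0x03ff;
    if (exp === 0) {
      out[i] = sign * 2 ** -14 * (frac / 1024); // subnormal (and signed zero)
    } else if (exp === 0x1f) {
      out[i] = frac ? NaN : sign * Infinity;    // NaN / +-Infinity
    } else {
      out[i] = sign * 2 ** (exp - 15) * (1 + frac / 1024);
    }
  }
  return out;
}
```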
Do you need the website endpoint for this, @point9repeating? You should be able to just use https://mur-sst.s3.amazonaws.com/zarr-v1/.zmetadata and have it work the same, I would think.
That endpoint works great, @zflamig.
I didn't realize you could enable CORS without enabling the static website hosting.
Wow. MUR is big. It looks like pulling the full global domain for a single time step will mean requesting 100 chunks that are ~40MB each.
> It looks like pulling the full global domain for a single time step will mean requesting 100 chunks that are ~40MB each.
This is one reason why we would really like to explicitly support image pyramids in Zarr. (@joshmoore / @manzt and company do have some microscopy datasets that use image pyramids, but afaik there is no standard / convention.)
The microscopy community has started to unify around a standard / convention:
(I think @joshmoore will be talking about this at Dask Summit?)
Some sample datasets from the Image Data Resource can be found here, all implementing the Zarr multiscales extension. Visualized in the browser using a combination of Zarr.js & deck.gl.
@manzt wow. this is so rad
In order to work with existing Zarr stores with a large block size, you could also take a more server-side approach where you write something like a rio-tiler adapter for Zarr, and then connect to a dynamic tiling server like Titiler. But there are clearly some drawbacks to that approach, and it isn't as scalable as directly fetching data from blocks on S3.
> In order to work with existing Zarr stores with a large block size, you could also take a more server-side approach
Big 👍 to this idea. Dynamic rechunking is definitely needed in the Zarr ecosystem. Simple server-side rechunking should be possible with xpublish.
For testing / demonstration, it would also be easy to create a static Zarr dataset that is optimally chunked for visualization (rather than analysis).
> Dynamic rechunking is definitely needed in the Zarr ecosystem. Simple server-side rechunking should be possible with xpublish.
This is straying a bit from loaders.gl, but I wanted to add a couple notes here.
I think https://github.com/developmentseed/titiler is becoming a pretty popular project for serving geospatial raster assets on the fly, and think it could work well with Zarr too. The easiest way to set that up would be to make a new `rio-tiler` reader, like the `COGReader` class. Happy to discuss this more, maybe on an issue there?
You can imagine two Zarr adapters in titiler: one to read Zarr collections just like it reads GDAL datasets and another to expose an API with a "virtual" Zarr collection that's rechunked on demand. Then a ZarrLoader in loaders.gl could connect to that rechunked collection through the server.
About the rendering: it seems like most Zarr geospatial datasets are in a global WGS84 projection? Deck.gl supports rendering lat/lon and Web Mercator tiles natively, but for data in any other projection, tiles would need to be reprojected at some stage in the process. Note that the TileLayer doesn't currently support non-Web-Mercator indexing. I'd love to advise anyone interested in making a PR for the TileLayer to support arbitrary indexing systems (see also https://github.com/visgl/deck.gl/pull/5504).
> (I think @joshmoore will be talking about this at Dask Summit?)
"Talking" is a bit much. I'll be annoying people with "multiscales" during the Life Sciences workshop but happy to discuss elsewhen, too. Short-short pitch, as @rabernat and @manzt know, I'd very much like more libraries to adopt the same strategy of defining multiscale/multiresolution images. It just makes life so much simpler. (Even just in microscopy we had N different formats which is where NGFF started unifying)
I absolutely love the idea of dynamic re-chunking, and it's something I've been experimenting with myself. It's also easy to imagine swapping compression on the server, e.g. from lossless to lossy.
> It just makes life so much simpler.
+1 to this. Thinking of Zarr as an API rather than a file format, a static Zarr dataset in cloud storage is indistinguishable on the client from one created dynamically on the server. The multiscale extension for Zarr essentially describes the endpoints a tile server would be responsible for implementing, and that the Zarr client will ask for. Changing chunk size, compression, etc. can all be expressed by changing the array metadata. If something like `titiler` adopted the multiscales extension, that would be very exciting.
IMO the `ZarrLoader` should be completely agnostic to the backend if possible. "Improving" a Zarr dataset for visualization can all be performed on the server and communicated in Zarr metadata. By default, it would be nice if the loader recognized the multiscales extension, but I could also see an option to pass explicit URLs for the separate ZarrArrays in the pyramid, e.g.
`"https://my-multiscales-dataset.zarr"`
vs.
`["https://my-multiscales-dataset.zarr/0", "https://my-multiscales-dataset.zarr/1", "https://my-multiscales-dataset.zarr/2"]`
> swapping compression on the server, e.g. from lossless to lossy.
Aside: it could be interesting to test out LERC with some Zarr data. It seems like a good candidate for compressing data to bring to the browser when you have some defined precision. I only see one mention of Zarr + LERC though.
> If something like `titiler` adopted the multiscales extension, that would be very exciting.
Titiler doesn't currently expose a Zarr API, but a Zarr extension is something we could discuss on that repo.
> IMO the `ZarrLoader` should be completely agnostic to the backend if possible
Agreed. Doesn't seem like a difficult requirement; don't see why a `ZarrLoader` would even need to know if the Zarr dataset is static or dynamic.
> Agreed. Doesn't seem like a difficult requirement; don't see why a `ZarrLoader` would even need to know if the Zarr dataset is static or dynamic.
Totally agree. I've just noticed that `xpublish` and other tools may introduce additional REST API endpoints, beyond chunk/metadata keys, and I'd like to avoid relying on any custom endpoints in a loader implementation.
Xpublish has extra endpoints for convenience, but clients don't have to use them. To the client, the data are indistinguishable from static zarr files served over http.
Quick mention here that we're discussing with @manzt creating an initial Zarr loader as part of #1441, based on what he already built as part of the Viv project.
I was also just made aware (thanks @vincentsarago) that a GDAL driver for Zarr is progressing in https://github.com/OSGeo/gdal/pull/3896. We should keep tabs on that to make sure the ZarrLoader here can read that data seamlessly.
Hello and thanks for all of your work on this incredible open source package and ecosystem!
At our Pangeo Community meeting today, we discussed our wish to integrate cloud-based weather and climate data stored in the Zarr format with the deck.gl ecosystem. (I noticed this has been discussed before in #1140.) There is a Zarr reader in javascript (zarr.js), so maybe that makes it easier. I understand that #1140 probably has to be resolved to make this possible, but I thought I'd just open a dedicated issue to track the idea.
Tagging @kylebarron, @point9repeating and @manzt who may be interested.