keller-mark / pizzarr

Slice into Zarr arrays in R 🍕
https://keller-mark.github.io/pizzarr/
MIT License

listdir for http store #79

Closed. dblodgett-usgs closed this issue 1 month ago.

dblodgett-usgs commented 1 month ago

I'm still learning the ins and outs of the zarr spec, and still don't totally follow the nuances of consolidated metadata, but I was playing with this. Wondering what your preferred implementation is, @keller-mark?

User story

I want to be able to listdir() with an http store.

Preferred solution

Looking here: https://github.com/zarr-developers/zarr-python/issues/993 and doing a bit of background reading, it appears that listing an http store typically only works when there is consolidated metadata. Should we follow the same pattern?

Looking at the python implementation:

import zarr
import fsspec
import xarray

# this does not have .zmetadata
url = "https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0062A/6001240.zarr"

store = zarr.storage.FSStore(url)

fs = fsspec.implementations.http.HTTPFileSystem()
fs.isdir(url) # -> False

data = fs.get_mapper(url)['.zmetadata'] # -- Error
# FileNotFoundError: https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0062A/6001240.zarr/.zmetadata

xarray.open_zarr(url, consolidated=False)  # -> returns attributes for multiscales and omero but no data variables (which is expected from xarray with this dataset)

url = "https://usgs.osn.mghpcc.org/mdmf/gdp/hawaii_present.zarr/"

store = zarr.storage.FSStore(url)

fs = fsspec.implementations.http.HTTPFileSystem()
fs.isdir(url) # -> True

data = fs.get_mapper(url)['.zmetadata']

xarray.open_zarr("https://usgs.osn.mghpcc.org/mdmf/gdp/hawaii_present.zarr", consolidated=True) -> returns a nice xarray dataset

xarray.open_zarr("https://usgs.osn.mghpcc.org/mdmf/gdp/hawaii_present.zarr", consolidated=False) -> returns just attributes and no data variables.

And if I set up a little local file server that includes an index.html, with:

# https://github.com/DOI-USGS/rnz
servr::httd(dirname(rnz::z_demo()))

I can do this and the http store has isdir TRUE! Not sure we should be reading an html index and doing what fsspec does here -- that is a lot of complexity.

import zarr
import fsspec
import xarray

url = "http://127.0.0.1:4321/bcsd_obs_1999.zarr/"

store = zarr.storage.FSStore(url)

fs = fsspec.implementations.http.HTTPFileSystem()
fs.isdir(url) # -> True

xarray.open_zarr("http://127.0.0.1:4321/bcsd_obs_1999.zarr/", consolidated=True) # -> gives a nice xarray dataset
xarray.open_zarr("http://127.0.0.1:4321/bcsd_obs_1999.zarr/", consolidated=False) # -> gives an error that I don't follow about ".." in the path.
dblodgett-usgs commented 1 month ago

Found the code in question.

https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/http.py#L40

dblodgett-usgs commented 1 month ago

See WIP for this here: https://github.com/dblodgett-usgs/pizzarr/pull/1/files

How do you feel about use of memoise as I have in there?

one_hour <- 60^2
# per-session http get cache; the ~memoise::timeout(one_hour) formula keys the
# cache on a value that rolls over hourly, so cached responses stop being
# reused after at most an hour
mem_get <- memoise::memoise(\(client, path) client$get(path),
                            ~memoise::timeout(one_hour))

I added vcr testing for crul interactions. It seems like the most straightforward option for mocking http tests right now. Any issue with how I set that up?
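
For reference, a typical vcr + crul setup inside a package's testthat suite looks roughly like this (the file names, cassette name, and the HttpStore$new()/listdir() calls are illustrative, not necessarily what the PR does):

# tests/testthat/setup-vcr.R (illustrative)
library(vcr)
vcr::vcr_configure(dir = vcr::vcr_test_path("fixtures"))

# tests/testthat/test-http-store.R (illustrative)
test_that("listdir works against a consolidated http store", {
  vcr::use_cassette("http_store_listdir", {
    store <- HttpStore$new("https://usgs.osn.mghpcc.org/mdmf/gdp/hawaii_present.zarr")
    expect_true(length(store$listdir()) > 0)
  })
})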

(simultaneous comments FTW! -- happy monday)

keller-mark commented 1 month ago

Wondering what your preferred implementation is

My first thought would be to add an optional property to the Store abstract class for keeping consolidated metadata at the store-level. If the consolidated metadata is present, then it can be used within the listdir function implementation.
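
Something like the following could sit inside that listdir implementation (listdir_from_consolidated is a hypothetical helper and the metadata shape is an assumption, not the actual pizzarr internals): given parsed .zmetadata, the immediate children of a path fall out of the metadata keys.

# given parsed consolidated metadata (the "metadata" object of .zmetadata),
# list the immediate children under `path`
listdir_from_consolidated <- function(zmetadata, path = "") {
  keys <- names(zmetadata$metadata)  # e.g. ".zgroup", ".zattrs", "temp/.zarray"
  prefix <- if (nzchar(path)) paste0(sub("/+$", "", path), "/") else ""
  rest <- substring(keys[startsWith(keys, prefix)], nchar(prefix) + 1)
  sort(unique(vapply(strsplit(rest, "/"), `[[`, character(1), 1)))
}

A real implementation would also decide whether to filter the metadata documents themselves (".zgroup", ".zattrs") out of the listing.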

it appears that listing an http store only typically works when there is consolidated metadata. Should we follow the same pattern?

Yes, I think it is reasonable to say that listdir support is only guaranteed if consolidated metadata is present. I actually did not even realize that Zarr would ever look in index.html directory listings. (In most of the S3 buckets that I use day-to-day, directory listing is turned off.)

keller-mark commented 1 month ago

Haha, happy Monday! Thanks for sharing the implementation.

How do you feel about use of memoise as I have in there?

It makes sense, especially for the HTTP store! I think we should add (and document) a mechanism to clear the cache.

I also wonder if the cache should be a property on the store instance instead of a global variable. For more flexibility, the constructor could offer options to specify a timeout, an option to opt out of caching, and/or a custom cache instance to pass to the cache parameter of memoise.
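
A rough sketch of what that could look like (class and method names are hypothetical, not the PR's code): the memoised getter lives on the store instance, the constructor exposes the timeout and an opt-out, and memoise::forget() provides the clearing mechanism. A custom cachem cache could likewise be passed through to memoise's cache argument.

library(R6)

HttpStoreSketch <- R6Class("HttpStoreSketch",
  public = list(
    initialize = function(url, cache_timeout = 3600, use_cache = TRUE) {
      private$client <- crul::HttpClient$new(url = url)
      get_fun <- function(path) private$client$get(path)
      private$mem_get <- if (use_cache) {
        # cached responses stop being reused after cache_timeout seconds
        memoise::memoise(get_fun, ~memoise::timeout(cache_timeout))
      } else {
        get_fun
      }
    },
    get_item = function(path) private$mem_get(path),
    clear_cache = function() {
      if (memoise::is.memoised(private$mem_get)) memoise::forget(private$mem_get)
      invisible(self)
    }
  ),
  private = list(client = NULL, mem_get = NULL)
)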

dblodgett-usgs commented 1 month ago

+1 on the cache control on the store. I'm thinking of the global in-memory cache as kind of like a web browser cache, but I can see how it would be nice to associate it with the store object, and it's not all that hard -- I've been banging my head on caching in one of my other packages. Hopefully I can offer some hard lessons learned here.

+1 on requiring consolidated metadata.

Out of curiosity, is the "multiresolutions" convention you have in your demos something there is a specification for? The xarray zarr convention doesn't have any metadata other than fully consolidated metadata that gets you from group to arrays.

dblodgett-usgs commented 1 month ago

OK -- I just moved the memoise function under the store object and added a cache timeout control.

https://github.com/dblodgett-usgs/pizzarr/pull/1/files#diff-5dcc06646e43f4e7ef995b41f4923c52fe3e3cc6f3f405fdaa3864c1d2b0718dR344

Once #81 has merged, I'll queue that up for this repository.