Zarr store access from ScienceBase

theobarnhart-USGS commented 1 year ago

@rsignell-usgs suggested I ask this question here. I'm working on a data release on ScienceBase that will be a collection of zarr stores and I would like to access them via an Intake catalog. I am having a hard time getting the Intake catalog to work.

I zipped my test zarr store without compression to allow it to work with the ScienceBase uploader, moved it from Caldera to ScienceBase via Globus, then published it to my repository's public S3 bucket. I then configured my Intake catalog (3rd entry, cc90-monthly-ziptest-cloud) to hit the store, which I think is where my problem is, but maybe it is somewhere else:

plugins:
    source:
        - module: intake_xarray
sources:
    cc30-monthly-onprem: # this works!
        driver: zarr
        description: 'Crown of the Continent monthly 30 m SnowModel Simulation on Caldera'
        args:
            urlpath: '/caldera/hytest_scratch/scratch/tbarnhart/ccsp_zarr/CC30_*_monthly.zarr'

    cc90-monthly-ziptest: # this works!
        driver: zarr
        description: 'Crown of the Continent monthly 90 m SnowModel Simulation SWED and RPRE zip zarr on Caldera'
        args:
            urlpath: '/caldera/hytest_scratch/scratch/tbarnhart/ccsp_zarr/CC90/CC90_*_monthly.zarr.zip'

    cc90-monthly-ziptest-cloud: # s3 example, DOES NOT WORK
        driver: zarr
        description: 'Crown of the Continent monthly 90 m SnowModel Simulation SWED zarr on cloud'
        args:
            urlpath: 's3://prod-is-usgs-sb-prod-publish/6400b6dad34edc0ffaf4ef77/CC90_swed_monthly_test.zarr.zip'
            storage_options:
                anon: True
            consolidated: False

    cc90-monthly-ziptest-httpscloud: # does not work
        driver: zarr
        description: 'Crown of the Continent monthly 90 m SnowModel Simulation SWED on cloud via HTTPS'
        args:
            urlpath: 'https://prod-is-usgs-sb-prod-publish.s3.amazonaws.com/6400b6dad34edc0ffaf4ef77/CC90_swed_monthly.zarr.zip'

Here is the error message I get when I try and access cc90-monthly-ziptest-cloud:

I went down some rabbit holes trying to solve the access issue. It might be due to a permissions issue within the zarr zip store, which I tried to solve by writing the store out using fsspec with acl = 'public-read' (https://github.com/pydata/xarray/issues/5918#issuecomment-961211401) and then zipping it, but that did not change the error. My Intake catalog works when the zipstore is on Caldera (cc90-monthly-ziptest), but not after I put it on ScienceBase (cc90-monthly-ziptest-cloud).
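For reference, the "zip the store without compression" step can be sketched with the Python standard library alone. This is only an illustration, not the exact procedure used for the release: the helper name `zip_store_uncompressed` and the tiny stand-in store (file names and contents) are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path


def zip_store_uncompressed(store_dir: Path, zip_path: Path) -> None:
    """Pack every file under store_dir into zip_path with no compression."""
    with zipfile.ZipFile(zip_path, mode="w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(store_dir.rglob("*")):
            if path.is_file():
                # Keys inside the archive are relative to the store root,
                # e.g. ".zgroup" or "swed/0.0.0", not absolute paths.
                zf.write(path, arcname=path.relative_to(store_dir).as_posix())


# Build a tiny stand-in "store" (hypothetical names and contents).
tmp = Path(tempfile.mkdtemp())
store = tmp / "demo.zarr"
(store / "swed").mkdir(parents=True)
(store / ".zgroup").write_text('{"zarr_format": 2}')
(store / "swed" / "0.0.0").write_bytes(b"\x00" * 16)

archive = tmp / "demo.zarr.zip"
zip_store_uncompressed(store, archive)

# Inspect the result: member name -> compression method.
with zipfile.ZipFile(archive) as zf:
    members = {info.filename: info.compress_type for info in zf.infolist()}
```

Every member should report `zipfile.ZIP_STORED`, meaning the bytes in the archive are the same bytes that were in the directory store.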

Any suggestions or examples to follow would be greatly appreciated. Thank you! It would be really awesome to publish these data in an easily accessible format so that we don't have to duplicate efforts (and 3.5 TB) if cooperators want a web application down the line that uses these data.

alaws-USGS commented 1 year ago

@theobarnhart-USGS since it is a 403 error, my first thought is that it is an issue with the S3 bucket permissions: https://stackoverflow.com/questions/60365089/clienterror-an-error-occurred-403-when-calling-the-headobject-operation-forb

Additionally, when I try to access the file using fsspec.get_mapper and xarray.open_zarr, I get a permission-denied error as well.

theobarnhart-USGS commented 1 year ago

Thanks @alaws-USGS, I've asked ScienceBase if they can open up the bucket some more.

rsignell-usgs commented 1 year ago

@theobarnhart-USGS yes, I think you can request that the dataset be given public access. You will not be able to list it, but you should be able to read it. In other words, the following should work, returning some info about the file. But it doesn't:

import fsspec
fs = fsspec.filesystem('s3', anon=True)
url = 's3://prod-is-usgs-sb-prod-publish/6400b6dad34edc0ffaf4ef77/CC90_swed_monthly_test.zarr.zip'
fs.info(url)

returns

PermissionError: Forbidden

Here's what it should look like:

url = 's3://prod-is-usgs-sb-prod-publish/618e83cad34ec04fc9caa715/South_Carolina_CoNED_Topobathy_DEM_1m.tif'
fs.info(url)

returns:

{'ETag': '"d6cdcb0bfa78956d660bdb75cb71c521-905"',
 'LastModified': datetime.datetime(2022, 3, 17, 18, 25, 35, tzinfo=tzutc()),
 'size': 94806684071,
 'name': 'prod-is-usgs-sb-prod-publish/618e83cad34ec04fc9caa715/South_Carolina_CoNED_Topobathy_DEM_1m.tif',
 'type': 'file',
 'StorageClass': 'STANDARD',
 'VersionId': None,
 'ContentType': 'image/tiff'}

I didn't know you could access a zarr dataset that has been zipped. I've never done that. Pretty cool if that works!
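The same "can I read this one object anonymously?" check can be sketched without fsspec, using a plain HTTPS HEAD request. This is a rough sketch under assumptions: the helper names `s3_to_https` and `head_status` are made up for illustration, and the virtual-hosted URL pattern is simply the one used in the catalog's HTTPS entry earlier in this thread. The sketch has not been run against the bucket, so no status codes are claimed here.

```python
import urllib.error
import urllib.request
from urllib.parse import urlparse


def s3_to_https(s3_url: str) -> str:
    """Turn s3://bucket/key into https://bucket.s3.amazonaws.com/key."""
    parsed = urlparse(s3_url)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    return f"https://{parsed.netloc}.s3.amazonaws.com{parsed.path}"


def head_status(https_url: str, timeout: float = 10.0) -> int:
    """Send an unauthenticated HEAD request and return the HTTP status code."""
    request = urllib.request.Request(https_url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 403 and 404 arrive as exceptions; report them as plain codes.
        return err.code


https_url = s3_to_https(
    "s3://prod-is-usgs-sb-prod-publish/6400b6dad34edc0ffaf4ef77/"
    "CC90_swed_monthly_test.zarr.zip"
)
# head_status(https_url) would then report whether the object is readable
# without credentials (network access required, so it is not called here).
```

A HEAD request touches a single object and does not need list permission on the bucket, which is the distinction made above between being able to read a file and being able to list it.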

theobarnhart-USGS commented 1 year ago

Thanks @rsignell-usgs, I can get file info for my test file (although this made me realize that I had forgotten to publish that particular file, now fixed). I still cannot open it, though. I can open that tiff you linked. I wonder if there is an issue with a zarr zipstore in particular. Does Hytest happen to have a very public bucket where I could try one? I haven't been able to get a more traditional zarr store into ScienceBase because their system does not allow directories to be uploaded.

alaws-USGS commented 1 year ago

@theobarnhart-USGS Have you tried using xr.open_zarr(fsspec.get_mapper(zarr_url))?

theobarnhart-USGS commented 1 year ago

@alaws-USGS, yes, that yielded the same error:

rsignell-usgs commented 1 year ago

I know you have to ask SB to make data stored on S3 public. They are not public by default.

I know this because a few weeks ago Matt Cushing had two tiff files of bathymetry published on SB, one for South Carolina and one for North Carolina. We could access the South Carolina one without credentials, but not the North Carolina one.

He made a request to Drew Ignizio and later that day we could access the North Carolina one as well.

theobarnhart-USGS commented 1 year ago

Hi @rsignell-usgs and @alaws-USGS,

ScienceBase is still looking into the repo. All of the files are listed as being on the public side of my repo bucket, though, and I can download them without credentials. I wonder if there is something different between a zarr or netCDF file in S3 and a tiff. I've had very good luck working with cloud optimized geotiffs on the ScienceBase public cloud.

I did a kerchunk test this morning in my bucket and could not get that to work either.

theobarnhart-USGS commented 1 year ago

I met with the ScienceBase team, and it sounds like if Zarr requires listing files inside the Zarr store (be it a zip store or a directory store), this will likely not work. They cannot allow listing on just one bucket; it would have to be enabled for all of ScienceBase, and that is something they don't want to do for a variety of reasons.
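One detail about the zip-store case may matter here: a zip archive carries its own table of contents (the central directory) at the end of the file, so the member names can be recovered from the object's bytes alone. The stdlib sketch below shows that in memory with made-up keys and contents. Whether the fsspec/s3fs stack issues bucket list calls on top of this is a separate question that the sketch does not answer.

```python
import io
import zipfile

# Write a tiny uncompressed archive into memory (hypothetical keys/contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr(".zgroup", '{"zarr_format": 2}')
    zf.writestr("swed/.zarray", "{}")
    zf.writestr("swed/0.0.0", b"\x00" * 16)

raw = buffer.getvalue()

# The end-of-central-directory record starts with the signature "PK\x05\x06"
# and sits at the tail of the archive.
eocd_offset = raw.rfind(b"PK\x05\x06")

# Reopen from the raw bytes only: names come from the central directory,
# not from any directory or bucket listing.
with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    keys = sorted(zf.namelist())
    chunk = zf.read("swed/0.0.0")
```

In other words, "listing" inside a zip store means reading part of one object, which is a different permission from listing the bucket.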

I took a look through the Zarr code and it does look like there is some listing of directory stores and that there are slower ways to access the stores without listing, but I'm not sure how to disable listing. https://zarr.readthedocs.io/en/stable/api/storage.html#zarr.storage.listdir
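One mechanism that may be relevant to avoiding listing is Zarr's consolidated metadata, where all of the store's metadata documents are copied into a single `.zmetadata` key. The sketch below imitates that layout with a plain dict standing in for the store, as I understand the Zarr v2 format. It is not Zarr's own implementation (in practice one would call `zarr.consolidate_metadata`), and the helper names and store contents are hypothetical.

```python
import json

METADATA_SUFFIXES = (".zgroup", ".zarray", ".zattrs")


def consolidate(store: dict) -> dict:
    """Return a copy of store with all metadata gathered under '.zmetadata'."""
    metadata = {
        key: json.loads(value)
        for key, value in store.items()
        if key.endswith(METADATA_SUFFIXES)
    }
    document = {"zarr_consolidated_format": 1, "metadata": metadata}
    out = dict(store)
    out[".zmetadata"] = json.dumps(document).encode()
    return out


def array_names(store: dict) -> list:
    """Find arrays with a single key lookup instead of listing the store."""
    document = json.loads(store[".zmetadata"])
    return sorted(
        key[: -len("/.zarray")]
        for key in document["metadata"]
        if key.endswith("/.zarray")
    )


# Stand-in store: keys mapped to bytes (placeholder shapes and attributes).
store = {
    ".zgroup": b'{"zarr_format": 2}',
    "swed/.zarray": b'{"zarr_format": 2, "shape": [12, 4, 4], "chunks": [1, 4, 4]}',
    "swed/.zattrs": b"{}",
    "swed/0.0.0": b"\x00" * 16,
}
consolidated = consolidate(store)
names = array_names(consolidated)
```

With the metadata in one known key, a reader can discover the arrays and compute chunk keys from shapes and chunk sizes, so each access is a direct key lookup rather than a listing.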

This Hytest dataset seems most similar to what I am trying to accomplish on ScienceBase S3.

conus404-hourly-osn:
    driver: zarr
    description: "CONUS404 Hydro Variable subset, 40 years of hourly values. These files were created from wrfout model output files (see ScienceBase data release for more details: https://www.sciencebase.gov/catalog/item/6372cd09d34ed907bf6c6ab1). This dataset is stored on AWS S3 cloud storage in a requester-pays bucket. You can work with this data for free in any environment (there are no egress fees)."
    args:
      urlpath: 's3://rsignellbucket2/hytest/conus404/conus404_hourly_202302.zarr'
      consolidated: true
      storage_options:
        anon: true
        requester_pays: false
        client_kwargs:
          endpoint_url: https://renc.osn.xsede.org

@rsignell-usgs, would it be possible to try one of my Zarr stores in your bucket to see if the issue is the configuration of ScienceBase S3?

hytest-org / hytest

Zarr store access from ScienceBase #299