intake / intake-xarray

Intake plugin for xarray
https://intake-xarray.readthedocs.io/
BSD 2-Clause "Simplified" License
74 stars 36 forks source link

fix for zarr.ZipStore #66

Closed raphaeldussin closed 4 years ago

raphaeldussin commented 4 years ago

intake-xarray is presently not able to work with zarr.ZipStore. The issue seems to be coming from the fsspec mapping. From what I understood of fsspec mappings, I am not sure if it is able to handle local zip files or if it's desirable. The proposed fix adds an option to bypass the mapper and use the local path to the file.

problem and solution can be tested with:

cat catalog.yml

plugins:
  source:
    - module: intake_xarray
sources:
  temp_directory:
    description: zarr directory
    driver: zarr
    args:
      urlpath: 'temp_dir'
      consolidated: False
    metadata:
      origin_url: ''
  temp_directory_consolidated:
    description: zarr directory consolidated
    driver: zarr
    args:
      urlpath: 'temp_dirc'
      consolidated: True
    metadata:
      origin_url: ''
  temp_zip:
    description: zarr zipstore
    driver: zarr
    args:
      urlpath: 'temp_zip.zip'
      consolidated: False
    metadata:
      origin_url: ''
  temp_zip_consolidated:
    description: zarr zipstore consolidated
    driver: zarr
    args:
      urlpath: 'temp_zipc.zip'
      consolidated: True
    metadata:
      origin_url: ''
  temp_zip_fix:
    description: zarr zipstore
    driver: zarr
    args:
      urlpath: 'temp_zip.zip'
      consolidated: False
      force_local: True
    metadata:
      origin_url: ''
  temp_zip_consolidated_fix:
    description: zarr zipstore consolidated
    driver: zarr
    args:
      urlpath: 'temp_zipc.zip'
      force_local: True
      consolidated: True
    metadata:
      origin_url: ''
import xarray as xr
import numpy as np
import zarr

ds = xr.Dataset()
ds['temp'] = xr.DataArray(data=np.random.rand(180, 360), dims=('lat', 'lon'))
ds['lon'] = xr.DataArray(data=np.arange(360), dims=('lon'))
ds['lat'] = xr.DataArray(data=np.arange(-90,90), dims=('lat'))

store = zarr.DirectoryStore('temp_dirc')
ds.to_zarr(store, consolidated=True, mode='w')

store = zarr.DirectoryStore('temp_dir')
ds.to_zarr(store, consolidated=False, mode='w')

store = zarr.ZipStore('temp_zipc.zip', mode='w')
ds.to_zarr(store, consolidated=True, mode='w')
store.close()

store = zarr.ZipStore('temp_zip.zip', mode='w')
ds.to_zarr(store, consolidated=False, mode='w')
store.close()

import intake

cat = intake.open_catalog('catalog.yml')

ds1 = cat.temp_directory.to_dask()
print(ds1)

ds2 = cat.temp_directory_consolidated.to_dask()
print(ds2)

try:
    ds3 = cat.temp_zip.to_dask()
    print(ds3)
except:
    print('failed to open entry temp_zip')

try:
    ds4 = cat.temp_zip_consolidated.to_dask()
    print(ds4)
except:
    print('failed to open entry temp_zip_consolidated')

ds5 = cat.temp_zip_fix.to_dask()
print(ds5)

ds6 = cat.temp_zip_consolidated_fix.to_dask()
print(ds6)
martindurant commented 4 years ago

Using fsspec directly does work, although the incantation s a little strange:

  temp_zip:
    description: zarr zipstore
    driver: zarr
    args:
      urlpath: 'zip://'
      storage_options:
        fo: 'temp_zip.zip'
      consolidated: False

Just like with fsspec.open, fsspec.get_mapper ought to support compound URLs, in this case "zip::temp_zip.zip" (and no extra storage options), but this is not yet done.

raphaeldussin commented 4 years ago

Hi @martindurant and thanks for the help. I'm trying this out rn and I can make it work for consolidated ZipStore but not un-consolidated. Any idea about what the problem might be:

  temp_zip_martin:
    description: zarr zipstore
    driver: zarr
    args:
      urlpath: 'zip://'
      storage_options:
        fo: 'temp_zip.zip'
      consolidated: False
    metadata:
      origin_url: ''

the error is:

KeyError: "There is no item named '.zattrs/.zarray' in the archive"

my hunch is that it is blind to the difference between directories and files and expect to find a .zarray under .zattrs which is just a file.

what do you think?

martindurant commented 4 years ago

The exact same thing did work for me, unconsolidated. Are you on windows by any chance?

raphaeldussin commented 4 years ago

my config is:

OSX Catalina anaconda python 3.6.7 intake v 0.5.5 fsspec v 0.6.2

martindurant commented 4 years ago

Would you mind trying with fsspec master (or indeed the chained_mapper branch)?

raphaeldussin commented 4 years ago

I've tried both and can't get it to work either. weird...

can you post your test?

martindurant commented 4 years ago
!cat catalog.yml
sources:
  temp_zip:
    description: zarr zipstore
    driver: zarr
    args:
      urlpath: 'zip://'
      storage_options:
        fo: temp_zip.zip
      consolidated: False
    metadata:
      origin_url: ''

ds = xr.Dataset()
ds['temp'] = xr.DataArray(data=np.random.rand(180, 360), dims=('lat', 'lon'))
ds['lon'] = xr.DataArray(data=np.arange(360), dims=('lon'))
ds['lat'] = xr.DataArray(data=np.arange(-90,90), dims=('lat'))

store = zarr.ZipStore('temp_zip.zip', mode='w')
ds.to_zarr(store, consolidated=False, mode='w')
store.close().  # seems to require kernel restart here, write is incomplete

cat = intake.open_catalog('catalog.yml')
cat.temp_zip.to_dask()
raphaeldussin commented 4 years ago

great! I'm gonna call docker to the rescue to make sure it's not due to something in my config. will get back to you soon

martindurant commented 4 years ago

Perhaps if you show me how you are calling docker, I can follow it. Obviously, it would be good to test this kind of thing here, but fsspec should be released first. Are you aware of https://github.com/intake/intake-xarray/pull/66 ? That might make things more consistent.

raphaeldussin commented 4 years ago

@martindurant I'm going to create and point you to a github repo with my dockerfiles once I'm done with afternoon meetings. So far fsspec master works with the storage_option but your new branch (https://github.com/intake/filesystem_spec/pull/282) does not work with the new syntax.

martindurant commented 4 years ago

fsspec master works with the storage_option

This is progress!

raphaeldussin commented 4 years ago

@martindurant docker tests here: https://github.com/raphaeldussin/docker_tests_zipstore.git

martindurant commented 4 years ago

The chained version doesn't seem to install the right version

git+https://github.com/martindurant/filesystem_spec.git#chained_mapper

In my case, to test, I did extra steps to be certain

git clone https://github.com/martindurant/filesystem_spec
cd filesystem_spec
git checkout chained_mapper
pip3 install -e
cd ..
martindurant commented 4 years ago

(and it worked)

raphaeldussin commented 4 years ago

@ and # keys are too close for my own good... this works for me as well now.

the change to fsspec is enough for my application. This PR has no reason to be anymore. many thanks @martindurant