jkingslake opened this issue 2 years ago
I think that this is not an issue with xarray, zarr, or anything in the Python world, but rather an issue with how caching works on GCS public buckets: https://cloud.google.com/storage/docs/metadata
To test this, forget about xarray and zarr for a minute and just use gcsfs to list the bucket contents before and after your writes. I think you will find that the default cache lifetime of 3600 seconds means that you cannot "see" the changes to the bucket or the objects as quickly as needed in order to append.
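For context: publicly readable GCS objects are served by default with a `Cache-Control: public, max-age=3600` header, which is where the 3600-second figure above comes from. A minimal stdlib sketch (the helper name is mine, for illustration) of how one might parse such a header to see the staleness window:

```python
# Minimal sketch: parse a Cache-Control header such as the default
# "public, max-age=3600" that GCS attaches to publicly readable objects.
def max_age_seconds(cache_control):
    """Return the max-age directive in seconds, or None if absent."""
    for directive in cache_control.split(","):
        directive = directive.strip()
        if directive.startswith("max-age="):
            return int(directive.split("=", 1)[1])
    return None

print(max_age_seconds("public, max-age=3600"))  # -> 3600
print(max_age_seconds("no-store"))              # -> None
```

Any read within that window may legitimately be served a stale copy by the cache.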
Thanks for taking a look @rabernat.
The code below writes a new zarr and checks immediately if it's there using gcsfs. It seems to appear within a few seconds.
Is this what you meant?
%%time
import fsspec
import xarray as xr
import json
import gcsfs
# define a mapper to the ldeo-glaciology bucket. - needs a token
with open('../secrets/ldeo-glaciology-bc97b12df06b.json') as token_file:
token = json.load(token_file)
# get a mapper with fsspec for a new zarr
mapper = fsspec.get_mapper('gs://ldeo-glaciology/append_test/test11', mode='w', token=token)
# check what files are in there
fs = gcsfs.GCSFileSystem(project='pangeo-integration-te-3eea', mode='ab', cache_timeout = 0)
print('Files in the test directory before writing:')
filesBefore = fs.ls('gs://ldeo-glaciology/append_test/')
print(*filesBefore,sep='\n')
# define a simple dataset
ds0 = xr.Dataset({'temperature': (['time'], [50, 51, 52])}, coords={'time': [1, 2, 3]})
# write the simple dataset to zarr
ds0.to_zarr(mapper)
# check to see if the new file is there
print('Files in the test directory after writing:')
filesAfter = fs.ls('gs://ldeo-glaciology/append_test/')
print(*filesAfter,sep='\n')
Output:
Files in the test directory before writing:
ldeo-glaciology/append_test/test1
ldeo-glaciology/append_test/test10
ldeo-glaciology/append_test/test2
ldeo-glaciology/append_test/test3
ldeo-glaciology/append_test/test4
ldeo-glaciology/append_test/test5
ldeo-glaciology/append_test/test6
ldeo-glaciology/append_test/test7
ldeo-glaciology/append_test/test8
ldeo-glaciology/append_test/test9
Files in the test directory after writing:
ldeo-glaciology/append_test/test1
ldeo-glaciology/append_test/test10
ldeo-glaciology/append_test/test11
ldeo-glaciology/append_test/test2
ldeo-glaciology/append_test/test3
ldeo-glaciology/append_test/test4
ldeo-glaciology/append_test/test5
ldeo-glaciology/append_test/test6
ldeo-glaciology/append_test/test7
ldeo-glaciology/append_test/test8
ldeo-glaciology/append_test/test9
CPU times: user 130 ms, sys: 16.5 ms, total: 146 ms
Wall time: 2.19 s
Can you post the full stack trace of the error you get when you try to append?
Thanks for pointing out this cache feature @rabernat. I had no idea - it makes sense in general, but slows down testing if not known about! Anyway, for my case, when appending the second Zarr store to the first, the Zarr's size (using gsutil du) does indeed double. I'm new to cloud storage, but my hunch is that this suggests it was appended?
Can you post the full stack trace of the error you get when you try to append?
In my instance, there is no error, only this returned: <xarray.backends.zarr.ZarrStore at 0x7f662d31f3a0>
OK, I think I may understand what is happening.
## load the zarr store
ds_both = xr.open_zarr(mapper)
When you do this, zarr reads a file called gs://ldeo-glaciology/append_test/test5/temperature/.zarray. Since the data are public, I can look at it right now:
$ gsutil cat gs://ldeo-glaciology/append_test/test5/temperature/.zarray
{
"chunks": [
3
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<i8",
"fill_value": null,
"filters": null,
"order": "C",
"shape": [
6
],
"zarr_format": 2
}
Right now, it shows the shape is [6], as expected after the appending. However, if you read the file immediately after appending (within the 3600s max-age), you will get the cached copy. The cached copy will still be of shape [3] -- it won't know about the append.
To test this hypothesis, you would need to disable caching on the bucket. Do you have privileges to do that?
Right now, it shows the shape is [6], as expected after the appending. However, if you read the file immediately after appending (within the 3600s max-age), you will get the cached copy. The cached copy will still be of shape [3] -- it won't know about the append.
Ignorant question: is this cache relevant to client (Jupyter) side or server (GCS) side? It has been well over 3600s and I'm still not seeing the appended zarr when reading it in using Xarray.
To test this hypothesis, you would need to disable caching on the bucket. Do you have privileges to do that?
I tried to do this last night but did not have permission myself. Perhaps @jkingslake does?
Update: my local notebook accessing the public bucket does see the appended zarr store exactly as expected, while the 2i2c-hosted notebook still is not (been well over 3600s).
Also, I did as @jkingslake does above and set cache_timeout=0. From the GCSFS docs: "Set cache_timeout <= 0 for no caching" - this seems like the functionality we desire, yet I continue to only see the un-appended zarr.
Thanks for looking into this.
To test this hypothesis, you would need to disable caching on the bucket. Do you have privileges to do that?
I tried to do this last night but did not have permission myself. Perhaps @jkingslake does?
@porterdf you should have full permissions to do things like this. But in any case, I could only see how to change metadata for individual existing objects rather than the entire bucket. How do I edit the cache-control for the whole bucket?
I have tried writing the first dataset, then disabling caching for that object, then appending. I still do not see the full-length (shape = [6]) dataset when I reload it.
So there are two layers here where caching could be happening: the Python layer (gcsfs's listings cache, controlled by cache_timeout) and the GCS layer (HTTP caching of public objects via Cache-Control).
I propose we eliminate the python layer entirely for the moment. Whenever you load the dataset, its shape is completely determined by whatever zarr sees in gs://ldeo-glaciology/append_test/test5/temperature/.zarray. So try looking at this file directly. You can figure out its public URL and just do curl, e.g.
curl https://storage.googleapis.com/ldeo-glaciology/append_test/test5/temperature/.zarray
{
"chunks": [
3
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<i8",
"fill_value": null,
"filters": null,
"order": "C",
"shape": [
6
],
"zarr_format": 2
}
Run this from jupyterhub from the command line. Then try gcs.cat('ldeo-glaciology/append_test/test5/temperature/.zarray') and see if you see the same thing. Basically, just eliminate as many layers as possible from the problem until you get to the core issue.
Using curl, I get (shape [6]):
curl https://storage.googleapis.com/ldeo-glaciology/append_test/test30/temperature/.zarray
{
"chunks": [
3
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<i8",
"fill_value": null,
"filters": null,
"order": "C",
"shape": [
6
],
"zarr_format": 2
}
Using gsutil, I get (shape [6]):
gsutil cat gs://ldeo-glaciology/append_test/test30/temperature/.zarray
{
"chunks": [
3
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<i8",
"fill_value": null,
"filters": null,
"order": "C",
"shape": [
6
],
"zarr_format": 2
}
Using xarray in the jupyterhub I get (shape [3]):
import fsspec
import xarray as xr
import json
import gcsfs
mapper = fsspec.get_mapper('gs://ldeo-glaciology/append_test/test30', mode='r')
ds_both = xr.open_zarr(mapper)
len(ds_both.time)
3
Using gcsfs in the jupyterhub I get (shape [3]):
gcs = gcsfs.GCSFileSystem(project='pangeo-integration-te-3eea')
gcs.cat('ldeo-glaciology/append_test/test5/temperature/.zarray')
b'{\n "chunks": [\n 3\n ],\n "compressor": {\n "blocksize": 0,\n "clevel": 5,\n "cname": "lz4",\n "id": "blosc",\n "shuffle": 1\n },\n "dtype": "<i8",\n "fill_value": null,\n "filters": null,\n "order": "C",\n "shape": [\n 3\n ],\n "zarr_format": 2\n}'
But now I am really confused, because test5 from a few days ago shows up as shape [6]:
import fsspec
import xarray as xr
import json
import gcsfs
mapper = fsspec.get_mapper('gs://ldeo-glaciology/append_test/test5', mode='r')
ds_both = xr.open_zarr(mapper)
len(ds_both.time)
/tmp/ipykernel_1040/570416536.py:7: RuntimeWarning: Failed to open Zarr store with consolidated metadata, falling back to try reading non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
ds_both = xr.open_zarr(mapper)
6
@porterdf did you disable caching when you wrote the first zarr? How did you do that exactly?
Thought I would drop a related note here. Gcsfs just added support for fixed-key metadata: https://github.com/fsspec/gcsfs/pull/429. So if you are testing out different fsspec/gcsfs options for caching, make sure you are using gcsfs==2021.11.0.
Coming back to this a year later, I am still having the same issue.
Running gsutil locally
gsutil cat gs://ldeo-glaciology/append_test/test30/temperature/.zarray
shows shape 6:
{
"chunks": [
3
],
"compressor": {
"blocksize": 0,
"clevel": 5,
"cname": "lz4",
"id": "blosc",
"shuffle": 1
},
"dtype": "<i8",
"fill_value": null,
"filters": null,
"order": "C",
"shape": [
6
],
"zarr_format": 2
}
whereas running fsspec on leap-pangeo shows only shape 3:
import fsspec
import xarray as xr
import json
import gcsfs
mapper = fsspec.get_mapper('gs://ldeo-glaciology/append_test/test30', mode='r')
ds_both = xr.open_zarr(mapper)
len(ds_both.temperature)
And trying to append using a new toy dataset written from leap-pangeo has the same issue.
Any ideas on what to try next?
Your issue is that the consolidated metadata have not been updated:
import gcsfs
fs = gcsfs.GCSFileSystem()
# the latest array metadata
print(fs.cat('gs://ldeo-glaciology/append_test/test30/temperature/.zarray').decode())
# -> "shape": [ 6 ]
# the consolidated metadata
print(fs.cat('gs://ldeo-glaciology/append_test/test30/.zmetadata').decode())
# -> "shape": [ 3 ]
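The mismatch can be checked mechanically by comparing the shape recorded in the consolidated .zmetadata against the array's own .zarray. A sketch with inline JSON standing in for what fs.cat would return (the shapes mirror the thread; the JSON snippets themselves are illustrative, trimmed stand-ins):

```python
import json

# Stand-ins for fs.cat(...) results: the per-array metadata reflects the
# append, while the consolidated metadata does not.
zarray = json.loads('{"shape": [6], "chunks": [3], "zarr_format": 2}')
zmetadata = json.loads(
    '{"zarr_consolidated_format": 1,'
    ' "metadata": {"temperature/.zarray": {"shape": [3], "chunks": [3]}}}'
)

consolidated = zmetadata["metadata"]["temperature/.zarray"]["shape"]
print("consolidated:", consolidated)      # -> [3]  (stale)
print("per-array:   ", zarray["shape"])   # -> [6]  (latest)
print("in sync:", consolidated == zarray["shape"])  # -> False
```

When xarray opens the store with consolidated metadata, it is the stale [3] entry that determines what you see.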
There are two ways to fix this. One is to re-consolidate the metadata after appending (e.g. with zarr.consolidate_metadata(), as the fallback warning earlier in this thread suggests). The other is to ignore the consolidated metadata when reading:
ds = xr.open_dataset('gs://ldeo-glaciology/append_test/test30', engine='zarr', consolidated=False)
Thanks @rabernat.
Using consolidated=False when reading seems to work, but not immediately after the append, and there is very strange behavior where the size of the dataset changes each time you read it. So maybe this is the cache issue again.
It appears from here that the default caching metadata on each object in a bucket overrides any argument you send when loading.
But following this https://stackoverflow.com/questions/52499015/set-metadata-for-all-objects-in-a-bucket-in-google-cloud-storage I can turn off caching for all objects in the bucket with
gsutil setmeta -h "Cache-Control:no-store" gs://ldeo-glaciology/**
But I don't think this affects new objects.
So when writing new objects that I want to append to, maybe the approach is to write the first one, then turn off caching for that object, then continue to append.
This is my latest attempt to avoid the cache issue. It is not working. But I wanted to document it here for the next time this comes up.
import fsspec
import xarray as xr
import json
import gcsfs
## define a mapper to the ldeo-glaciology bucket
### needs a token
with open('/Users/jkingslake/Documents/misc/ldeo-glaciology-bc97b12df06b.json') as token_file:
token = json.load(token_file)
filename = 'gs://ldeo-glaciology/append_test/test56'
mapper = fsspec.get_mapper(filename, mode='w', token=token)
## define two simple datasets
ds0 = xr.Dataset({'temperature': (['time'], [50, 51, 52])}, coords={'time': [1, 2, 3]})
ds1 = xr.Dataset({'temperature': (['time'], [53, 54, 55])}, coords={'time': [4, 5, 6]})
## write the first ds to bucket
ds0.to_zarr(mapper)
Then, from the command line, I ran
gsutil setmeta -h "Cache-Control:no-store" gs://ldeo-glaciology/append_test/test56/**
to turn off caching for this zarr store and all the files associated with it.
## append the second ds to the same zarr store
ds1.to_zarr(mapper, mode='a', append_dim='time')
ds = xr.open_dataset('gs://ldeo-glaciology/append_test/test56', engine='zarr', consolidated=False)
len(ds.time)
3
At least, it sometimes does this; other times it works after a delay, and sometimes it works immediately.
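Given that reads can lag writes by up to the cache max-age, one pragmatic workaround (my suggestion, not something proposed in the thread) is to poll until the expected shape becomes visible before relying on it. A sketch with a hypothetical fetch_shape callable standing in for re-reading .zarray:

```python
import time

def wait_for_shape(fetch_shape, expected, timeout=10.0, interval=0.01):
    """Poll fetch_shape() until it returns `expected`; give up after `timeout` s."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_shape() == expected:
            return True
        time.sleep(interval)
    return False

# Toy stand-in: the "store" serves the stale shape [3] twice, then the fresh [6].
reads = iter([[3], [3], [6]])
print(wait_for_shape(lambda: next(reads), [6]))  # -> True
```

This does not fix the underlying caching, but it makes the "sometimes immediately, sometimes later" behavior explicit instead of silently reading stale data.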
What happened: Appending a toy dataset to an existing zarr store in GCS along the time dimension leaves the store unchanged.
What you expected to happen: The store to double in length, because I was appending a dataset with a length of 3 in the time dimension, to another dataset of the same size.
Minimal Complete Verifiable Example: To reproduce fully you will need the token, but maybe people can try using their own token.
Anything else we need to know?: It works as expected if you instead write and append to the pangeo scratch bucket, i.e. if you replace the ldeo-glaciology bucket path in the mapper with a pangeo scratch bucket path.
It also works as expected if I write and append to a local zarr.
Thanks for your help!
Environment: https://us-central1-b.gcp.pangeo.io/
Output of xr.show_versions()
INSTALLED VERSIONS ------------------ commit: None python: 3.8.6 | packaged by conda-forge | (default, Jan 25 2021, 23:21:18) [GCC 9.3.0] python-bits: 64 OS: Linux OS-release: 5.4.129+ machine: x86_64 processor: x86_64 byteorder: little LC_ALL: C.UTF-8 LANG: C.UTF-8 LOCALE: en_US.UTF-8 libhdf5: 1.10.6 libnetcdf: 4.7.4 xarray: 0.16.2 pandas: 1.2.1 numpy: 1.20.0 scipy: 1.6.0 netCDF4: 1.5.5.1 pydap: installed h5netcdf: 0.8.1 h5py: 3.1.0 Nio: None zarr: 2.6.1 cftime: 1.4.1 nc_time_axis: 1.2.0 PseudoNetCDF: None rasterio: 1.2.0 cfgrib: 0.9.8.5 iris: None bottleneck: 1.3.2 dask: 2021.01.1 distributed: 2021.01.1 matplotlib: 3.3.4 cartopy: 0.18.0 seaborn: None numbagg: None pint: 0.16.1 setuptools: 49.6.0.post20210108 pip: 20.3.4 conda: None pytest: None IPython: 7.20.0 sphinx: 3.4.3