pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

ENH: Compute hash of xarray objects #4738

Open andersy005 opened 3 years ago

andersy005 commented 3 years ago

Is your feature request related to a problem? Please describe.

I'm working on some caching/data-provenance functionality for xarray objects, and I realized that there's no standard, efficient way of computing their hashes.

Describe the solution you'd like

It would be useful to have a configurable, reliable/standard .hexdigest() method on xarray objects. For example, zarr provides a digest method that returns a digest/hash of the data:

In [16]: import zarr

In [17]: z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))

In [18]: z.hexdigest() # uses sha1 by default for speed
Out[18]: '7162d416d26a68063b66ed1f30e0a866e4abed60'

In [20]: z.hexdigest(hashname='sha256')
Out[20]: '46fc6e52fc1384e37cead747075f55201667dd539e4e72d0f372eb45abdcb2aa'

I'm thinking that a built-in hashing mechanism in xarray would provide a more reliable way to treat metadata such as global attributes, encoding, etc. during the hash computation.
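To make this concrete, here is a minimal sketch of what such a helper could look like. The name hexdigest, its signature, and the exact set of metadata folded into the digest are assumptions on my part, not a settled design:

import hashlib

import numpy as np
import xarray as xr

def hexdigest(ds: xr.Dataset, hashname: str = 'sha1') -> str:
    # Hypothetical helper: fold global attrs, per-variable metadata, and raw
    # data bytes into a single digest. Object dtypes (e.g. cftime arrays)
    # would need special handling; this sketch assumes numeric data.
    h = hashlib.new(hashname)
    for key in sorted(ds.attrs):
        h.update(repr((key, ds.attrs[key])).encode())
    for name in sorted(ds.variables):
        var = ds.variables[name]
        h.update(repr((name, var.dims, sorted(var.attrs.items()))).encode())
        h.update(np.ascontiguousarray(var.values).tobytes())
    return h.hexdigest()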

Describe alternatives you've considered

So far, I am using joblib's default hasher, the joblib.hash() function. However, I am in favor of having a configurable, built-in hasher that is aware of xarray's data model and quirks :)

In [1]: import joblib

In [2]: import xarray as xr

In [3]: ds = xr.tutorial.open_dataset('rasm')

In [5]: joblib.hash(ds, hash_name='sha1')
Out[5]: '3e5e3f56daf81e9e04a94a3dff9fdca9638c36cf'

In [8]: ds.attrs = {}

In [9]: joblib.hash(ds, hash_name='sha1')
Out[9]: 'daab25fe735657e76514040608fadc67067d90a0'


shoyer commented 3 years ago

Interesting! Do pandas or dask have anything like this?

andersy005 commented 3 years ago

Pandas has a built-in utility function pd.util.hash_pandas_object:

In [1]: import pandas as pd

In [3]: df = pd.DataFrame({'A': [4, 5, 6, 7], 'B': [10, 20, 30, 40], 'C': [100, 50, -30, -50]})

In [4]: df
Out[4]:
   A   B    C
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50

In [6]: row_hashes = pd.util.hash_pandas_object(df)

In [7]: row_hashes
Out[7]:
0    14190898035981950066
1    16858535338008670510
2     1055569624497948892
3     5944630256416341839
dtype: uint64

Combining the returned value of hash_pandas_object() with Python's hashlib gives something one can work with:

In [8]: import hashlib

In [10]: hashlib.sha1(row_hashes.values).hexdigest() # Compute overall hash of all rows.
Out[10]: '1e1244d9b0489e1f479271f147025956d4994f67'

Regarding dask, I have no idea :) cc @TomAugspurger

TomAugspurger commented 3 years ago

IIUC, something like https://github.com/dask/dask/blob/4a7a2438219c4ee493434042e50f4cdb67b6ec9f/dask/base.py#L778 is what you're looking for. Further down in that module we register tokenizers for various types like pandas DataFrames and NumPy ndarrays.
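For context, the extension point here is dask's normalize_token dispatch; registering a normalizer for a custom type looks roughly like the following (the Point class is purely illustrative):

from dask.base import normalize_token, tokenize

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

@normalize_token.register(Point)
def normalize_point(p):
    # Reduce the object to a deterministic, hashable structure
    # that tokenize() can digest reproducibly.
    return (Point.__name__, p.x, p.y)

assert tokenize(Point(1, 2)) == tokenize(Point(1, 2))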

shoyer commented 3 years ago

I asked because this isn't an operation I've used directly on pandas objects in the past. I'm not opposed, but my suggestion would be to write a separate utility function, e.g., in xarray.util (similar to what is in pandas), rather than making it a method on xarray objects themselves.
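A rough sketch of such a utility (the name hash_xarray_object and its placement are just an illustration, not a proposed API) could delegate to pandas' per-row hashing for each variable:

import hashlib

import pandas as pd
import xarray as xr

def hash_xarray_object(ds: xr.Dataset, hashname: str = 'sha1') -> str:
    # Hypothetical xarray.util-style helper built on pd.util.hash_pandas_object.
    h = hashlib.new(hashname)
    for name in sorted(ds.variables):
        series = pd.Series(ds.variables[name].values.ravel())
        h.update(pd.util.hash_pandas_object(series).values.tobytes())
    return h.hexdigest()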

dcherian commented 3 years ago

@andersy005 if you can rely on dask always being present, dask.base.tokenize(xarray_object) will do what you want.

andersy005 commented 3 years ago

@andersy005 if you can rely on dask always being present, dask.base.tokenize(xarray_object) will do what you want.

👍🏽 dask.base.tokenize() achieves what I need for my use case.

I asked because this isn't an operation I've used directly on pandas objects in the past. I'm not opposed, but my suggestion would be to write a separate utility function, e.g., in xarray.util (similar to what is in pandas) rather than making it method on xarray objects themselves.

Due to the simplicity of dask.base.tokenize(), I am now wondering whether it's even worth having a utility function in xarray.util for computing a deterministic token (~hash) for an xarray object. I'm happy to work on this if there's interest from other folks; otherwise I will close this issue.
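To illustrate how little such a wrapper would add (names hypothetical):

import dask.base

def token(obj) -> str:
    # Hypothetical xarray.util helper: delegate entirely to dask's tokenizer.
    return dask.base.tokenize(obj)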

andersy005 commented 2 years ago

@andersy005 if you can rely on dask always being present, dask.base.tokenize(xarray_object) will do what you want.

@dcherian, I just realized that dask.base.tokenize doesn't return a deterministic token for xarray objects:

In [2]: import dask, xarray as xr

In [3]: ds = xr.tutorial.open_dataset('rasm')

In [4]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[4]: False

In [5]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[5]: False

The issue appears to be caused by the coordinates, which are used in __dask_tokenize__:

https://github.com/pydata/xarray/blob/dbc02d4e51fe404e8b61656f2089efadbf99de28/xarray/core/dataarray.py#L870-L873
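For reference, the linked method reads roughly as follows (paraphrased from memory; see the link above for the exact lines):

def __dask_tokenize__(self):
    from dask.base import normalize_token
    return normalize_token((type(self), self._variable, self._coords, self._name))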

In [8]: dask.base.tokenize(ds.Tair.data) == dask.base.tokenize(ds.Tair.data)
Out[8]: True
In [16]: dask.base.tokenize(ds.Tair._coords) == dask.base.tokenize(ds.Tair._coords)
Out[16]: False

Is this the expected behavior or am I missing something?

andersy005 commented 2 years ago

The issue appears to be caused by the coordinates which are used in __dask_tokenize__

I tried running the reproducer above and things seem to be working fine. I can't for the life of me understand why I got non-deterministic behavior four hours ago :(

In [1]: import dask, xarray as xr

In [2]: ds = xr.tutorial.open_dataset('rasm')

In [3]: dask.base.tokenize(ds) == dask.base.tokenize(ds)
Out[3]: True

In [4]: dask.base.tokenize(ds.Tair._coords) == dask.base.tokenize(ds.Tair._coords)
Out[4]: True
In [5]: xr.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.9.7 | packaged by conda-forge | (default, Sep 29 2021, 20:33:18) 
[Clang 11.1.0 ]
python-bits: 64
OS: Darwin
OS-release: 20.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: 1.12.1
libnetcdf: 4.8.1

xarray: 0.20.1
pandas: 1.3.4
numpy: 1.20.3
scipy: 1.7.3
netCDF4: 1.5.8
pydap: None
h5netcdf: 0.11.0
h5py: 3.6.0
Nio: None
zarr: 2.10.3
cftime: 1.5.1.1
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: None
dask: 2021.11.2
distributed: 2021.11.2
matplotlib: 3.5.0
cartopy: None
seaborn: None
numbagg: None
fsspec: 2021.11.1
cupy: None
pint: 0.18
sparse: None
setuptools: 59.4.0
pip: 21.3.1
conda: None
pytest: None
IPython: 7.30.0
sphinx: 4.3.1

andersy005 commented 2 years ago

Okay... I think the following comment is still valid:

The issue appears to be caused by the coordinates which are used in __dask_tokenize__

It appears that whether the tokenization is deterministic depends on whether the dataset/DataArray contains non-dimension coordinates or dimension coordinates:

In [2]: ds = xr.tutorial.open_dataset('rasm')
In [39]: a = ds.isel(time=0)

In [40]: a
Out[40]: 
<xarray.Dataset>
Dimensions:  (y: 205, x: 275)
Coordinates:
    time     object 1980-09-16 12:00:00
    xc       (y, x) float64 189.2 189.4 189.6 189.7 ... 17.65 17.4 17.15 16.91
    yc       (y, x) float64 16.53 16.78 17.02 17.27 ... 28.26 28.01 27.76 27.51
Dimensions without coordinates: y, x
Data variables:
    Tair     (y, x) float64 ...

In [41]: dask.base.tokenize(a) == dask.base.tokenize(a)
Out[41]: True
In [42]: b = ds.isel(y=0)

In [43]: b
Out[43]: 
<xarray.Dataset>
Dimensions:  (time: 36, x: 275)
Coordinates:
  * time     (time) object 1980-09-16 12:00:00 ... 1983-08-17 00:00:00
    xc       (x) float64 189.2 189.4 189.6 189.7 ... 293.5 293.8 294.0 294.3
    yc       (x) float64 16.53 16.78 17.02 17.27 ... 27.61 27.36 27.12 26.87
Dimensions without coordinates: x
Data variables:
    Tair     (time, x) float64 ...

In [44]: dask.base.tokenize(b) == dask.base.tokenize(b)
Out[44]: False

This looks like a bug in my opinion...

LunarLanding commented 2 years ago

This looks like a bug in my opinion...

@andersy005

This runs with no issues atm:

import dask
import xarray as xr

with dask.config.set({"tokenize.ensure-deterministic": True}):
    ds = xr.tutorial.open_dataset('rasm')
    b = ds.isel(y=0)
    assert dask.base.tokenize(b) == dask.base.tokenize(b)

With:

xarray                    2022.3.0           pyhd8ed1ab_0    conda-forge
dask                      2022.5.0           pyhd8ed1ab_0    conda-forge

matanox commented 11 months ago

Are xarray objects robustly hashable now?