pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Missing Blocks when loading zarr file #7430

Open gdkrmr opened 1 year ago

gdkrmr commented 1 year ago

What happened?

Under load, blocks of Zarr objects go missing. This happens on our MinIO server (see example) and on the HPC file system. Since it only occurs under load, when the file system gets slow, I suspect a timeout is being hit somewhere.

What did you expect to happen?

A complete map.

Incomplete map (produced under load):

[image: trend map with missing blocks]

Complete map, when the file system is not under load:

[image: temperature_trends]

Minimal Complete Verifiable Example

# Calculate global temperature trends

from dask.distributed import Client
import xarray as xr
from scipy import stats
from datetime import datetime
import matplotlib.pyplot as plt

def slope(y):
    # Least-squares slope of y against the index x = 0, 1, ..., n-1.
    x = list(range(len(y)))
    return stats.linregress(x, y).slope

def main():
    print(datetime.now(), "startup", flush=True)
    print(datetime.now(), "starting dask workers", flush=True)
    client = Client(n_workers=1, threads_per_worker=32, memory_limit='64GB')

    print(datetime.now(), "opening esdc", flush=True)
    c = xr.open_zarr("http://data.rsc4earth.de:9000/earthsystemdatacube/v3.0.1/esdc-8d-0.25deg-256x128x128-3.0.1.zarr/")

    print(datetime.now(), "getting air temperature data", flush=True)
    ct = c.air_temperature_2m

    print(datetime.now(), "setting up calculations", flush=True)
    cs = xr.apply_ufunc(
            slope,
            ct,
            input_core_dims=[['time']],
            vectorize=True,
            dask='parallelized',
            dask_gufunc_kwargs=dict(allow_rechunk=True))

    print(datetime.now(), "saving data", flush=True)
    csset = xr.Dataset(dict(tslope=cs))
    csset.to_zarr(store="temp_slopes.zarr", mode="w")
    print(datetime.now(), "plotting", flush=True)
    cssetcalc = xr.open_zarr("temp_slopes.zarr")
    cssetcalc.tslope.plot()
    plt.savefig("temperature_trends.png")
    print(datetime.now(), "done", flush=True)

if __name__ == '__main__':
    main()
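To sanity-check the per-pixel trend calculation independently of scipy, a pure-Python least-squares slope (a sketch for verification only; `slope_pure` is not part of the original report) should agree with `stats.linregress` on a noiseless series:

```python
def slope_pure(y):
    # Least-squares slope of y against x = 0, 1, ..., n-1,
    # via the closed-form covariance / variance ratio.
    n = len(y)
    x = range(n)
    mx = sum(x) / n
    my = sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

# A perfectly linear series y = 2x + 1 should give slope 2.
print(slope_pure([1, 3, 5, 7, 9]))  # -> 2.0
```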

MVCE confirmation

Relevant log output

No response

Anything else we need to know?

This only seems to happen under load, so you will need to stress the server a bit to reproduce it.
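If slow responses are indeed timing out, one possible mitigation is to open the store through fsspec with a more generous HTTP client timeout. This is an untested sketch: `client_kwargs` is forwarded by fsspec's HTTP file system to `aiohttp.ClientSession`, and the one-hour timeout value is a placeholder, not a recommendation.

```python
import aiohttp
import fsspec
import xarray as xr

# Hypothetical workaround: raise the aiohttp total timeout so that
# slow chunk responses are not dropped prematurely under load.
mapper = fsspec.get_mapper(
    "http://data.rsc4earth.de:9000/earthsystemdatacube/v3.0.1/esdc-8d-0.25deg-256x128x128-3.0.1.zarr/",
    client_kwargs={"timeout": aiohttp.ClientTimeout(total=3600)},
)
c = xr.open_zarr(mapper)
```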

Environment

INSTALLED VERSIONS
------------------
commit: None
python: 3.8.15 (default, Nov 24 2022, 15:19:38) [GCC 11.2.0]
python-bits: 64
OS: Linux
OS-release: 4.18.0-372.26.1.el8_6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: ('en_US', 'UTF-8')
libhdf5: None
libnetcdf: None

xarray: 2022.11.0
pandas: 1.5.2
numpy: 1.23.5
scipy: 1.9.3
netCDF4: None
pydap: None
h5netcdf: None
h5py: None
Nio: None
zarr: 2.13.3
cftime: None
nc_time_axis: None
PseudoNetCDF: None
rasterio: None
cfgrib: None
iris: None
bottleneck: 1.3.5
dask: 2022.02.1
distributed: 2022.2.1
matplotlib: 3.6.2
cartopy: None
seaborn: None
numbagg: None
fsspec: 2022.11.0
cupy: None
pint: None
sparse: None
flox: None
numpy_groupies: None
setuptools: 65.5.0
pip: 22.3.1
conda: None
pytest: None
IPython: None
sphinx: None

Illviljan commented 1 year ago

Try updating to the latest xarray and dask; dask has had some nice performance updates lately: https://medium.com/pangeo/dask-distributed-and-pangeo-better-performance-for-everyone-thanks-to-science-software-63f85310a36b

dcherian commented 1 year ago

This must be something to do with zarr itself.

cc @martindurant

martindurant commented 1 year ago

I recommend turning on logging in the HTTP file system and looking for errors:

import fsspec.utils
from dask.distributed import Client

client = Client(n_workers=1, threads_per_worker=32, memory_limit='64GB')
# Enable fsspec HTTP logging on the dask workers...
client.run(fsspec.utils.setup_logging, logger_name="fsspec.http")
# ...and in the local process.
fsspec.utils.setup_logging(logger_name="fsspec.http")
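For reference, the same logger can be configured with the standard library alone; this is a sketch of roughly what fsspec.utils.setup_logging does, assuming fsspec's HTTP backend logs under the name "fsspec.http":

```python
import logging

# Attach a DEBUG-level stream handler to fsspec's HTTP logger so that
# every request/response (and any error) is printed to stderr.
logger = logging.getLogger("fsspec.http")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
```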