Open rabernat opened 1 year ago
Shoot, I'm still getting the `read_only` errors with 0.5.1: https://nbviewer.org/gist/85a34aed6e432d0d8502841076bbab92
I think you may be hitting a version of https://github.com/zarr-developers/zarr-python/pull/1353 because you are calling
```python
m = fs.get_mapper("")
```
Try updating to the latest zarr version, or else creating an FSStore instead.
Okay, will do!
Would be helpful to confirm which Zarr version you had installed.
Hmm, `zarr=2.13.6`, the latest from conda-forge. I see that `zarr=2.14.2` has been released though. I'll try pip installing that.
Okay, with the latest `zarr=2.14.2`, I don't get the `read_only` errors.
But the workflow fails near the end of the rechunking process:
```
KilledWorker: Attempted to run task ('copy_intermediate_to_write-bca90f45d4dc080cca14b54ce5a10d1f', 2) on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was tls://10.10.105.181:35291. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see https://distributed.dask.org/en/stable/killed.html.
```
The logs from those workers are not available on the dashboard, I guess because the workers died, right?
This rechunker workflow was working in December. Should I revert to zarr and rechunker from that era?
Ideally you would figure out what is going wrong and help us fix it, rather than rolling back to an earlier version. After all, you're a rechunker maintainer now! 😉
Are you sure that all your package versions match on your workers?
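One quick way to check this on a Dask cluster is `Client.get_versions(check=True)`, which raises if the client, scheduler, and workers disagree on package versions. A local in-process cluster stands in for the real deployment in this sketch:

```python
# Sketch: verify client/scheduler/worker package versions agree.
# In a real deployment you would pass your cluster's address to
# Client(...) instead of starting a local one.
from distributed import Client

client = Client(processes=False)            # stand-in for the real cluster
versions = client.get_versions(check=True)  # raises ValueError on mismatch
print(sorted(versions))                     # 'client', 'scheduler', 'workers'
client.close()
```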
I'm certainly willing to try to help debug it, but don't really know where to start. If you have ideas, I'm game to try them.
One of the nice things about nebari/conda-store is the notebook and workers see the same environment (accessed from the conda-store pod), so the versions always match.
I added you to the ESIP Nebari deployment if you are interested in checking it out.
https://nebari.esipfed.org/hub/user-redirect/lab/tree/shared/users/Welcome.ipynb
I won't be able to log into the ESIP cluster to debug your failing computation. If you think there has been a regression in rechunker in the new release, I strongly encourage you to develop a minimum reproducible example and share it via the issue tracker.
> If you have ideas, I'm game to try them.
My first idea would be to freeze every package version except rechunker in your environment, and then try running the exact same workflow with only different rechunker versions (say 0.5.0 vs 0.5.1). Your example has a million moving pieces. Dask, Zarr, kerchunk, xarray, etc etc. It's impossible to say whether your problem is caused by a change in rechunker unless you can isolate this. There have been extremely few changes to rechunker over the past year. Nothing that obviously would cause your dask workers to start running out of memory.
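To make that isolation concrete, one option is to diff two frozen environment snapshots (e.g. from `pip freeze`) and confirm that only the rechunker pin differs between the working and failing runs. A plain-Python sketch with illustrative freeze strings:

```python
# Sketch: diff two pip-freeze snapshots to confirm that only the
# rechunker pin changed between two environments. The freeze strings
# below are illustrative stand-ins for real `pip freeze` output.
def env_diff(freeze_a, freeze_b):
    a = dict(line.split("==") for line in freeze_a.splitlines() if "==" in line)
    b = dict(line.split("==") for line in freeze_b.splitlines() if "==" in line)
    # report every package whose pinned version differs (or is missing)
    return {pkg: (a.get(pkg), b.get(pkg))
            for pkg in set(a) | set(b) if a.get(pkg) != b.get(pkg)}

working = "rechunker==0.5.0\nzarr==2.13.3\ndask==2023.3.1"
failing = "rechunker==0.5.1\nzarr==2.13.3\ndask==2023.3.1"
print(env_diff(working, failing))  # → {'rechunker': ('0.5.0', '0.5.1')}
```

If anything besides rechunker shows up in the diff, the comparison between rechunker versions isn't clean.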
I've confirmed that my rechunking workflow runs successfully if I pin `zarr=2.13.3`:
```
cf_xarray 0.8.0 pyhd8ed1ab_0 conda-forge
dask 2023.3.1 pyhd8ed1ab_0 conda-forge
dask-core 2023.3.1 pyhd8ed1ab_0 conda-forge
dask-gateway 2022.4.0 pyh8af1aa0_0 conda-forge
dask-geopandas 0.3.0 pyhd8ed1ab_0 conda-forge
dask-image 2022.9.0 pyhd8ed1ab_0 conda-forge
fsspec 2023.3.0+5.gbac7529 pypi_0 pypi
intake-xarray 0.6.1 pyhd8ed1ab_0 conda-forge
jupyter_server_xarray_leaflet 0.2.3 pyhd8ed1ab_0 conda-forge
numcodecs 0.11.0 py310heca2aa9_1 conda-forge
pint-xarray 0.3 pyhd8ed1ab_0 conda-forge
rechunker 0.5.1 pypi_0 pypi
rioxarray 0.13.4 pyhd8ed1ab_0 conda-forge
s3fs 2022.11.0 py310h06a4308_0
xarray 2023.2.0 pyhd8ed1ab_0 conda-forge
xarray-datatree 0.0.12 pyhd8ed1ab_0 conda-forge
xarray-spatial 0.3.5 pyhd8ed1ab_0 conda-forge
xarray_leaflet 0.2.3 pyhd8ed1ab_0 conda-forge
zarr 2.13.3 pyhd8ed1ab_0 conda-forge
```
With `zarr=2.13.6` I get the `ReadOnlyError: object is read-only` error. With `zarr=2.14.2` I get the dask workers dying. @gzt5142 has a minimal reproducible example he will post shortly. But should this be raised as a zarr issue?
Thanks a lot for looking into this Rich!
> But should this be raised as a zarr issue?
How minimal is it? Can you decouple it from the dask and rechunker issues? Can you say more about what you think the root problem is?
Unfortunately, it turns out the minimal example we created works fine -- it does not trigger the problem described here. :(
I'm going to reopen this issue.
If there is a bug somewhere in our stack that is preventing rechunker from working properly, we really need to get to the bottom of it.
Tests with the latest dev environment are failing with errors like this.
This is the cause of the test failures in #134.