ifenty opened this issue 2 years ago
Hi @ifenty,
I don't believe I ever got around this issue, and trying it again now with updated libraries, nothing seems to have changed. What I ended up doing was using fsspec/s3fs to list the object paths on S3, then creating a list of `https://` URLs with `#mode=bytes` appended to the end of each filename, which allows access to specific byte ranges over HTTPS. xarray is able to open files this way, though I'm guessing it would be faster to go via S3 directly.
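In case it helps, a minimal sketch of the URL-building step described above (the bucket and key names are hypothetical, and the actual `xr.open_mfdataset` call is commented out since it requires xarray plus network access):

```python
def s3_to_https(s3_path: str) -> str:
    """Convert an S3 object path ('bucket/key.nc') into an https URL with
    '#mode=bytes' appended; the netCDF-C library (xarray's 'netcdf4'
    engine) then fetches specific byte ranges over HTTP instead of
    downloading the whole file."""
    bucket, key = s3_path.split("/", 1)
    return f"https://{bucket}.s3.amazonaws.com/{key}#mode=bytes"

# Hypothetical object paths, e.g. as returned by s3fs's fs.glob(...)
paths = ["my-bucket/goes16/file_000.nc", "my-bucket/goes16/file_001.nc"]
urls = [s3_to_https(p) for p in paths]
print(urls[0])
# → https://my-bucket.s3.amazonaws.com/goes16/file_000.nc#mode=bytes

# In practice you would then hand the list to xarray:
#   import xarray as xr
#   ds = xr.open_mfdataset(urls, engine="netcdf4")
```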
Here's a quick example I put together: https://gist.github.com/lsterzinger/77153e08353885d15497702bc4db67b9
@lsterzinger thanks for the very fast response. In my environment, I could get your code to work with parallel=True for a small number of files (n<6). For 24 files, the dask tasks run for a while, then abruptly terminate, leaving the cell in a hung [*] state. The hunt continues.
Hi, @lsterzinger I think I am encountering the same issue you described in "Troubleshooting Dask Issues w/ GOES S3 Data".
I'm trying to open netCDF files over S3 with `xr.open_mfdataset()`. If a dask client is created (and attached) before the `open_mfdataset` call, the process hangs. When there is no dask client, the files open via `open_mfdataset` (albeit sequentially, and therefore slowly).
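For reference, a rough sketch of the two configurations being compared (the file list is hypothetical, and the actual xarray/dask calls are shown as comments since they require xarray, dask.distributed, and network access):

```python
# Hypothetical list of byte-range-capable HTTPS URLs for netCDF files on S3.
urls = [
    f"https://my-bucket.s3.amazonaws.com/data/file_{i:03d}.nc#mode=bytes"
    for i in range(24)
]

# 1) With a distributed client attached, the parallel open hangs:
#    from dask.distributed import Client
#    client = Client()
#    import xarray as xr
#    ds = xr.open_mfdataset(urls, parallel=True)   # hangs
#
# 2) With no client (or parallel=False), the same call succeeds,
#    but each file is opened sequentially:
#    ds = xr.open_mfdataset(urls, parallel=False)  # slow but works

print(len(urls))
# → 24
```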
Did you find a fix for this problem? I found some posts describing circular references in h5netcdf, with suggestions to "update to the latest versions", which I did, but the problem persists.
Thanks in advance. Ian
https://github.com/lsterzinger/pangeo-cloud-L2-satellite/tree/main/dask_troubleshooting