lsterzinger / pangeo-cloud-L2-satellite


encountering the same issue you described in "Troubleshooting Dask Issues w/ GOES S3 Data". #1

Open ifenty opened 2 years ago

ifenty commented 2 years ago

Hi, @lsterzinger I think I am encountering the same issue you described in "Troubleshooting Dask Issues w/ GOES S3 Data".

I'm trying to open netCDF files over S3 with xr.open_mfdataset(). If a dask client is created (and attached) before the open_mfdataset call, the process hangs. When there is no dask client, the files open via open_mfdataset, albeit sequentially (i.e., slowly).
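Roughly, the setup looks like this (the bucket, prefix, and engine below are placeholders rather than my exact code):

```python
import s3fs
import xarray as xr
from dask.distributed import Client

# With this client present, open_mfdataset hangs; without it, it works (slowly).
client = Client()

fs = s3fs.S3FileSystem(anon=True)
paths = fs.glob("noaa-goes16/ABI-L2-SSTF/2021/001/*/*.nc")  # placeholder prefix
files = [fs.open(p) for p in paths]

ds = xr.open_mfdataset(files, engine="h5netcdf", combine="by_coords", parallel=True)
```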

Did you find a fix for this problem? I found some posts that mentioned circular references in h5netcdf, with suggestions to 'update to the latest versions', which I did, but the problem persists.

Thanks in advance. Ian

https://github.com/lsterzinger/pangeo-cloud-L2-satellite/tree/main/dask_troubleshooting

lsterzinger commented 2 years ago

Hi @ifenty,

I don't believe I ever got around this issue, and trying it again now with updated libraries, nothing seems to have changed. What I ended up doing was using fsspec/s3fs to list the object paths on S3, and then creating a list of https:// URLs with #mode=bytes appended to the end of each filename, which allows access to specific byte ranges over https. xarray is able to open files this way, though I'm guessing it would be faster to go via s3.

Here's a quick example I put together: https://gist.github.com/lsterzinger/77153e08353885d15497702bc4db67b9
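In the same spirit, a rough sketch of that s3fs-to-https approach (the bucket/prefix and URL form here are assumptions, not the exact code from the gist):

```python
import s3fs
import xarray as xr

fs = s3fs.S3FileSystem(anon=True)
s3_paths = fs.glob("noaa-goes16/ABI-L2-SSTF/2021/001/*/*.nc")  # placeholder prefix

# Rewrite each object path as a public https URL; "#mode=bytes" asks the
# netCDF-C library (engine="netcdf4", built with byte-range support) to read
# only the byte ranges it needs over HTTP instead of downloading whole files.
https_urls = [
    "https://noaa-goes16.s3.amazonaws.com/" + p.split("/", 1)[1] + "#mode=bytes"
    for p in s3_paths
]

ds = xr.open_mfdataset(https_urls, engine="netcdf4", combine="by_coords")
```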

ifenty commented 2 years ago

@lsterzinger thanks for the very fast response. In my environment, I could get your code to work with parallel=True for a small number of files (n < 6). For 24 files, the dask tasks run for a while, then abruptly terminate, leaving the cell in a hung [*] state. The hunt continues.