Why are you doing this? It seems dangerous!

`os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'`

Usually an HDF error means a corrupt file, a bad disk, or a bad network. Can you reproduce without `parallel=True`? If so, that will make it easy to figure out which one is bad.
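For reference, a minimal sketch (not from the thread itself) of one way to narrow down a bad input file is to open each one individually and see which fails; the `./remapped/*.nc` pattern is taken from the snippet later in the thread:

```python
import glob

import xarray as xr

# Try opening and fully reading each input file on its own to spot a corrupt one.
for path in sorted(glob.glob('./remapped/*.nc')):
    try:
        with xr.open_dataset(path, engine='netcdf4') as ds:
            ds.load()  # force a full read so decoding/IO errors surface
    except Exception as exc:
        print(f'Failed to read {path}: {exc}')
```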
> Why are you doing this? It seems dangerous!

There are some situations that seem to require this, e.g. using a parallel filesystem over NFS.

In any case, I suspect this is at least related to #7079 (i.e. recent netcdf4 + the threading scheduler arbitrarily failing because of race conditions). Edit: maybe not? `threads_per_worker` is set to 1 here.
> Why are you doing this? It seems dangerous!
>
> `os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'`

I just followed xarray's documentation: https://github.com/pydata/xarray/blob/2b444af78cef1aaf4bbd9e4bded246d1d0defddb/doc/user-guide/dask.rst?plain=1#L147-L153
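One detail worth noting (not discussed in the thread itself): with a distributed `LocalCluster`, the workers are separate processes, so an environment variable set after the cluster is created is not necessarily visible to them. A minimal sketch of setting it before worker startup, or pushing it to workers that are already running via `client.run`:

```python
import os

# Set the variable before the cluster (and hence the worker processes) is
# created, so the workers inherit it; setting it afterwards only affects
# the client process.
os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)
client = Client(cluster)


# For workers that are already running, the variable can be set remotely.
def _disable_locking():
    import os
    os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'


client.run(_disable_locking)
```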
> Usually an HDF error means a corrupt file, a bad disk, or a bad network. Can you reproduce without `parallel=True`? If so, that will make it easy to figure out which one is bad.

Here is the same code without the `parallel=True` option:
```python
import os

import dask
import xarray as xr
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)  # 1 core to each worker
client = Client(cluster)

os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

ds = xr.open_mfdataset('./remapped/*.nc', chunks={'COMID': 1400})
ds.to_netcdf('./nonparallel_out.nc')
```
And the relevant output (which fails at the end and kills the kernel):
As I said earlier, I am working on an HPC, and the resources the SLURM job allocated for this run are quite generous: 1 node with 10 CPUs and 62 GB of RAM.
> There are some situations that seem to require this, e.g. using a parallel filesystem over NFS.

The filesystem I am working on is "Lustre."

Let me know if you need more information.
The only solution that I have found for my problem is using different engines for the read and write operations: specifically, `h5netcdf` to read and `netcdf4` to write. Surprisingly, `h5netcdf` is more efficient at reading, and `netcdf4` is similar at writing. Hope this helps future visitors.
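A minimal sketch of that workaround (the paths and chunking are carried over from the earlier snippet; this is an illustration, not the author's exact code):

```python
import xarray as xr

# Read the inputs through the h5netcdf engine...
ds = xr.open_mfdataset(
    './remapped/*.nc',
    engine='h5netcdf',
    chunks={'COMID': 1400},
    parallel=True,
)

# ...and write the combined dataset through the netcdf4 engine.
ds.to_netcdf('./out.nc', engine='netcdf4')
```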
Going to close here, given the age of this issue. Please feel free to reopen.
> The only solution that I have found for my problem is using different engines for the read and write operations.

This comment helped me find a workaround for a similar issue. I suddenly started getting an HDF error on `.to_netcdf(engine='netcdf4')` writes in some environments after adding seemingly unrelated code that directly uses `h5py`. The workaround was to switch the engine to `h5netcdf`.
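In code, the change amounts to the engine argument on the write call (file names here are only illustrative):

```python
import xarray as xr

ds = xr.open_dataset('./some_input.nc')  # hypothetical input file

# Writing through h5netcdf instead of netcdf4 avoided the HDF error in this case.
ds.to_netcdf('./out.nc', engine='h5netcdf')
```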
What is your issue?
I am simply reading 366 small (~15 MB each) NetCDF files to create one big NetCDF file at the end. Below is the relevant workflow:
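(The exact snippet is not preserved in this copy of the thread; judging from the rest of the discussion, it was presumably along these lines, with the `parallel=True` and file-locking settings that are questioned below:)

```python
import os

import xarray as xr
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)  # 1 core to each worker
client = Client(cluster)

os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'

ds = xr.open_mfdataset('./remapped/*.nc', parallel=True, chunks={'COMID': 1400})
ds.to_netcdf('./out2.nc')
```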
And below is the error I am getting:
Error message
```python-console
In [8]: ds.to_netcdf('./out2.nc')
/home/kasra545/virtual-envs/meshflow/lib/python3.10/site-packages/distributed/client.py:3149: UserWarning: Sending large graph of size 9.97 MiB.
This may cause some slowdown. Consider scattering data ahead of time and using futures.
  warnings.warn(
2023-09-18 22:26:14,279 - distributed.worker - WARNING - Compute Failed
Key:       ('open_dataset-concatenate-concatenate-be7dd534c459e2f316d9149df2d9ec95', 178, 0)
Function:  getter
args:      (ImplicitToExplicitIndexingAdapter(array=CopyOnWriteArray(array=LazilyIndexedArray(array=_ElementwiseFunctionArray(LazilyIndexedArray(array=
```

The header of the individual NetCDF files is also shown below:
Individual NetCDF header
```console
$ ncdump -h ab_models_remapped_1980-04-20-13-00-00.nc
netcdf ab_models_remapped_1980-04-20-13-00-00 {
dimensions:
    COMID = 14980 ;
    time = UNLIMITED ; // (24 currently)
variables:
    int time(time) ;
        time:long_name = "time" ;
        time:units = "hours since 1980-04-20 12:00:00" ;
        time:calendar = "gregorian" ;
        time:standard_name = "time" ;
        time:axis = "T" ;
    double latitude(COMID) ;
        latitude:long_name = "latitude" ;
        latitude:units = "degrees_north" ;
        latitude:standard_name = "latitude" ;
    double longitude(COMID) ;
        longitude:long_name = "longitude" ;
        longitude:units = "degrees_east" ;
        longitude:standard_name = "longitude" ;
    double COMID(COMID) ;
        COMID:long_name = "shape ID" ;
        COMID:units = "1" ;
    double RDRS_v2.1_P_P0_SFC(time, COMID) ;
        RDRS_v2.1_P_P0_SFC:_FillValue = -9999. ;
        RDRS_v2.1_P_P0_SFC:long_name = "Forecast: Surface pressure" ;
        RDRS_v2.1_P_P0_SFC:units = "mb" ;
    double RDRS_v2.1_P_HU_1.5m(time, COMID) ;
        RDRS_v2.1_P_HU_1.5m:_FillValue = -9999. ;
        RDRS_v2.1_P_HU_1.5m:long_name = "Forecast: Specific humidity" ;
        RDRS_v2.1_P_HU_1.5m:units = "kg kg**-1" ;
    double RDRS_v2.1_P_TT_1.5m(time, COMID) ;
        RDRS_v2.1_P_TT_1.5m:_FillValue = -9999. ;
        RDRS_v2.1_P_TT_1.5m:long_name = "Forecast: Air temperature" ;
        RDRS_v2.1_P_TT_1.5m:units = "deg_C" ;
    double RDRS_v2.1_P_UVC_10m(time, COMID) ;
        RDRS_v2.1_P_UVC_10m:_FillValue = -9999. ;
        RDRS_v2.1_P_UVC_10m:long_name = "Forecast: Wind Modulus (derived using UU and VV)" ;
        RDRS_v2.1_P_UVC_10m:units = "kts" ;
    double RDRS_v2.1_A_PR0_SFC(time, COMID) ;
        RDRS_v2.1_A_PR0_SFC:_FillValue = -9999. ;
        RDRS_v2.1_A_PR0_SFC:long_name = "Analysis: Quantity of precipitation" ;
        RDRS_v2.1_A_PR0_SFC:units = "m" ;
    double RDRS_v2.1_P_FB_SFC(time, COMID) ;
        RDRS_v2.1_P_FB_SFC:_FillValue = -9999. ;
        RDRS_v2.1_P_FB_SFC:long_name = "Forecast: Downward solar flux" ;
        RDRS_v2.1_P_FB_SFC:units = "W m**-2" ;
    double RDRS_v2.1_P_FI_SFC(time, COMID) ;
        RDRS_v2.1_P_FI_SFC:_FillValue = -9999. ;
        RDRS_v2.1_P_FI_SFC:long_name = "Forecast: Surface incoming infrared flux" ;
        RDRS_v2.1_P_FI_SFC:units = "W m**-2" ;
```

I am running xarray and Dask on an HPC, so the "modules" I have loaded are the following:

Any suggestion is greatly appreciated!