leap-stc / cmip6-leap-feedstock

Apache License 2.0

Unable to open file thrown for some jobs #31

Open jbusecke opened 1 year ago

jbusecke commented 1 year ago

A couple of jobs lately failed out with the following error:

...
  File "/srv/conda/envs/notebook/lib/python3.9/site-packages/h5py/_hl/files.py", line 226, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 106, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 1878556231, sblock->base_addr = 0, stored_eof = 14763458119)
  [while running 'Creating CMIP6.ScenarioMIP.IPSL.IPSL-CM6A-LR.ssp585.r1i1p1f1.Omon.zmeso.gn.v20190903|OpenURLWithFSSpec|OpenWithXarray|Preprocessor|StoreToZarr|Logging to non-QC table|TestDataset|Logging to QC table/OpenWithXarray/Open with Xarray-ptransform-87']

I am able to reproduce this with the cached file:

import xarray as xr

f = "gs://leap-scratch/data-library/cache/b6430036d547ee167decac45ca4a44c2-http_vesg.ipsl.upmc.fr_thredds_fileserver_cmip6_scenariomip_ipsl_ipsl-cm6a-lr_ssp585_r1i1p1f1_omon_zmeso_gn_v20190903_zmeso_omon_ipsl-cm6a-lr_ssp585_r1i1p1f1_gn_210101-220012.nc"

ds = xr.open_dataset(f, use_cftime=True, chunks={})

Wondering if this means the file is corrupted, or if we could fix this somehow.
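The numbers in the HDF5 error already hint at an answer: stored_eof is the total file size recorded in the HDF5 superblock, while eof is how many bytes are actually present. A quick sketch of the arithmetic, using the values from the traceback above:

```python
# Sizes reported in the HDF5 "truncated file" error above.
eof = 1_878_556_231          # bytes actually present in the cached copy
stored_eof = 14_763_458_119  # total size recorded in the HDF5 superblock

missing = stored_eof - eof
fraction_cached = eof / stored_eof
print(f"missing bytes: {missing}")                # 12884901888, i.e. exactly 12 GiB
print(f"fraction cached: {fraction_cached:.1%}")  # ~12.7%
```

Only about 13% of the expected bytes are present, which suggests the cached copy (not necessarily the source file) is incomplete.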

Dataflow job

cisaacstern commented 1 year ago

With

import xarray as xr
import gcsfs
gcs = gcsfs.GCSFileSystem()

f = "gs://leap-scratch/data-library/cache/b6430036d547ee167decac45ca4a44c2-http_vesg.ipsl.upmc.fr_thredds_fileserver_cmip6_scenariomip_ipsl_ipsl-cm6a-lr_ssp585_r1i1p1f1_omon_zmeso_gn_v20190903_zmeso_omon_ipsl-cm6a-lr_ssp585_r1i1p1f1_gn_210101-220012.nc"

ds = xr.open_dataset(gcs.open(f), use_cftime=True, chunks={})

I get:

OSError: Unable to synchronously open file (truncated file: eof = 1878556231, sblock->base_addr = 0, stored_eof = 14763458119)

So maybe the caching was interrupted, resulting in the "truncated file"?

jbusecke commented 1 year ago

Interesting. Should we manually delete the cache and rebuild in that case? Or is there a way to trigger a recaching from within the recipe if this sort of error occurs?

cisaacstern commented 1 year ago

> Or is there a way to trigger a recaching from within the recipe if this sort of error occurs?

You can just re-run the job; the caching step should recognize that the already-cached file is a different size than the source file and re-cache it. Note that caching is only skipped if the source and cached files are the same size:

https://github.com/pangeo-forge/pangeo-forge-recipes/blob/5e9eae41bb549ee00b6495453fa1dab52cf12599/pangeo_forge_recipes/storage.py#L177-L183
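For reference, a minimal sketch of the size-comparison rule described above (an illustration of the behavior, not the actual pangeo-forge-recipes code): caching is skipped only when a cached copy already exists and matches the source size.

```python
from typing import Optional

def should_skip_caching(source_size: int, cached_size: Optional[int]) -> bool:
    """Skip re-caching only if a cached copy exists and matches the source size."""
    return cached_size is not None and cached_size == source_size

# Sizes from the traceback: the truncated cache (1878556231 bytes) does not
# match the expected size (14763458119 bytes), so a re-run would re-cache it.
assert not should_skip_caching(14_763_458_119, 1_878_556_231)   # mismatch -> re-cache
assert should_skip_caching(14_763_458_119, 14_763_458_119)      # match -> skip
assert not should_skip_caching(14_763_458_119, None)            # nothing cached yet
```

So simply re-running the failed job should replace the truncated cached file.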

jbusecke commented 1 year ago

I'll leave this open for now, but I suspect that either the source file is broken or this will fix itself on a re-run. Either way, no action needed on our end. Thx @cisaacstern