gerritholl opened this issue 2 weeks ago
Both readers use map_blocks, which might be one way in which a filehandler reference accidentally ends up in the graph. However, the FCI code does not use the index map (the code in question is never reached). The LI use case for map_blocks does, but an ad-hoc change to replace it with something else does not appear to solve the problem (output and error message remain the same).
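A minimal illustration of how that can happen (FakeFileHandler is a made-up stand-in, not the actual satpy code): if map_blocks is given a bound method, the whole handler object, including any open NetCDF handle it holds, becomes part of the task graph and has to be pickled whenever the graph is shipped to a worker:

```python
import dask.array as da
import numpy as np

class FakeFileHandler:
    """Stand-in for a file handler; imagine an open netCDF4.Dataset stored on it."""
    def __init__(self):
        self.index_map = np.arange(10)

    def _remap(self, block):
        # Uses self, so running this on a worker requires serialising the whole handler.
        return self.index_map[block]

fh = FakeFileHandler()
arr = da.arange(10, chunks=5)
remapped = da.map_blocks(fh._remap, arr, dtype=arr.dtype)  # graph now references fh
print(remapped.compute())  # fine with a local scheduler, which never pickles the graph
```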
If I change v to v[:] in the da.from_array call, thus converting the netCDF4.Variable into a numpy.ndarray before it reaches dask, then the code completes successfully (but no longer reads lazily).
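For reference, the difference between the two variants in isolation (path is a placeholder for the LI test file used in the snippet further down):

```python
import netCDF4
import dask.array as da

nc = netCDF4.Dataset(path)         # path: placeholder for the test file
lazy = da.from_array(nc["x"])      # graph holds the open netCDF4.Variable -> not picklable
eager = da.from_array(nc["x"][:])  # slicing loads a numpy.ndarray first -> picklable, but not lazy
```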
So maybe it's not the filehandler; maybe the dask graphs themselves are unpicklable because they contain open NetCDF variables.
But unpicklable dask objects are still computable:
```python
import netCDF4
import dask.array as da
from dask.distributed import Client, LocalCluster

def main():
    cluster = LocalCluster()
    client = Client(cluster)
    nc = netCDF4.Dataset("/media/nas/x21308/MTG_test_data/LI/x/W_XX-EUMETSAT-Darmstadt,IMG+SAT,MTI1+LI-2-AF--FD--CHK-BODY---NC4E_C_EUMT_20240613045114_L2PF_OPE_20240613045030_20240613045100_N__T_0030_0002.nc")
    nc["x"]
    v = da.from_array(nc["x"])
    print(v.compute())

if __name__ == "__main__":
    main()
```
so the presence of this variable by itself shouldn't prevent the dask graph from being computable.
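To make the distinction explicit: the local schedulers never serialise the graph, whereas distributed has to, and that is the step which trips over the open variable. A quick check along these lines (continuing inside main() from the script above; the exact exception type depends on the netCDF4 version):

```python
import pickle

v = da.from_array(nc["x"])
print(v.compute(scheduler="threads"))       # works: nothing is serialised
try:
    pickle.dumps(dict(v.__dask_graph__()))  # roughly what distributed has to do
except Exception as err:                    # the netCDF4.Variable refuses to be pickled
    print("graph not picklable:", err)
```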
dask distributed is supposed to contain a custom pickler class to handle exactly such cases, and/or use cloudpickle, which can handle objects that are not otherwise picklable. This pull request explicitly refers to the data variable case (from h5netcdf, but probably the same for NetCDF4).
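For what it's worth, cloudpickle on its own does not appear to help here, because it defers to the type's own pickling for extension objects such as netCDF4.Variable; this is easy to check (again continuing from the script above):

```python
import cloudpickle

try:
    cloudpickle.dumps(nc["x"])
except Exception as err:
    print("cloudpickle cannot handle it either:", err)
```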
NB: The NetCDF4FileHandler uses xarray to open datasets when cache_handle=False (the default), but netCDF4 when cache_handle=True. Although we cannot pickle a dask array created from a NetCDF4 variable or an h5netcdf variable, we can pickle an xarray.Dataset created from the same data, due to additional functionality in xarray. This might explain why we can use dask distributed when cache_handle=False but not when it is set to True.
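A small check that illustrates the difference (fn_li is a placeholder for the LI test file used above):

```python
import pickle
import netCDF4
import xarray as xr

fn_li = "..."  # placeholder for the LI test file

# xarray keeps a re-openable, picklable file reference inside the object:
pickle.dumps(xr.open_dataset(fn_li, chunks={}))

# the raw netCDF4 variable does not:
try:
    pickle.dumps(netCDF4.Dataset(fn_li)["x"])
except Exception as err:
    print("not picklable:", err)
```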
Maybe we should use CachingFileManager, but I'm not sure what to replace da.from_array(v) with in this case. Some sort of map_blocks call like in https://github.com/pydata/xarray/issues/4242#issuecomment-2009576810?
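Something along those lines might look like the sketch below. This is only a rough illustration of the pattern from that comment, not a drop-in replacement for the file handler: lazy_variable and _load_block are made-up names, and real code would still need to handle groups, masking, and scaling.

```python
import dask.array as da
import netCDF4
from xarray.backends import CachingFileManager

def _load_block(template_block, manager, varname, block_info=None):
    # Runs on the worker: (re)open the file via the picklable manager and
    # read only the slice corresponding to this block.
    loc = tuple(slice(lo, hi) for lo, hi in block_info[0]["array-location"])
    with manager.acquire_context() as nc:
        return nc[varname][loc]

def lazy_variable(path, varname, chunks="auto"):
    manager = CachingFileManager(netCDF4.Dataset, path, mode="r")
    with manager.acquire_context() as nc:
        shape = nc[varname].shape
        dtype = nc[varname].dtype
    # The template only provides shape/chunk structure; no data are read here.
    template = da.zeros(shape, chunks=chunks, dtype=dtype)
    return da.map_blocks(_load_block, template, manager, varname, dtype=dtype)
```

With this, the graph contains the (picklable) CachingFileManager rather than an open Dataset, so it should survive being sent to distributed workers; each worker process then opens the file itself and the manager caches the handle.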
Describe the bug
When using the dask distributed LocalCluster, computing LI or FCI data gives corrupted data. Attempting to save the datasets to disk fails with several exceptions, with the root cause that a netCDF4.Variable is not picklable. This might affect other readers as well.
To Reproduce
For FCI:
And for LI accumulated products:
Expected behavior
I expect that the print statements result in the correct value, namely the same value as when I comment out the cluster usage. Furthermore, I expect both scripts to produce a simple image.
Actual results
With the distributed scheduler for FCI:
If I comment out the save_datasets call, the code completes, but the printed value is wrong (it shouldn't even be a float). For reference, when I use the regular scheduler, I get the (probably correct) value of 1025 (an integer).
If I load calibrated data, all data are NaN.
I get similar problems with LI, but not with IASI L2 CDR. All three use file handlers deriving from NetCDF4FsspecFileHandler, but only FCI and LI use cache_handle=True, which seems to trigger the problem. I did not try MWS (which also uses cache_handle=True).
Environment Info:
Additional context
I ran into this problem when working on #2686.
It would seem that references to the file handler object end up in the dask graph, which should probably be avoided. They fail to be pickled, because they contain references to NetCDF4 objects.
In #1546, a similar problem was solved for VIIRS Compact.
I don't know yet if the hypothesis is correct and if so, how those references end up in the dask graph.