howff closed this issue 2 weeks ago
So is the unsaid part of this issue that if you don't first request `cloud_phase`, then `snow_age` loads just fine?
What if you do:

```python
wanted = 'cloud_phase'
scn.load(["cloud_phase", "snow_age"])
lcn = scn.resample('antarctica')
print(lcn["snow_age"].attrs["area"])
```
That's right: `snow_age` by itself loads fine, and other composites before `snow_age` don't provoke the error, but `cloud_phase` followed by `snow_age` fails.
Your suggestion does work, and a couple of other rearrangements also work, so it's the example in the original report which is provoking the problem.
OK, so I'm looking at this on my own machine, and you can actually narrow it down to:

```python
scn.load(["cloud_phase"])
scn.load(["snow_age"])
print(scn["snow_age"].attrs["area"])
```
If I add `unload=False` to the `snow_age` loading and loop through all the DataArrays, collecting their `area` attributes into a set, I get 3 different SwathDefinitions: one for the I-band resolution, but 2 for the M-band resolution. The odd one out seems to be the M11 band (which is used by both composites); all other M bands have the same SwathDefinition. When `snow_age` combines the metadata of its inputs, it sees that they don't all have the same area and won't include that piece of metadata. On one hand this is a bug in the SnowAge code, because it should do this area check before the rest of the code continues, but in theory it shouldn't be possible for M-band-only inputs to have different swaths.
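The set-collection check described above can be sketched with stand-in objects (the dict of arrays, the `object()` swaths, and the exact band names are illustrative; in the real session the values are pyresample SwathDefinitions hanging off `scn[name].attrs["area"]`):

```python
# Stand-in sketch: simulate each loaded DataArray as a dict carrying an
# "area" attribute, then collect the distinct swath objects into a set.
swath_i = object()      # shared by the I-band datasets
swath_m = object()      # shared by most M-band datasets
swath_m11 = object()    # the odd one out described above

data_arrays = {
    "I01": {"area": swath_i},
    "M07": {"area": swath_m},
    "M10": {"area": swath_m},
    "M11": {"area": swath_m11},
}

areas = {id(da["area"]) for da in data_arrays.values()}
print(len(areas))  # three distinct swath objects, as in the failing case
```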
More debugging...
Whoa...not good:
```python
In [9]: scn = Scene(reader="viirs_sdr", filenames=glob("/data/satellite/viirs/conus_day/*t1801*.h5"))

In [10]: scn.load(["M07"])

In [11]: scn.load(["M11"])

In [12]: scn["M11"].attrs["area"]
Out[12]: <pyresample.geometry.SwathDefinition at 0x7f69bd583fe0>

In [13]: scn["M07"].attrs["area"]
Out[13]: <pyresample.geometry.SwathDefinition at 0x7f6a2553ecf0>
```
The swaths are different.
Ok, the quick workaround is to make sure you load everything at the same time.
@pytroll/satpy-core this is a major limitation and unexpected behavior of Satpy. This affects at least the h5py-based readers using the HDF5 utils file handler. When it loads data it gets to here:
This `dset` object is an h5py `Dataset` object and is passed to dask's `from_array`. Dask is able to realize this is array-like and will continue on with the h5py `Dataset` object, as far as I can tell. BUT the `.name` it generates changes on every call. The `name` is what is used as the label and identifier for the dask tasks and tells you whether a computation can be reused/shared or is the same as another task. It is also used (as a shortcut) when a `SwathDefinition` is hashed and dask arrays are involved.

What happens is that dask's `tokenize` function tries to hash the h5py `Dataset` object, realizes it is an object but not one it knows how to hash/tokenize, and ends up generating a "random" UUID4.

As far as I can tell, one solution would be to try to add a `__dask_tokenize__` method to the h5py object after the fact (if it lets us), and dask will find that and use it. We could base the resulting token off of the h5py filename, variable name, etc.
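A minimal sketch of that idea, assuming dask's documented deterministic-hashing protocol (an object can supply its own token via a `__dask_tokenize__` method). `FakeDataset` and `FakeFile` stand in for `h5py.Dataset` and `h5py.File`; the attribute names mirror h5py's `file.filename` and `name`:

```python
class FakeFile:
    """Stand-in for h5py.File; only carries the filename."""
    def __init__(self, filename):
        self.filename = filename


class FakeDataset:
    """Stand-in for h5py.Dataset with a deterministic dask token."""
    def __init__(self, filename, name):
        self.file = FakeFile(filename)
        self.name = name

    def __dask_tokenize__(self):
        # The token depends only on where the data lives, never on object
        # identity, so two handles to the same variable tokenize identically
        # instead of each getting a fresh UUID4.
        return ("h5py-dataset", self.file.filename, self.name)


d1 = FakeDataset("granule.h5", "/All_Data/VIIRS-M11-SDR_All/Reflectance")
d2 = FakeDataset("granule.h5", "/All_Data/VIIRS-M11-SDR_All/Reflectance")
assert d1.__dask_tokenize__() == d2.__dask_tokenize__()
```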
Awesome deductive powers @djhoese - thanks for the amazingly fast investigation and workaround :-)
How about putting an `lru_cache` around that function, maybe? Then when the same data is requested multiple times, the (same) cached version is returned.
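A minimal sketch of this suggestion using `functools.lru_cache` (the class and method names here are illustrative, not Satpy's actual file-handler API):

```python
from functools import lru_cache


class HDF5FileHandler:
    """Illustrative handler that caches the array built per variable key."""

    def __init__(self, filename):
        self.filename = filename

    @lru_cache(maxsize=None)  # keyed on (self, key); self hashes by identity
    def get_data_array(self, key):
        # In the real handler this would wrap the h5py Dataset with
        # dask.array.from_array; a fresh object() stands in for that here.
        return object()


handler = HDF5FileHandler("granule.h5")
first = handler.get_data_array("M11")
second = handler.get_data_array("M11")
assert first is second  # same cached object, so dask would see the same token
```

One caveat of `lru_cache` on a method is that the cache keeps a reference to `self`, so file handlers would stay alive as long as the cache does.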
@mraspaud That may be an option if we're careful. I was initially worried because of the HDF5 objects involved, but it looks like a string `key` is provided to the function, the `self.file_content[key]` isn't actually used, and then only `self.filename` (another string or Path-like object) is used, I think. I think we could cache the resulting dask array portion of the function... if that is dask-friendly. Or, as a last resort, we could generate the dask array once, take the token dask generates, and force that when calling `from_array`.
Thank you :-)
**Describe the bug**
Producing the `snow_age` composite after the `cloud_phase` composite fails. I am testing with some VIIRS data in .h5 format.

**To Reproduce**

**Expected behavior**
Both composites created.

**Actual results**

**Environment Info:**