pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/
GNU General Public License v3.0

Composite snow_age fails (no 'area') after composite cloud_phase #2724

Closed: howff closed this issue 2 weeks ago

howff commented 5 months ago

Describe the bug

Producing the snow_age composite after the cloud_phase composite fails. I am testing with some VIIRS data in .h5 format.

To Reproduce

from satpy.scene import Scene
from glob import glob

coast_dir = '/opt/gshhg'
all_overlays = {'coast': {'level': 1}}

myfiles = glob("data/*h5")
scn = Scene(filenames=myfiles, reader='viirs_sdr')

wanted = 'cloud_phase'
scn.load([wanted])
lcn = scn.resample('antarctica')
print('ATTRS for %s = %s' % (wanted, lcn[wanted].attrs))
lcn.save_dataset(wanted, 'test.%s.png' % wanted,
                 overlay={'overlays': all_overlays, 'coast_dir': coast_dir})

wanted = 'snow_age'
scn.load([wanted])
lcn = scn.resample('antarctica')
print('ATTRS for %s = %s' % (wanted, lcn[wanted].attrs))
lcn.save_dataset(wanted, 'test.%s.png' % wanted,
                 overlay={'overlays': all_overlays, 'coast_dir': coast_dir})

Expected behavior

Both composites are created.

Actual results

  File "./read2.py", line 21, in <module>
    lcn.save_dataset(wanted, 'test.%s.png' % wanted, overlay = {'overlays': all_overlays, 'coast_dir':coast_dir})
  File "miniconda3/lib/python3.11/site-packages/satpy/scene.py", line 1295, in save_dataset
    return writer.save_dataset(self[dataset_id],
  File "miniconda3/lib/python3.11/site-packages/satpy/writers/__init__.py", line 883, in save_dataset
    img = get_enhanced_image(dataset.squeeze(), enhance=self.enhancer, overlay=overlay,
  File "miniconda3/lib/python3.11/site-packages/satpy/writers/__init__.py", line 448, in get_enhanced_image
    img = add_overlay(img, dataset.attrs["area"], fill_value=fill_value, **overlay)
KeyError: 'area'

Environment Info:

djhoese commented 5 months ago

So is the unsaid part of this issue that if you don't first request cloud_phase, snow_age loads just fine?

What if you do:

wanted = 'cloud_phase'
scn.load(["cloud_phase", "snow_age"])
lcn = scn.resample('antarctica')
print(lcn["snow_age"].attrs["area"])
howff commented 5 months ago

That's right: snow_age by itself loads fine, and other composites loaded before snow_age don't provoke the error, but cloud_phase followed by snow_age fails.

Your suggestion does work, and a couple of other orderings also work, so it's specifically the sequence in the original report that provokes the problem.

djhoese commented 5 months ago

Ok so I'm looking at this on my own machine and you can actually narrow it down to:

scn.load(["cloud_phase"])
scn.load(["snow_age"])
print(scn["snow_age"].attrs["area"])

If I add unload=False to the snow_age loading, loop through all the DataArrays, and collect their area attributes into a set, I get 3 different SwathDefinitions: one for the I-band resolution but 2 for the M-band resolution. The odd one out seems to be the M11 band (which is used by both composites); all other M bands have the same SwathDefinition.

When snow_age combines the metadata of its inputs it sees that they don't all have the same area, so it doesn't include that piece of metadata. On one hand this is a bug in the SnowAge code, because it should be doing this area check before the rest of the code continues, but in theory it shouldn't even be possible for M-band-only inputs to have different swaths.
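For reference, a minimal sketch of that check, assuming the same files and reader as the original report (scn.load and its unload keyword are Satpy API; the file glob and printed output are illustrative):

from glob import glob
from satpy.scene import Scene

scn = Scene(filenames=glob("data/*h5"), reader='viirs_sdr')
scn.load(["cloud_phase"])
scn.load(["snow_age"], unload=False)  # keep intermediate DataArrays around

# Collect every 'area' attribute; more than one unique SwathDefinition means
# the snow_age inputs disagree, so the combined metadata drops 'area'.
areas = {scn[key].attrs.get("area") for key in scn.keys()}
print(len(areas))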

More debugging...

djhoese commented 5 months ago

Whoa...not good:


In [9]: scn = Scene(reader="viirs_sdr", filenames=glob("/data/satellite/viirs/conus_day/*t1801*.h5"))

In [10]: scn.load(["M07"])

In [11]: scn.load(["M11"])

In [12]: scn["M11"].attrs["area"]
Out[12]: <pyresample.geometry.SwathDefinition at 0x7f69bd583fe0>

In [13]: scn["M07"].attrs["area"]
Out[13]: <pyresample.geometry.SwathDefinition at 0x7f6a2553ecf0>

The swaths are different.

djhoese commented 5 months ago

Ok, the quick workaround is to make sure you load everything at the same time.
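Applied to the script from the original report, that workaround looks like this (paths and overlay settings taken from that script):

from glob import glob
from satpy.scene import Scene

all_overlays = {'coast': {'level': 1}}
scn = Scene(filenames=glob("data/*h5"), reader='viirs_sdr')
scn.load(["cloud_phase", "snow_age"])   # request both composites in one call
lcn = scn.resample('antarctica')
for wanted in ("cloud_phase", "snow_age"):
    lcn.save_dataset(wanted, 'test.%s.png' % wanted,
                     overlay={'overlays': all_overlays, 'coast_dir': '/opt/gshhg'})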

@pytroll/satpy-core this is a major limitation and unexpected behavior of Satpy. It affects at least the h5py-based readers that use the HDF5 utils file handler. When it loads data it gets to here:

https://github.com/pytroll/satpy/blob/d3fe3fe5f71479ee5f4fae30dfe3728932a8cd2e/satpy/readers/hdf5_utils.py#L98-L111

This dset object is an h5py Dataset object and is passed to dask's from_array. Dask recognizes that it is array-like and, as far as I can tell, continues on with the h5py Dataset object as-is. BUT the .name it generates changes every time it is called. The name is used as the label and identifier for the dask tasks and determines whether a computation can be reused/shared or is the same as another task. It is also used (as a shortcut) when a SwathDefinition backed by dask arrays is hashed.

What happens is that dask's tokenize function tries to hash the h5py Dataset object, realizes it is an object but not one it knows how to hash/tokenize, and ends up generating a "random" UUID4.
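The behaviour described in the last two paragraphs can be reproduced with a plain h5py Dataset outside of Satpy (the file and variable names below are hypothetical):

import dask.array as da
import h5py
from dask.base import tokenize

dset = h5py.File("example.h5", "r")["/some/variable"]   # hypothetical path

# No deterministic normalizer is registered for h5py Dataset objects, so each
# tokenize call falls back to a fresh random uuid4.
print(tokenize(dset) == tokenize(dset))   # False

# from_array derives its .name from that token, so two loads of the same
# variable get different dask keys and can never share graph nodes.
arr1 = da.from_array(dset, chunks="auto")
arr2 = da.from_array(dset, chunks="auto")
print(arr1.name == arr2.name)             # False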

As far as I can tell one solution would be to try to add a __dask_tokenize__ method to the h5py object after the fact (if it lets us) and dask will find that and use it. We could base the resulting token off of the h5py filename, variable name, etc.
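A rough sketch of that idea (not actual Satpy code; the helper name is made up, and whether h5py allows attaching the attribute is exactly the open question):

import dask.array as da
import h5py

def open_with_stable_token(filename, variable):
    """Illustrative only: give the h5py Dataset a deterministic dask token."""
    dset = h5py.File(filename, "r")[variable]
    try:
        # dask looks this attribute up on the instance, so attaching it here is
        # enough if h5py permits setting new attributes on Dataset objects.
        dset.__dask_tokenize__ = lambda: (filename, variable, dset.shape, str(dset.dtype))
    except AttributeError:
        pass  # would need a thin array-like wrapper object instead
    return da.from_array(dset, chunks="auto")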

howff commented 5 months ago

Awesome deductive powers @djhoese - thanks for the amazingly fast investigation and workaround :-)

mraspaud commented 5 months ago

> As far as I can tell one solution would be to try to add a __dask_tokenize__ method to the h5py object after the fact (if it lets us) and dask will find that and use it. We could base the resulting token off of the h5py filename, variable name, etc.

How about putting an lru_cache around that function? Then when the same data is requested multiple times, the (same) cached version is returned.
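Roughly what that could look like (the class and method names below are stand-ins for the hdf5_utils accessor, not the real code):

import functools
import dask.array as da
import h5py

class CachedH5Accessor:
    def __init__(self, filename):
        self.filename = filename

    @functools.lru_cache(maxsize=None)  # note: caches per (self, key) and keeps self alive
    def _dask_array(self, key):
        dset = h5py.File(self.filename, "r")[key]
        # Repeated requests for the same key now return the same dask array,
        # and therefore the same dask name/token.
        return da.from_array(dset, chunks="auto")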

djhoese commented 4 months ago

@mraspaud That may be an option if we're careful. I was initially worried because of the HDF5 objects involved, but it looks like a string key is provided to the function, self.file_content[key] isn't actually used, and only self.filename (another string or Path-like object) is used, I think. I think we could cache the dask-array-building portion of the function, if that is dask-friendly. Or, as a last resort, we could generate the dask array once, take the token dask generates, and force that name when calling from_array.
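For illustration, the "force the token" variant could look something like this (the function name and chunking are placeholders, not the hdf5_utils internals):

import dask.array as da
import h5py
from dask.base import tokenize

def h5_to_dask(filename, key):
    dset = h5py.File(filename, "r")[key]
    # Build a deterministic name from things that identify the data, then pass
    # it to from_array so repeated loads of the same variable share one key.
    name = "h5dataset-" + tokenize(filename, key, dset.shape, str(dset.dtype))
    return da.from_array(dset, chunks="auto", name=name)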

howff commented 2 weeks ago

Thank you :-)