pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/
GNU General Public License v3.0
1.07k stars 297 forks source link

With DataQuery loading, in the face of ambiguity, refuse the temptation to guess #1721

Open gerritholl opened 3 years ago

gerritholl commented 3 years ago

Feature Request

Is your feature request related to a problem? Please describe.

When a DataQuery matches multiple DataIDs, and I use the DataQuery to load a dataset, only one dataset is loaded. When multiple matching datasets are loaded and the DataQuery matches multiple matching datasets, selecting a key will guess which one the user means. For example:

from satpy import Scene, DataQuery
from glob import glob

seviri_files = glob("/media/nas/x21308/scratch/SEVIRI/202103300900/H-000*")

sc = Scene(filenames={"seviri_l1b_hrit": seviri_files})
wl = 0.8  # µm
dq = DataQuery(wavelength=wl)
matching = [x for x in sc.available_dataset_ids() if wl in x["wavelength"]]
print(len(matching), "matching")
sc.load([dq])
print(len(sc.keys()), "loaded")
sc.load(matching)
print(len(sc.keys()), "loaded")
print(sc[dq].attrs["_satpy_id"])

Running this gives:

6 matching
1 loaded
6 loaded
DataID(name='HRV', wavelength=WavelengthRange(min=0.5, central=0.7, max=0.9, unit='µm'), resolution=1000.134348869, calibration=<calibration.reflectance>, modifiers=())

Satpy is guessing what I want. This violates the Zen of Python:

$ python -m this
The Zen of Python, by Tim Peters

...
In the face of ambiguity, refuse the temptation to guess.
...

Describe the solution you'd like

I would like to force users to explicitly describe rules on what to do in face of ambiguity. There should be a global configuration that will be referred to whenever a user request is ambiguous. In this configuration, the user can configure preferences rules for resolution, wavelength, polarisation, or other keys. If the request is still ambiguous with this configuration, satpy should raise an exception.

The design of such a preferences configuration requires careful thought.

Describe any changes to existing user workflow

Perhaps the global configuration could default to the current behaviour, but considering recent discussions, I'm not sure if users even agree on what they would expect if loading 0.8 µm in SEVIRI, so maybe we should really force the user to be explicit (either in configuration or in call). The latter would break backward compatibility.

Additional context

Inspired by a discussion on slack.

djhoese commented 3 years ago

This is difficult for me. While I've always disliked "magic" decisions in software, this one felt like a good thing when we first started doing it. You ask for MODIS Band 2 and you'll always get the highest spatial resolution version of band 2. Depending on what files you provide that could be 500m, could be 1km, could be 2km.

With wavelength it is difficult, especially for SEVIRI, where HRV is the higher resolution but maybe has its own complications with geolocation and how users can use it. For example, using HRV instead of VIS008 when loading a composite that depends on 0.8um means you probably have to do some real resampling (not 'native'). So this "magic" actually makes Satpy and the data harder to use in this case.

I like the idea of people being able to configure the behavior, but fear that it will end up with issues just like this one. We intend for behavior configuration to make things easier but now people are getting the "wrong" composite combination of products because the behavior picked a "worse" band when the composite author assumed the "better" band would be chosen.

With that in mind, it kind of makes me think we should get rid of generic visir.yaml (Vis/IR) composites altogether and require that authors copy the composites to each sensor-specific YAML config with the actual DataID elements needed to completely select the right product (but continue with the current magic so higher resolution versions of products can be chosen if available).

...at least that's how I feel this instant.

pnuu commented 3 years ago

With wavelength it is difficult, especially for SEVIRI, where HRV is the higher resolution but maybe has its own complications with geolocation and how users can use it. For example, using HRV instead of VIS008 when loading a composite that depends on 0.8um means you probably have to do some real resampling (not 'native'). So this "magic" actually makes Satpy and the data harder to use in this case.

The complication with HRV isn't the resampling, but the completely different spectral behaviour so the products will not be even similar to each other if VIS008 is replaced with HRV. To lesser extent the different geographical coverage can also cause surprises.

With that in mind, it kind of makes me think we should get rid of generic visir.yaml (Vis/IR) composites altogether and require that authors copy the composites to each sensor-specific YAML config with the actual DataID elements needed to completely select the right product (but continue with the current magic so higher resolution versions of products can be chosen if available).

In operations I've always specified the composites in instrument-specific YAML files using the channel names to have as little ambiguity as possible. For a non-operational user this might not be the easiest route as they would need to map the wavelengths to the channel names. But if it saves them from these surprises, I'd say it's a good tradeoff, even if the composites would be effectively duplicated for each sensor in use.

mraspaud commented 3 years ago

I agree that ambiguity is not good, and also that all the builtin composites should use band names to avoid confusion for the known instruments. At the same time, it is nice to have "generic" composites based on wavelengths only that would work directly when implementing a new reader. And that the highest resolution is chosen by default in these cases.

Is this niceness sufficient to introduce a possibly confusing guessing mechanism? I don't know. To be continued...

gerritholl commented 3 years ago

If an instrument has a channel with a higher spatial resolution and a broader spectral band, and another channel with a lower spatial resolution but a narrower spectral band, then I don't know what users prefer or what should be used in an RGB that specifies only wavelength. An example is HRV, which is what triggered this discussion. There may be other instruments sharing this property.

mraspaud commented 3 years ago

I agree. But I was also thinking about MODIS with multiple resolutions for the same channel.