pytroll / satpy

Python package for earth-observing satellite data processing
http://satpy.readthedocs.org/en/latest/
GNU General Public License v3.0

Allow readers to filter the available datasets configured in YAML #434

Closed djhoese closed 5 years ago

djhoese commented 6 years ago

Is your feature request related to a problem? Please describe. The native_msg reader only accepts one file at a time, but these files can differ in what datasets/bands are actually stored in them. There are currently three ways to specify/modify what datasets are available from a reader:

  1. Configure them in YAML. If the file_type for a dataset has been loaded, then we assume this dataset can be loaded (see the example sketch after this list).
  2. Add an available_datasets method to the file handlers of a reader that produces a dictionary of ds_id -> ds_info. This is good when the files being read have a long list of dynamically loaded variables.
  3. Configure them in YAML (1 above), then define a resolution property on the specific file handler to modify the DatasetID for that dataset with the exact resolution available from this file.
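
For reference, a YAML-configured dataset (option 1) looks roughly like the sketch below; the names and values here are illustrative, not copied from a real reader config:

datasets:
  B01:
    name: B01
    resolution: 1000
    wavelength: [0.45, 0.47, 0.49]   # min, central, max (micrometers)
    calibration:
      reflectance:
        standard_name: toa_bidirectional_reflectance
        units: "%"
    file_type: example_file_type     # assumed loadable if this file type was loaded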

Describe the solution you'd like Add another method to file handlers, filter_available_datasets, that is called before available_datasets. This new method allows a file handler to indicate "I know this dataset is configured in the YAML, but I don't actually have that data in this file".

Describe any changes to existing user workflow Shouldn't be a problem if things are done properly.

Additional python or other dependencies None.

Describe any changes required to the build process None.

Describe alternatives you've considered Move the YAML-configured datasets to the available_datasets method. This makes it really hard to configure things like bands, whose wavelengths are probably best not hardcoded in Python (unless satpy stops using YAML to configure datasets/readers).

We could also just leave it the way it is, meaning datasets are listed as available but aren't actually loadable.

CC'ing @sjoro @adybbroe @ColinDuff

djhoese commented 6 years ago

What if filter_available_datasets was given the DatasetID from the YAML file and the dataset_info dictionary from the YAML? It would then return either (None, None) to say the dataset isn't available, or a (DatasetID, dict) pair where the metadata may have been modified and the resolution or other DatasetID attributes resolved.

Or maybe the method in the file handler should get the original dictionary from the reader (instead of being called multiple times with individual DatasetID/info arguments)?
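
A rough sketch of that first idea (this is purely the proposal, nothing implemented; it assumes a NetCDF-backed file handler with a self.nc dataset and a self.resolution read from the file):

def filter_available_datasets(self, ds_id, ds_info):
    """Proposed hook: filter or update a YAML-configured dataset."""
    if ds_info['name'] not in self.nc.variables:
        # configured in YAML but not present in this particular file
        return None, None
    # resolve attributes that are only known after opening the file
    new_id = ds_id._replace(resolution=self.resolution)
    return new_id, ds_info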

djhoese commented 6 years ago

So I double checked this to make sure I wasn't crazy after discussing things in #437 with @sfinkens. I think this may have pointed out an overall limitation of the base reader that I need to address. I would like @mraspaud's opinion on this too (@pnuu and @adybbroe too if you guys care).

Currently, satpy prefers to produce as many of the products requested by the user as possible. That means that if they ask for rgb1, rgb2, and rgb3 and one or more of those can't be generated for any normal reason, the others are still generated and saved. This (IMO) is preferred by most users who use satpy operationally. You don't want a bad value or a bug in one product to stop the production of all products (public users don't like that).

Right now readers have a list of all known dataset IDs created from two sources:

  1. The YAML file's datasets section.
  2. The available_datasets() for the first file handler of each file type.

The question coming up in #437 is what the difference is between "available" and "exists" in the readers/file handlers. Right now the readers don't necessarily have the concept of "this is a dataset I know about, but it is not available for loading", except at the file type level where we can say "we don't have the M15 files, so M15 isn't available for loading". The point of this issue is to allow the file handlers to say "this dataset that was configured in YAML isn't available in this file handler right now".

For example, the geocat reader is configured to load B01-B16 for AHI data files; these bands are configured in the YAML so that wavelength and other metadata can be specified easily. It also provides dynamic datasets from the file handler, like pixel_surface_type. If the file doesn't have B16, then I would expect scn.available_dataset_names() to not list that band; that's what #437 is meant to fix. However, the current implementation in that PR handles available_dataset_names(), but it also causes scn.load(['B16']) to raise a KeyError. This is not how other readers in satpy work.

I think I would like to have the yaml reader know the difference between possible (all) datasets and available datasets. Thoughts? (damn this got long)

sfinkens commented 6 years ago

Thank you for breaking this down very clearly @djhoese! Examples of the two different scenarios might be helpful as well. The current behaviour is:

a) Dataset not available (no 10_7 files provided)

In [9]: filenames = ['goes15.2018.001.003019.BAND_02.nc']

In [10]: scene = Scene(filenames=filenames, reader='nc_goes')
[DEBUG: 2018-11-13 08:16:51 : satpy.scene] Setting 'PPP_CONFIG_DIR' to '/cmsaf/nfshome/sfinkens/software/devel/satpy/satpy/etc'
[DEBUG: 2018-11-13 08:16:51 : satpy.readers] Reading ['/cmsaf/nfshome/sfinkens/software/devel/satpy/satpy/etc/readers/nc_goes.yaml', '/cmsaf/nfshome/sfinkens/software/devel/satpy/satpy/etc/readers/nc_goes.yaml', '/cmsaf/nfshome/sfinkens/software/devel/satpy/satpy/etc/readers/nc_goes.yaml']
[DEBUG: 2018-11-13 08:16:51 : satpy.readers.yaml_reader] Assigning to nc_goes: ['goes15.2018.001.003019.BAND_02.nc']
[DEBUG: 2018-11-13 08:16:51 : satpy.composites] Looking for composites config file goes_imager.yaml
[DEBUG: 2018-11-13 08:16:51 : satpy.composites] Looking for composites config file visir.yaml

In [11]: scene.available_dataset_names()
Out[11]: ['03_9', 'latitude_03_9', 'longitude_03_9']

In [12]: scene.load(['10_7'])
[WARNING: 2018-11-13 08:16:58 : satpy.readers.yaml_reader] Required file type 'nc_goes_10_7' not found or loaded for 'latitude_10_7'
[WARNING: 2018-11-13 08:16:58 : satpy.readers.yaml_reader] Required file type 'nc_goes_10_7' not found or loaded for 'longitude_10_7'
[WARNING: 2018-11-13 08:16:58 : satpy.readers.yaml_reader] Required file type 'nc_goes_10_7' not found or loaded for '10_7'
[WARNING: 2018-11-13 08:16:58 : satpy.scene] The following datasets were not created: DatasetID(name='10_7', wavelength=(10.2, 10.7, 11.2), resolution=None, polarization=None, calibration='brightness_temperature', level=None, modifiers=())

b) Dataset unknown

>>> scene.load(['foo'])
KeyError: 'Unknown datasets: foo'

The implementation in #437 throws a KeyError of type b) if a known dataset is not available in the file, which erroneously suggests that the dataset is unknown.

djhoese commented 6 years ago

@sfinkens Yes exactly. I get something different for case a. For VIIRS SDRs where I don't provide any M files I get:

In [3]: scn = Scene(reader='viirs_sdr', filenames=glob('/data/data/viirs/conus_day/*I*t1801*.h5'))                                                                                                
DEBUG:satpy.scene:Setting 'PPP_CONFIG_DIR' to '/Users/davidh/repos/git/satpy/satpy/etc'
DEBUG:satpy.readers:Reading ['/Users/davidh/repos/git/satpy/satpy/etc/readers/viirs_sdr.yaml', '/Users/davidh/repos/git/satpy/satpy/etc/readers/viirs_sdr.yaml', '/Users/davidh/repos/git/satpy/satpy/etc/readers/viirs_sdr.yaml']
DEBUG:satpy.readers.yaml_reader:Assigning to viirs_sdr: ['/data/data/viirs/conus_day/GITCO_npp_d20120225_t1801245_e1802487_b01708_c20120226001734123892_noaa_ops.h5', '/data/data/viirs/conus_day/SVI01_npp_d20120225_t1801245_e1802487_b01708_c20120226002130255476_noaa_ops.h5', '/data/data/viirs/conus_day/SVI02_npp_d20120225_t1801245_e1802487_b01708_c20120226002313514280_noaa_ops.h5', '/data/data/viirs/conus_day/SVI03_npp_d20120225_t1801245_e1802487_b01708_c20120226002348811096_noaa_ops.h5', '/data/data/viirs/conus_day/SVI04_npp_d20120225_t1801245_e1802487_b01708_c20120226002313755405_noaa_ops.h5', '/data/data/viirs/conus_day/SVI05_npp_d20120225_t1801245_e1802487_b01708_c20120226002349115396_noaa_ops.h5']
DEBUG:satpy.composites:Looking for composites config file viirs.yaml
DEBUG:satpy.composites:Looking for composites config file visir.yaml

In [4]: scn.available_dataset_names()                                                                                                                                                             
Out[4]: ['I01', 'I02', 'I03', 'I04', 'I05', 'i_latitude', 'i_longitude', 'satellite_azimuth_angle', 'satellite_zenith_angle', 'solar_azimuth_angle', 'solar_zenith_angle']

In [5]: scn.load(['M01'])                                                                                                                                                                         
WARNING:satpy.readers.yaml_reader:Required file type 'svm01' not found or loaded for 'M01'
WARNING:satpy.readers.yaml_reader:Required file type 'svm01' not found or loaded for 'M01'
WARNING:satpy.readers.yaml_reader:Required file type 'gmtco' not found or loaded for 'm_longitude'
WARNING:satpy.readers.yaml_reader:Required file type 'gmtco' not found or loaded for 'm_latitude'
WARNING:satpy.readers.yaml_reader:Required file type 'svm01' not found or loaded for 'M01'
WARNING:satpy.scene:The following datasets were not created: DatasetID(name='M01', wavelength=(0.402, 0.412, 0.422), resolution=742, polarization=None, calibration='reflectance', level=None, modifiers=('sunz_corrected',))

What reader are you using for 10_7? Is 10_7 configured in the YAML?

ColinDuff commented 6 years ago

For the native_msg reader we determine what actual data is present in the file when we read the header, and if the user tries to load a channel that isn't present they will get a KeyError. It can't be difficult to then do as you suggest, so that the available_datasets method returns what is actually present and not all possible datasets. Or have I misunderstood?
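
For illustration, the header-driven check being described might look something like this; the method and attribute names are made up, not the actual native_msg code:

from satpy.readers.file_handlers import BaseFileHandler

class ExampleSeviriFileHandler(BaseFileHandler):
    def __init__(self, filename, filename_info, filetype_info):
        super(ExampleSeviriFileHandler, self).__init__(
            filename, filename_info, filetype_info)
        # hypothetical: parse the header once to learn which channels exist
        self.channels_in_file = self._read_channel_list_from_header()

    def get_dataset(self, dataset_id, dataset_info):
        if dataset_id.name not in self.channels_in_file:
            # the YAML reader logs this as a warning and moves on
            raise KeyError("Channel %s not present in file" % dataset_id.name)
        # ... read and calibrate the channel data here ...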

djhoese commented 6 years ago

@ColinDuff The file handler may raise a KeyError, but this is only logged as a warning, and the base/parent Reader object in yaml_reader.py will continue processing other requests. At least I think that's how it works, IIRC.

This issue was made to address the fact that the native_msg reader currently shows all datasets as available with scn.available_dataset_names(), but then things aren't actually available when you try to load them. Normally this would be handled with a call to the file handler's available_datasets() method, but since the native_msg reader has them all defined in YAML, it doesn't know which datasets it should know about (if that makes sense).

ColinDuff commented 6 years ago

OK, so I can have a look at making the native_msg (and NetCDF MSG) readers only show the channels that are actually available in the file, if you wish.

djhoese commented 6 years ago

That's what this issue is meant to discuss: the right way to do that. I understand why the native_msg reader is the way it is. I'm hoping to get some feedback on whether there is some situation I'm not considering. I think the way to support the readers as I've described is to have the main reader know all of the datasets that could be loaded by this reader and also ask the file handlers what is available. This would mean that two lists would have to be maintained, I think.

This would preserve the functionality of warning for existing but unavailable datasets while improving the capabilities of file handlers to filter/limit what datasets are considered available.

sjoro commented 6 years ago

This issue supersedes an older one #263

sfinkens commented 6 years ago

@djhoese Sorry, I pasted an additional line scene.show('10_7') which triggered the KeyError in case a). scene.load(['10_7']) indeed only issues a warning. I edited the example above.

ColinDuff commented 6 years ago

Talking with Sauli and just thinking out loud: available_datasets could be renamed to known_datasets (or whatever), listing what channels are configured in the YAML file. available_datasets would then be returned from the file handler, listing what has actually been read. available_composites does (would?) use this to determine what composites are currently available and only return those. For readers that work on a channel-per-file basis, or those that don't need to carry out a channels-available check, it would just set available_datasets to known_datasets after a successful read.
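
A minimal sketch of that split, using the hypothetical names from above (self.yaml_datasets and self.channels_in_file are assumed handler attributes, not existing satpy API):

def known_datasets(self):
    # everything listed in the reader's YAML file
    return self.yaml_datasets

def available_datasets(self):
    # only what was actually found while reading the file; readers that
    # don't need a per-file check could return known_datasets() directly
    return [ds_info for ds_info in self.known_datasets()
            if ds_info['name'] in self.channels_in_file]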

mraspaud commented 6 years ago

Interesting discussion. Two things:

  1. available_dataset_names should just give what is available with the files at hand. If necessary, we could implement a reader_dataset_names to get the names of all datasets configured for this reader, which would come from the YAML file.
  2. About how to implement this, I would like to use available_datasets more. So there should be a default implementation that just uses the YAML info (in the same way we do today), and then the reader would override it with its own version when relevant (e.g. the native_msg). For this, we would delegate the YAML dataset listing to the BaseFileHandler. Another solution, if we want to force the reader author to make a conscious decision about what datasets are readable, would be to include a call to filter_available_datasets in the base available_datasets method, so that the reader would crash unless the filtering was implemented. The implementation could be as easy as return self.reader_datasets for the one-channel-per-file readers. However, that means we require one more method to be implemented in the reader, but maybe that's acceptable.

djhoese commented 6 years ago

First, I think we need to all refer to full class.method combinations. There is a single Reader object that uses multiple file handler objects. I'll try to keep my answers below clear (pointing out which methods come from what objects). This is also kind of a brainstorm. You've been warned...

@mraspaud For your first point, there is a Scene.all_dataset_ids/names method that serves this purpose.

For your second point, so you are thinking the full list of YAML datasets gets passed to the file handler, and then the reader asks the file handlers what is available. This available list could include the YAML datasets and any dynamically discovered datasets from the file. Right? I think my only complaint about this is that it complicates the __init__ of the file handlers, making it even more difficult for new users to make a reader. Going with an optional overridden method in the file handler might make it easiest for the user, which means the file handler probably shouldn't hold on to what is known in the YAML. That said, is there anything that file handlers could do with a full list of YAML datasets that they can't do now or might want to do in the future?

Making this functionality available through an optional method means that most people won't have to worry about it. The previously mentioned PR does a good job of solving both problems in one method when it should probably solve them in two. So I propose, as a low-impact change for existing readers/handlers:

  1. handler.available_datasets() returning an iterator of dynamically generated (DatasetID, dataset_info) pairs. This already exists and works this way. Maybe change this to discovered_datasets() or dynamic_datasets()?
  2. handler.update_dataset_id(ds_id, dataset_info), which returns a new DatasetID (or the provided one) with things like resolution updated based on the file contents. I'd also be open to it updating the dataset_info in place, returning it, and the reader generating the new DatasetID from that. This better serves @mraspaud's hope to one day get rid of the DatasetID in favor of metadata-only queries.
  3. handler.filter_dataset(ds_id, dataset_info), which returns True or False depending on whether the dataset is available in this file.

The above would require some "caching" of the available datasets versus all/known datasets. This would, however, speed up Scene.available_dataset_ids() and similar methods since the file handlers wouldn't have to be consulted. It would also make reader.get_dataset_key accurate, which is used to determine if a product is known (when a user does scn.load(['composite', 'band'])).
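
Put together, a file handler opting in to all three hooks might look roughly like this; the method names are straight from the proposal above, and self.nc and self.resolution are assumed handler attributes:

from satpy.dataset import DatasetID

def available_datasets(self):
    # 1. dynamically discovered variables not configured in YAML
    for var_name in self.nc.variables:
        ds_id = DatasetID(name=var_name, resolution=self.resolution)
        yield ds_id, {'name': var_name}

def update_dataset_id(self, ds_id, ds_info):
    # 2. resolve attributes that are only known from the file contents
    return ds_id._replace(resolution=self.resolution)

def filter_dataset(self, ds_id, ds_info):
    # 3. is this YAML-configured dataset actually in this file?
    return ds_info['name'] in self.nc.variables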

For a "perfect" brand new solution I propose:

  1. handler.available_datasets(configured_datasets=None), which returns an iterator of (DatasetID, ds_info) pairs. This one method serves the purpose of all three methods described above (produce dynamic datasets, update properties of YAML-configured datasets, and filter out any YAML-configured datasets that don't exist in this file).

sfinkens commented 6 years ago

I'm wondering how you would avoid the KeyError in Scene.load? I guess datasets which are not available would have to be removed from Scene.datasets beforehand. But where are the datasets added to Scene.datasets in the first place?

mraspaud commented 6 years ago

Sounds good, looking forwards to reviewing the PR for this :)

djhoese commented 6 years ago

@sfinkens What happens is that the reader calls the file handler's get_dataset method. If it gets a KeyError, it continues processing the other bands. All of the datasets that are loaded from the reader are then given to the Scene, and Scene.datasets is updated. Scene.datasets is a dictionary of the datasets that have been loaded/generated, so unavailable datasets would not appear there.
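
Paraphrased as code, the reader-side flow is roughly the following (just the shape of the logic, not a quote of yaml_reader; the function name and signature are illustrative):

import logging

LOG = logging.getLogger(__name__)

def load_datasets(file_handler, dataset_ids):
    # sketch: keep going even if individual datasets fail to load
    loaded = {}
    for ds_id in dataset_ids:
        try:
            loaded[ds_id] = file_handler.get_dataset(ds_id, {})
        except KeyError:
            LOG.warning("Could not load dataset %s", ds_id)
            continue
    # only the successfully loaded datasets reach Scene.datasets
    return loaded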

djhoese commented 6 years ago

I need some more opinions, if anyone is at their computer this weekend. I was about to implement the "perfect" solution above and can't decide whether it is overly complicated or not. Having a single file handler method be responsible for updating YAML-configured dataset info and producing any dynamically discovered datasets isn't too bad. The complexity comes from this method also telling whether these datasets are available or just "known". I see two options:

Option 1 - All In One

def available_datasets(self, configured_datasets=None):
    for ds_info in (configured_datasets or []):
        # update ds_info with things like file resolution, etc. if needed
        yield self._dataset_is_available(ds_info), ds_info
    for var_name in dynamic_var_names:
        yield True, {'name': var_name}  # plus other metadata

Option 2 - Split

def all_datasets(self, configured_datasets=None):
    for ds_info in (configured_datasets or []):
        # update ds_info with things like file resolution, etc. if needed
        yield ds_info
    for var_name in dynamic_var_names:
        yield {'name': var_name}  # plus other metadata

def available_dataset(self, ds_info):
    return ds_info['name'] in self
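
For context, the reader side consuming Option 1 might then do something like this (hypothetical reader code; it assumes ds_info carries enough metadata for DatasetID.from_dict to build an ID):

from satpy.dataset import DatasetID

all_ids = {}           # every dataset the reader knows about
available_ids = set()  # the subset actually present in the files
for is_avail, ds_info in handler.available_datasets(configured_datasets):
    ds_id = DatasetID.from_dict(ds_info)
    all_ids[ds_id] = ds_info
    if is_avail:
        available_ids.add(ds_id)
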
djhoese commented 6 years ago

Eh, the second option is redundant. I'm still not a huge fan of the first, but it's not terrible.

sfinkens commented 6 years ago

I like option one! What would be the default behaviour of available_datasets? Raising a NotImplementedError and providing the above template in the docstring to guide users on how to use the method? Or maybe returning (True, ds_info) for all configured datasets?

djhoese commented 6 years ago

@sfinkens The latter. So the base file handler class would do:

def available_datasets(self, configured_datasets=None):
    for ds_info in (configured_datasets or []):
        yield True, ds_info