Closed kkappler closed 2 years ago
Also, the arguments passed from the config should probably be packaged as a dictionary, so that the individual methods in the pipeline receive **processing_config_params as an argument.
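A minimal sketch of that calling pattern, assuming hypothetical names (`apply_windowing`, `num_samples_window`, etc.) rather than the actual aurora API: each pipeline method consumes only the keys it needs and ignores the rest of the unpacked config.

```python
# Hypothetical pipeline method: declares the parameters it uses and
# swallows the rest of the config dict via **unused.
def apply_windowing(data, num_samples_window=128, num_samples_overlap=32, **unused):
    """Count how many full windows fit in the data (illustrative only)."""
    advance = num_samples_window - num_samples_overlap
    n_windows = max(0, (len(data) - num_samples_overlap) // advance)
    return {"n_windows": n_windows, "window": num_samples_window}

processing_config_params = {
    "num_samples_window": 64,
    "num_samples_overlap": 16,
    "max_number_of_iterations": 10,  # irrelevant to windowing; silently ignored
}

result = apply_windowing(list(range(1000)), **processing_config_params)
```

The benefit is that the call sites stay uniform (`method(data, **processing_config_params)`) even as individual methods evolve their parameter lists.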
It is possible that this issue, while not blocked by the TF kernel definition, is at least entangled with it ...
There are several topics to address, and they may be split into separate issues. Ultimately, what we require of the config is that all user-controlled parameters are exposed.
A Validation:
Currently the processing config classes are instances of BaseDict in mt_metadata/mt_metadata/base/schema.py, e.g.
To inherit the validators we would need to use mt_metadata.base.metadata.Base()
Validation methods can check things like:
- bands are consistent with the windowing scheme & sample rate
- minimum_number_of_cycles is reasonable, else warn
- optional values not used are set to None, [ ], "", etc.
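The checks above could be sketched as follows. This is a hedged illustration, not the mt_metadata validator API; the field names (`sample_rate`, `bands`, `minimum_number_of_cycles`, `remote_reference_station`) are assumptions patterned on the discussion.

```python
import warnings

def validate_processing_config(cfg):
    """Illustrative validation pass over a plain-dict processing config."""
    nyquist = cfg["sample_rate"] / 2.0
    # Bands must be consistent with the sample rate (below Nyquist, ordered).
    for lo, hi in cfg["bands"]:
        if not (0 < lo < hi <= nyquist):
            raise ValueError(f"band ({lo}, {hi}) inconsistent with Nyquist {nyquist}")
    # minimum_number_of_cycles should be reasonable, else warn.
    if cfg.get("minimum_number_of_cycles", 0) < 1:
        warnings.warn("minimum_number_of_cycles < 1 is probably unreasonable")
    # Normalize unused optional values ("", [], {}) to None.
    for key in ("remote_reference_station",):
        if cfg.get(key) in ("", [], {}):
            cfg[key] = None
    return cfg

cfg = validate_processing_config({
    "sample_rate": 1.0,
    "bands": [(0.01, 0.1), (0.1, 0.4)],
    "minimum_number_of_cycles": 10,
    "remote_reference_station": "",
})
```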
B Reproducible Results The config also supports having the mth5_path in it, but we can use this by either:
Regardless of which of these two options is used to bind the processing config to its dataset at the start of the pipeline, the mth5_path should be added to the config during processing and ultimately wind up stored with the output TF. Currently, none of the codes (Matlab or EMTF) stores a complete list of parameters with the output, which blocks tests of reproducibility.
C Within-Config Interfaces There are two tiers in the config data structure. We call these the Processing Run Tier and the Decimation Level Tier. The Processing Run Tier holds information that is globally relevant: things like where the data are stored, the names of the local and remote reference stations, the sample rate of the data in the mth5, etc.
At the Decimation Level Tier are specific parameters related to processing time series at a fixed sampling rate.
Most parameters on the Decimation Level Tier are actually repeated at each decimation level.
It may be worth updating the code that manages the Processing Configuration to allow most Decimation Level Tier values to be specified at the top level. The reason for using two levels was to emulate EMTF. In practice, the values most likely to differ between decimation levels are the frequency-band specifications.
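One way to sketch that arrangement, with illustrative key names rather than the actual schema: Decimation Level Tier values default to the run-level values unless a level overrides them, so typically only the bands differ per level.

```python
# Run-level defaults shared by all decimation levels (illustrative names).
run_tier = {
    "local_station": "CAS04",
    "remote_station": "NVR08",
    "sample_rate": 1.0,
    "num_samples_window": 128,  # shared default
}

# Per-level entries carry only what differs; here, mostly the bands.
decimation_levels = [
    {"bands": [(0.1, 0.4)]},                             # inherits window size
    {"bands": [(0.01, 0.1)], "num_samples_window": 256}, # overrides it
]

def resolve(run_tier, level):
    """Merge run-level defaults with per-level overrides."""
    merged = dict(run_tier)
    merged.update(level)
    return merged

levels = [resolve(run_tier, lvl) for lvl in decimation_levels]
```

With this layout, repeating every parameter at every decimation level (as EMTF does) becomes an explicit choice rather than a structural requirement.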
Also, a sub-config export method should be considered. Many attributes of the config are accessed only by certain routines, and there is a minimalist satisfaction in not passing extraneous information around when it is not required. A nice approach would be export methods governed by a keyword, so that an exported config contains only the parameters needed by the process denoted by that keyword. For example, the config may contain max_number_of_iterations and max_number_of_redescending_iterations, but there is no need to share these parameters with a time-domain windowing function, whereas num_samples_overlap and num_samples_window are required. A system that extracts individual process configs from the global config would make the passed data structures simpler. That said, we try to use simple dictionaries, and having unaccessed entries does not actually harm any of the methods (yet).
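A keyword-governed export could be as simple as a registry mapping each process keyword to the parameter names it needs. This is a sketch under assumed names; the keywords and the registry itself are hypothetical.

```python
# Hypothetical registry: which parameters each processing step needs.
SUBCONFIG_KEYS = {
    "window": ["num_samples_window", "num_samples_overlap"],
    "regression": ["max_number_of_iterations",
                   "max_number_of_redescending_iterations"],
}

def export_subconfig(config, keyword):
    """Extract only the parameters needed by the process named by keyword."""
    return {k: config[k] for k in SUBCONFIG_KEYS[keyword] if k in config}

config = {
    "num_samples_window": 128,
    "num_samples_overlap": 32,
    "max_number_of_iterations": 10,
    "max_number_of_redescending_iterations": 2,
}

window_cfg = export_subconfig(config, "window")
```

The registry also doubles as documentation: reading SUBCONFIG_KEYS tells you which parameters each stage of the pipeline actually consumes.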
D Formalize the Processing Config schema. This means, at a minimum, that all user-controllable parameters are listed and their significance in the processing pipeline is defined.