Make config a metadata object

kujaku11 commented 2 years ago

Make the config objects subclasses of mt_metadata.base.Base. This would make it easier to translate between TF and MTH5 when storing transfer functions, and would help keep track of the config parameters.

@kkappler I will try to factor the current config but will ping you for questions and guidance on how to do this.

kujaku11 commented 2 years ago

@kkappler

Looking at the config, a factorization might be:

window
- num_samples
- overlap
- family
  - alpha
  - kwargs
  - sample rate
prewhitening
- type
detrend
- type
iterator
- max_iterations
- max_redescending_iterations
estimation
- engine
- estimate_per_channel
- input_channels
- output_channels
- reference_channels
- decimation
- factor
- level_id
- method
- window (object)
- prewhitening (object)
- detrend (object)
station
- mth5_path
- id
run_config
- id
- station to process (station object)
- reference (station object)
- initial_sample_rate
- channel_scale_factors
- decimation (object)
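One way to picture the nesting proposed above is with plain dataclasses. This is only a rough sketch: it does not use mt_metadata.base.Base, the class names and default values are assumptions made for illustration, and only a subset of the listed fields is included.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Window:
    # Default values here are illustrative assumptions.
    num_samples: int = 128
    overlap: int = 32
    family: str = "boxcar"


@dataclass
class Decimation:
    level_id: int = 0
    factor: int = 1
    method: str = "default"
    window: Window = field(default_factory=Window)
    prewhitening_type: str = "first difference"
    detrend_type: str = "linear"


@dataclass
class Station:
    id: Optional[str] = None
    mth5_path: Optional[str] = None


@dataclass
class RunConfig:
    id: Optional[str] = None
    station: Station = field(default_factory=Station)
    reference: Optional[Station] = None
    initial_sample_rate: float = 1.0
    decimations: List[Decimation] = field(default_factory=list)


# Hypothetical run and station ids, used only to exercise the sketch.
run = RunConfig(id="001", station=Station(id="mt01"), decimations=[Decimation()])
config_dict = asdict(run)  # nested dict, ready for json.dumps
```

Nesting the window inside each decimation level keeps per-level parameters together, while station information stays at the run level.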

kujaku11 commented 2 years ago

duplicate of #30

kujaku11 commented 2 years ago

@kkappler I've mocked up some metadata classes based on existing config files. The end result is below. Check it and see what you think, and if you have time check out the base classes under aurora.config.metadata. I got the result below by doing:

from aurora.config import Processing

p = Processing()
p.read_emtf_bands(r"aurora\aurora\config\emtf_band_setup\bs_256_26.cfg")
print(p.to_json())

{
    "processing": {
        "decimations": {
            "1": {
                "decimation_level": {
                    "anti_alias_filter": "default",
                    "bands": [
                        {
                            "band": {
                                "decimation_level": 1,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 55,
                                "index_min": 47
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 1,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 46,
                                "index_min": 39
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 1,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 37,
                                "index_min": 31
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 1,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 30,
                                "index_min": 25
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 1,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 24,
                                "index_min": 20
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 1,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 19,
                                "index_min": 16
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 1,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 15,
                                "index_min": 13
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 1,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 12,
                                "index_min": 10
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 1,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 9,
                                "index_min": 8
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 1,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 7,
                                "index_min": 6
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 1,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 5,
                                "index_min": 5
                            }
                        }
                    ],
                    "decimation.factor": 1.0,
                    "decimation.level": 1,
                    "decimation.method": "default",
                    "decimation.sample_rate": 1.0,
                    "extra_pre_fft_detrend_type": "linear",
                    "input_channels": [
                        "hx",
                        "hy"
                    ],
                    "output_channels": [
                        "ex",
                        "ey",
                        "hz"
                    ],
                    "prewhitening_type": "first difference",
                    "regression.max_iterations": 10,
                    "regression.max_redescending_iterations": 10,
                    "regression.minimum_cycles": 10,
                    "window.num_samples": 128,
                    "window.overlap": 32,
                    "window.type": "boxcar"
                }
            },
            "2": {
                "decimation_level": {
                    "anti_alias_filter": "default",
                    "bands": [
                        {
                            "band": {
                                "decimation_level": 2,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 17,
                                "index_min": 14
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 2,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 13,
                                "index_min": 11
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 2,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 10,
                                "index_min": 9
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 2,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 8,
                                "index_min": 7
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 2,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 6,
                                "index_min": 6
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 2,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 5,
                                "index_min": 5
                            }
                        }
                    ],
                    "decimation.factor": 1.0,
                    "decimation.level": 2,
                    "decimation.method": "default",
                    "decimation.sample_rate": 1.0,
                    "extra_pre_fft_detrend_type": "linear",
                    "input_channels": [
                        "hx",
                        "hy"
                    ],
                    "output_channels": [
                        "ex",
                        "ey",
                        "hz"
                    ],
                    "prewhitening_type": "first difference",
                    "regression.max_iterations": 10,
                    "regression.max_redescending_iterations": 10,
                    "regression.minimum_cycles": 10,
                    "window.num_samples": 128,
                    "window.overlap": 32,
                    "window.type": "boxcar"
                }
            },
            "3": {
                "decimation_level": {
                    "anti_alias_filter": "default",
                    "bands": [
                        {
                            "band": {
                                "decimation_level": 3,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 17,
                                "index_min": 14
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 3,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 13,
                                "index_min": 11
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 3,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 10,
                                "index_min": 9
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 3,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 8,
                                "index_min": 7
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 3,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 6,
                                "index_min": 6
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 3,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 5,
                                "index_min": 5
                            }
                        }
                    ],
                    "decimation.factor": 1.0,
                    "decimation.level": 3,
                    "decimation.method": "default",
                    "decimation.sample_rate": 1.0,
                    "extra_pre_fft_detrend_type": "linear",
                    "input_channels": [
                        "hx",
                        "hy"
                    ],
                    "output_channels": [
                        "ex",
                        "ey",
                        "hz"
                    ],
                    "prewhitening_type": "first difference",
                    "regression.max_iterations": 10,
                    "regression.max_redescending_iterations": 10,
                    "regression.minimum_cycles": 10,
                    "window.num_samples": 128,
                    "window.overlap": 32,
                    "window.type": "boxcar"
                }
            },
            "4": {
                "decimation_level": {
                    "anti_alias_filter": "default",
                    "bands": [
                        {
                            "band": {
                                "decimation_level": 4,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 22,
                                "index_min": 18
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 4,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 17,
                                "index_min": 14
                            }
                        },
                        {
                            "band": {
                                "decimation_level": 4,
                                "frequency_max": 0,
                                "frequency_min": 0,
                                "index_max": 13,
                                "index_min": 10
                            }
                        }
                    ],
                    "decimation.factor": 1.0,
                    "decimation.level": 4,
                    "decimation.method": "default",
                    "decimation.sample_rate": 1.0,
                    "extra_pre_fft_detrend_type": "linear",
                    "input_channels": [
                        "hx",
                        "hy"
                    ],
                    "output_channels": [
                        "ex",
                        "ey",
                        "hz"
                    ],
                    "prewhitening_type": "first difference",
                    "regression.max_iterations": 10,
                    "regression.max_redescending_iterations": 10,
                    "regression.minimum_cycles": 10,
                    "window.num_samples": 128,
                    "window.overlap": 32,
                    "window.type": "boxcar"
                }
            }
        },
        "stations.local.channel_scale_factors": [],
        "stations.local.id": null,
        "stations.local.mth5_path": null,
        "stations.local.remote": false,
        "stations.remote": []
    }
}

kkappler commented 2 years ago

This looks pretty good.

I assume it wouldn't be very hard to change the schema so that, for example, prewhitening_type = "first difference" becomes

prewhitening.type = "arma"
prewhitening.ar_order = 3
prewhitening.ma_order = 3

or similar, i.e. the schema can evolve over time...
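As a rough illustration of that kind of schema evolution, a flat decimation-level dict using the old key could be upgraded to a dotted key with a small helper. The function name is hypothetical and this operates on a plain dict, not on the actual metadata classes.

```python
def upgrade_prewhitening(level: dict) -> dict:
    """Return a copy of a flat decimation-level dict that uses the dotted
    'prewhitening.type' key instead of the older 'prewhitening_type' key."""
    upgraded = dict(level)  # shallow copy so the caller's dict is untouched
    if "prewhitening_type" in upgraded:
        upgraded["prewhitening.type"] = upgraded.pop("prewhitening_type")
    return upgraded


old = {"prewhitening_type": "first difference", "window.num_samples": 128}
new = upgrade_prewhitening(old)
```

Additional keys such as prewhitening.ar_order could then be added without disturbing configs that only carry the type.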

Towards practically integrating with the existing tests in aurora, we need:

1. To confirm the to/from json methods are working, as many existing tests create/save a json and then access the config via file during test processing execution.
2. To consider whether or not the mth5 filename may be embedded in the config. I am supporting that option currently but would be happy to deprecate it ... the filename belongs more with the DatasetDefinition part of the TFKernel than the processing config anyhow.
3. A way to return a merged object that combines the station information (at top level) with the parameters at a particular decimation_level. This is done in process_mth5_run by explicitly packing local_station_id and reference_station_id into each decimation level dict (repeatedly, at each decimation level). It would be cleaner if the Processing() object could return a decimation_level with the station info embedded in it in a one-line call.
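The merged object requested in the last point could look roughly like the following, working on the dict form of the JSON shown earlier in the thread. The function name and the "test1" station id are hypothetical; the real Processing class may expose this differently.

```python
import copy


def decimation_level_with_stations(config: dict, level_key: str) -> dict:
    """Copy one decimation level and embed the top-level 'stations.*' entries."""
    processing = config["processing"]
    merged = copy.deepcopy(processing["decimations"][level_key]["decimation_level"])
    for key, value in processing.items():
        if key.startswith("stations."):
            merged[key] = copy.deepcopy(value)
    return merged


example = {
    "processing": {
        "decimations": {
            "1": {"decimation_level": {"decimation.level": 1, "window.num_samples": 128}},
        },
        "stations.local.id": "test1",  # hypothetical station id
        "stations.remote": [],
    }
}
level_1 = decimation_level_with_stations(example, "1")
```

Because the level is deep-copied, the station entries are not written back into the stored config, which avoids repeating them at every decimation level.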

kujaku11 commented 2 years ago

@kkappler A couple of questions regarding the config:

What information do you need to translate between frequency and index for the various bands? Just the frequency array? I'm thinking of setting frequency and index as properties so that each updates when the other is set, but if you need more information, where should that information be stored?
I added a run list to station that looks like this, any thoughts? I figured the time period should be set at the run level, unless you think that down the road the time period could be set at the channel level? And should we make time period be a list of time periods in the event of masking data?

Why does sample_rate default to -1 instead of 0? Does that mean it doesn't exist?

[{
 "run": {
     "id": [
         "None"
     ],
     "input_channels": [
         {
             "channel": {
                 "id": "hx",
                 "scale_factor": 1.0
             }
         },
         {
             "channel": {
                 "id": "hy",
                 "scale_factor": 1.0
             }
         }
     ],
     "output_channels": [
         {
             "channel": {
                 "id": "hz",
                 "scale_factor": 1.0
             }
         },
         {
             "channel": {
                 "id": "ex",
                 "scale_factor": 1.0
             }
         },
         {
             "channel": {
                 "id": "ey",
                 "scale_factor": 1.0
             }
         }
     ],
     "sample_rate": -1.0,
     "time_period.end": "1980-01-01T00:00:00+00:00",
     "time_period.start": "1980-01-01T00:00:00+00:00"
 }
}]

Regarding number 2 in the previous comment, about the file name being in the config: it would be good if we could have everything in the same file to minimize inputs. What other parameters are in the transfer function kernel that aren't in the processing config?

kujaku11 commented 2 years ago

@kkappler In your decimation config there are attributes:

decimation_factor - which is the factor by which to decimate
sample_rate - sample rate after decimation?

Could we have initial_sample_rate be the original sample rate pre decimation and then sample_rate would be a property of initial_sample_rate / decimation_factor?

The reason I ask is that some functions require the initial sample rate, not the decimated sample rate, and it would be good for the decimation object to have that information.

kkappler commented 2 years ago

@kujaku11 sorry, I thought I had sent my reply to this already. The way it is set up now, sample_rate in the decimation_level_config is the sample_rate after decimation. As you suggest, it is derived from initial_sampling_rate which is on the top level of the config, and is ultimately sourced from the mth5.

sample_rate is actually redundant information, but I would like to keep it there because it is much more intuitive to someone inspecting the config than deducing it from the decimation_factor and the initial_sample_rate.
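A minimal sketch of how both wishes could coexist: sample_rate is derived from initial_sample_rate and the decimation factor, yet still written out when the config is serialized. The class name and dict keys are assumptions, and the factor is assumed here to be the total decimation relative to the initial sample rate.

```python
from dataclasses import dataclass


@dataclass
class DecimationSketch:
    initial_sample_rate: float
    factor: int = 1  # assumed: total decimation relative to the initial rate

    @property
    def sample_rate(self) -> float:
        """Sample rate after decimation, derived rather than stored."""
        return self.initial_sample_rate / self.factor

    def to_dict(self) -> dict:
        # sample_rate is redundant, but is written out for readability.
        return {
            "decimation.factor": self.factor,
            "decimation.initial_sample_rate": self.initial_sample_rate,
            "decimation.sample_rate": self.sample_rate,
        }


dec = DecimationSketch(initial_sample_rate=256.0, factor=4)
dec_dict = dec.to_dict()
```

Deriving the value means the two numbers cannot drift apart, while someone inspecting the serialized config still sees the decimated rate directly.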

kkappler commented 2 years ago

@kujaku11 : What information do you need to translate between frequency and index for the various bands? Just the frequency array? I'm thinking of setting frequency and index as properties so that each updates when the other is set, but if you need more information, where should that information be stored?

Translating between frequency and index needs only the sample_rate of the data and window.num_samples, both of which the config already has. The other thing needed is a rule about frequency band edges: whether a frequency band, which is an interval, is open, half-open, or closed. That rule will come from the FrequencyBand and FrequencyBands classes.
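For reference, the translation follows the usual DFT relation, where harmonic index k of an N-sample window at sample rate fs sits at frequency k * fs / N. The sketch below uses hypothetical function names and simply reports the frequencies of the lowest and highest harmonics in a band; it does not implement any open/closed edge rule.

```python
def index_to_frequency(index: int, sample_rate: float, num_samples: int) -> float:
    """Frequency (Hz) of DFT harmonic `index` for a window of `num_samples`."""
    return index * sample_rate / num_samples


def frequency_to_index(frequency: float, sample_rate: float, num_samples: int) -> int:
    """Nearest DFT harmonic index to `frequency` (Hz)."""
    return round(frequency * num_samples / sample_rate)


def band_frequency_limits(index_min: int, index_max: int,
                          sample_rate: float, num_samples: int) -> tuple:
    """Frequencies of the lowest and highest harmonics in a band."""
    return (
        index_to_frequency(index_min, sample_rate, num_samples),
        index_to_frequency(index_max, sample_rate, num_samples),
    )


# First band of decimation level 1 in the JSON above: indices 47 to 55,
# with window.num_samples = 128 and decimation.sample_rate = 1.0.
f_min, f_max = band_frequency_limits(47, 55, sample_rate=1.0, num_samples=128)
```

With these inputs the band spans 47/128 Hz to 55/128 Hz, which is the kind of value that could replace the zeros currently stored in frequency_min and frequency_max.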

I added a run list to station that looks like this, any thoughts? I figured the time period should be set at the run level, unless you think that down the road the time period could be set at the channel level? And should we make time period be a list of time periods in the event of masking data?

I think this is great! It occurs to me that if packaged like this, with the run_list, this representation of the config is basically an instance of a TransferFunctionKernel. It tells what data runs to process, where the data are, and has the recipe for doing the processing. The only thing potentially missing here, which the DatasetDefinition class I am fiddling with supports, is the splitting of runs to excise segments that one would not want to process. I want to think about that some. I still think that including a DataFrame that lists the time series blocks as an optional argument to the process_mth5 function is worth having.

Why does sample_rate default to -1 instead of 0? Does that mean it doesn't exist?

I think maybe I was concerned about divide by zero exceptions. I don't think the value is too important, I just didn't want a potentially valid value being put in as a default, to protect against launching jobs that didn't explicitly get that value from somewhere (like the mth5). This is another case of a piece of redundant information, since it is in the mth5 we could access the sample_rate from the mth5, but as a user, I like the idea of that information being in the config when I am inspecting it.
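The idea of an obviously invalid default can be sketched as a guard that refuses to run until a real value has been supplied. The function and constant names are hypothetical, and mth5_value stands in for whatever sample rate would be read from the mth5.

```python
from typing import Optional

UNSET_SAMPLE_RATE = -1.0  # sentinel: never a valid sample rate


def resolve_sample_rate(config_value: float, mth5_value: Optional[float] = None) -> float:
    """Return a usable sample rate, or raise if none was explicitly provided."""
    if config_value > 0:
        return config_value
    if mth5_value is not None and mth5_value > 0:
        return mth5_value
    raise ValueError("sample_rate was never set; refusing to launch processing")


rate = resolve_sample_rate(UNSET_SAMPLE_RATE, mth5_value=8.0)
```

A non-positive default fails loudly at this check, which is the protection against launching jobs with a silently assumed value.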

kkappler commented 2 years ago

@kujaku11 I'm making good progress in integrating the new processing config based on mt_metadata.

The following issue has come up when I want to compute the STFT. The required properties are:

taper_family
num_samples_window
num_samples_overlap
taper_additional_args
sample_rate
prewhitening_type
extra_pre_fft_detrend_type

It would be useful if one could define a dict from the processing config via a method. For example, I would like processing_config.decimations[0].stft or processing_config.decimations[0].stft() to return either a dict or an object that has exactly the params I listed above.

The way things are set up now, each atom of metadata is defined in exactly one of the standards.json files, and I think this is what we want, BUT ideally, we want the ability to define custom methods that return user-defined mixtures of these atoms, either at the aurora.config.metadata.processing.Processing or aurora.config.metadata.decimation_level.DecimationLevel layers.

Also, FWIW, I can see that taper_family is supported as dec_level_config.window.type. But we also need to add additional_args to window, which should default to an empty dictionary.
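A sketch of such a custom "mixture" method, written as a free function over the flat decimation-level dict shown earlier in the thread. The function name is hypothetical, and the "window.additional_args" key does not exist in the JSON above; it is assumed here with an empty-dict default as proposed.

```python
def stft_params(level: dict) -> dict:
    """Collect exactly the parameters needed for the STFT from one
    flat decimation-level dict."""
    return {
        "taper_family": level["window.type"],
        "num_samples_window": level["window.num_samples"],
        "num_samples_overlap": level["window.overlap"],
        # Assumed key; defaults to an empty dict when absent.
        "taper_additional_args": dict(level.get("window.additional_args", {})),
        "sample_rate": level["decimation.sample_rate"],
        "prewhitening_type": level["prewhitening_type"],
        "extra_pre_fft_detrend_type": level["extra_pre_fft_detrend_type"],
    }


level = {
    "window.type": "boxcar",
    "window.num_samples": 128,
    "window.overlap": 32,
    "decimation.sample_rate": 1.0,
    "prewhitening_type": "first difference",
    "extra_pre_fft_detrend_type": "linear",
}
params = stft_params(level)
```

Each atom is still defined in one place; the helper only regroups existing values under the names the STFT step expects.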

kkappler commented 2 years ago

decimation.factor should be forced to be an integer
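A small sketch of what forcing an integer could mean in practice: accept values such as 4.0 that are whole numbers, and reject anything fractional or below 1. The function name is hypothetical and this is not tied to how mt_metadata validates types.

```python
def coerce_decimation_factor(value) -> int:
    """Convert a decimation factor to int, rejecting non-integral values."""
    as_float = float(value)
    if not as_float.is_integer() or as_float < 1:
        raise ValueError(f"decimation factor must be a positive integer, got {value!r}")
    return int(as_float)


factor = coerce_decimation_factor(4.0)
```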

kkappler commented 2 years ago

@kujaku11 The remote_reference test on synthetic is now passing locally with the new Processing class.

Two things to discuss:

1. The Processing class does not have fields for "reference channel ids". I am instead using dec_config.input_channels for this. Normally this should be fine: the remote will usually use hx, hy, same as the local input channels. This is worth a quick email to Gary, however, because I think it is possible that the remote reference can (theoretically) use electric channels.
2. The remote reference station is list-like in the Processing class. This makes sense for defining a processing "campaign" of sorts, but for a single TF estimate we will not in general want to mix reference stations (I think), so from a TF Kernel perspective there should only be a single RR.
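One possible way to reconcile a list-like remote field with a single-RR kernel is a helper that returns the only remote, or None for single-station processing, and raises if several are present. The function name is hypothetical and the entries are treated as opaque objects.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def single_remote(remotes: Sequence[T]) -> Optional[T]:
    """Return the one remote reference, None if there is none, and
    raise if more than one is configured."""
    if len(remotes) > 1:
        raise ValueError(f"expected at most one remote reference, got {len(remotes)}")
    return remotes[0] if remotes else None


rr = single_remote(["rr1"])  # "rr1" is a placeholder station id
```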

kkappler commented 2 years ago

This is working now. All tests are passing. The synthetic results are in agreement with EMTF to within 1e-4 in both rho and phi.

The Parkfield results change slightly, for neither better nor worse. Here are the RR results from the old processing config, etc.: [image: RR_20220528_old]

And here from the new: [image: RR_20220528_new]

Basically, the phases are a little better around 10s now, but the apparent resistivity is a little more different (but still negligibly so) from EMTF.

simpeg / aurora

Make config a metadata object #153