usgs / groundmotion-processing

Parsing and processing ground motion data

Make GMrecordsApp scriptable #885

Closed baagaard-usgs closed 2 years ago

baagaard-usgs commented 2 years ago

Refactor GMrecordsApp so that it is scriptable.

Why GMrecordsApp is not scriptable

  1. There is no mechanism to request that the application run a subcommand without using command line arguments.
  2. The GMrecordsApp constructor includes parsing the command line arguments.

Proposed solution

Refactor the constructor and main() so that keyword arguments can be passed to main(), with the keyword arguments providing behavior equivalent to the command line arguments.

In most cases this simply requires:

def main(self, **kwargs):
    self.args = argparse.Namespace(**kwargs) if kwargs else self._parse_command_line()

In this case we need to move the call to _parse_command_line() (and everything after it) from the constructor to main(). I propose adding an initialize() method for the work that currently follows _parse_command_line() in the constructor, so as not to clutter main().
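A minimal sketch of the proposed split, using a toy stand-in class (everything here other than main(), initialize(), and _parse_command_line() is illustrative, not the actual gmprocess API):

```python
import argparse


class App:
    """Toy stand-in for GMrecordsApp showing the proposed structure."""

    def __init__(self):
        # The constructor no longer parses the command line; it only
        # sets up state that does not depend on arguments.
        self.args = None

    def main(self, **kwargs):
        # Keyword arguments substitute for command line arguments.
        self.args = (
            argparse.Namespace(**kwargs) if kwargs else self._parse_command_line()
        )
        self.initialize()
        return self.args

    def initialize(self):
        # Everything that used to follow _parse_command_line() in the
        # constructor (logging setup, loading subcommands, ...) goes here.
        pass

    def _parse_command_line(self):
        parser = argparse.ArgumentParser()
        parser.add_argument("--subcommand")
        return parser.parse_args()


app = App()
args = app.main(subcommand="assemble", debug=False)
print(args.subcommand)  # -> assemble
```

With this shape, scripts call main() with keywords while the CLI path (no keywords) still goes through argparse.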

jmfee-usgs commented 2 years ago

Another option is to refactor one step further:

baagaard-usgs commented 2 years ago

@jmfee-usgs What do you mean by gmrecords()? The constructor or a separate function for executing the desired subcommand?

The subcommands take a GMrecordsApp argument and use its args data member, so it looks like it would take a fair amount of work to disentangle the arguments from GMrecordsApp and the subcommands.
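The coupling described here can be pictured with a toy example (all names and the event id below are illustrative, not the actual gmprocess API):

```python
import argparse


class FakeApp:
    """Toy stand-in for GMrecordsApp; holds parsed arguments."""

    def __init__(self, **kwargs):
        self.args = argparse.Namespace(**kwargs)


class ComputeStationMetrics:
    """Toy subcommand that reads its settings from the app object."""

    def main(self, gmrecords):
        # The subcommand reaches back into the application object for
        # its arguments; this is the coupling that makes it hard to
        # call a subcommand without a fully configured app.
        eventid = gmrecords.args.eventid
        return f"computing station metrics for {eventid}"


app = FakeApp(eventid="us1000abcd", overwrite=False)
print(ComputeStationMetrics().main(app))
```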

jmfee-usgs commented 2 years ago

@baagaard-usgs Yes, that would be a fair amount of work and a lot depends on the intended usage (cli only vs api).

Looking at the compute_station_metrics subcommand, its main method calls self._get_events() (from the base class) and then uses self.events. The base class in turn calls base_utils.get_events with kwargs, which is closer to what I'd expect as an API user. This takes some effort to trace and seems tightly coupled.
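A kwargs-style entry point of the kind described here might look like the following toy sketch (the signature and the stub body are assumptions for illustration, not the real base_utils.get_events):

```python
def get_events(eventids=None, textfile=None, data_dir=None):
    """Stub: return the list of event ids to process.

    A real implementation would fall back to reading `textfile` or
    scanning `data_dir` when no explicit ids are given.
    """
    if eventids:
        return list(eventids)
    return []


# Callable directly from user code, with no GMrecordsApp or argparse
# Namespace in sight.
events = get_events(eventids=["us1000abcd"])
print(events)  # -> ['us1000abcd']
```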

Not sure whether users are writing their own scripts, or whether there are utility classes or similar for computing station metrics (or other subcommands) without using the CLI. Disentangling may be worth the effort.

Thinking along the lines of separate methods/classes that would all be coordinated by main based on args, but could be used independently to:

baagaard-usgs commented 2 years ago

My main use case is being able to easily run all processing steps with one command. I also like to have everything in a Python script under version control to document all the steps and parameters.

#!/usr/bin/env python3

import os
import pathlib

from gmprocess.apps.gmrecords import GMrecordsApp

NUM_PROCESSES = 0
DATA_DIR = "data-waveforms"
DATA_SETS = ("cesmd", "fdsn")

STEPS = (
    "assemble",
    "process_waveforms",
    "compute_station_metrics",
    "compute_waveform_metrics",
    "export_metric_tables",
    "generate_report",
    )

cwd = os.getcwd()
for dset in DATA_SETS:
    # Build an absolute path so chdir() works on every iteration,
    # not just the first.
    work_dir = pathlib.Path(cwd) / DATA_DIR / dset
    os.chdir(work_dir)
    app = GMrecordsApp()
    app.load_subcommands()
    args = {
        "debug": False,
        "quiet": True,
        "eventid": None,
        "textfile": None,
        "overwrite": False,
        "num_processes": NUM_PROCESSES,
        }
    for step in STEPS:
        print(f"Running '{step}'...")
        step_args = {
            "subcommand": step,
            "func": app.classes[step]["class"],
            "log": f"{step}.log",
            }
        args.update(step_args)
        app.main(**args)
os.chdir(cwd)

jmfee-usgs commented 2 years ago

Agreed on running all steps in one command/script, and on being able to customize that without leaving Python. It's also useful not to require the filesystem for every step. The filesystem is always an option, but separating IO from processing logic makes the code easier to integrate with databases or to run in a web service.

Not familiar with all the details, but something along these lines:

import logging
import pathlib

# load_events, process_waveforms, export_metric_tables, and
# generate_report would come from the same hypothetical module.
from gmprocess.steps import assemble, compute_station_metrics, ...

DATA_DIR = "data-waveforms"
DATA_SETS = ("cesmd", "fdsn")

def custom_process(work_dir):
    for event in load_events(dir=work_dir):
        waveforms = assemble(event=event, log="assemble.log")
        processed = process_waveforms(waveforms)
        metrics = compute_station_metrics(processed)
        export_metric_tables(metrics, dir=work_dir)
        report = generate_report(processed, metrics)

if __name__ == '__main__':
    logging.basicConfig(level=logging.INFO)

    data_dir = pathlib.Path(DATA_DIR)
    for data_set in DATA_SETS:
        work_dir = data_dir / data_set
        custom_process(work_dir)