usgs / groundmotion-processing

Parsing and processing ground motion data

Earthquake-centric workflow #1054

Closed by baagaard-usgs 2 years ago

baagaard-usgs commented 2 years ago

Currently, gmrecords scans the data directory for all earthquakes and runs a subcommand on all of those earthquakes. This workflow can be difficult for large datasets. If something goes wrong for one earthquake, it is difficult to pick up from where one left off after resolving the issue. Additionally, it is not easy to add more earthquakes and process the records.

A more efficient and extensible workflow would be to run all of the processing steps (fetch, assemble, process, compute metrics, ...) for one event. This can be done using the current code, but requires a high-level script to keep track of the earthquakes to process. I think we should reconsider the default behavior and whether we should include a high-level script to manage processing large datasets using a queue or scheduler. This seems to be consistent with some of Mike's ideas for automated processing in the cloud. It might also permit massively parallel processing of data (a scheduler farms out events to lots of compute nodes that each process an event using multiple processes/threads).
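As a rough illustration of the "scheduler farms out events" idea, here is a minimal sketch that uses a local thread pool as a stand-in for a real queue or cluster scheduler. The names `process_event` and `farm_out` are hypothetical and not part of gmprocess. The worker is a placeholder for whatever runs all processing steps for a single event.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_event(event_id):
    """Hypothetical placeholder: run every processing step for one event.

    In practice this might shell out to gmrecords for this event only.
    """
    return f"{event_id}: done"


def farm_out(event_ids, worker=process_event, max_workers=4):
    """Dispatch one task per event and collect results and errors separately.

    A failure in one event is recorded and does not stop the other events.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its event id so failures can be attributed.
        futures = {pool.submit(worker, eid): eid for eid in event_ids}
        for future in as_completed(futures):
            eid = futures[future]
            try:
                results[eid] = future.result()
            except Exception as exc:
                errors[eid] = str(exc)
    return results, errors
```

Because each event is an independent task, the events listed in `errors` can be resubmitted on their own once the underlying problem is fixed, and new events can be added without touching the ones already processed.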

baagaard-usgs commented 2 years ago

I think we want to keep command line usage for a small project simple, with no extra dependencies. The workflow for large datasets that involve databases and cloud computing or clusters can be provided in a separate repo. However, it would be nice if the processing workflows for the two cases were consistent.

The workflow that makes the most sense to me is:

  1. Generate the list of earthquake event ids (outside gmprocess)
  2. Loop over earthquakes. For each earthquake: download, assemble, process, compute metrics, generate event reports
  3. Export tables

This would suggest that we remove the loop over earthquakes in each subcommand and provide a high-level script to do the looping over earthquakes and call the desired subcommands to be run for each event.
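A minimal sketch of what such a high-level script might look like, with no dependencies beyond the standard library. Everything here is an assumption for illustration: the step names in `STEPS`, the `gmrecords --eventid <id> <subcommand>` call form in `run_step`, and the JSON state file are hypothetical and should be checked against the installed gmrecords help output. The state file records which steps finished for each event, so a rerun picks up where the previous run left off.

```python
import json
import subprocess
from pathlib import Path

# Assumed subcommand names, one per per-event step; verify before use.
STEPS = [
    "download",
    "assemble",
    "process_waveforms",
    "compute_station_metrics",
    "compute_waveform_metrics",
    "generate_report",
]


def run_step(event_id, step):
    """Run one subcommand for one event (assumed CLI form, not confirmed)."""
    subprocess.run(["gmrecords", "--eventid", event_id, step], check=True)


def process_events(event_ids, state_file, steps=STEPS, runner=run_step):
    """Run all steps for each event, skipping steps already recorded as done.

    Returns a list of (event_id, failed_step, message) tuples.
    """
    state_path = Path(state_file)
    done = json.loads(state_path.read_text()) if state_path.exists() else {}
    failed = []
    for event_id in event_ids:
        finished = done.setdefault(event_id, [])
        try:
            for step in steps:
                if step in finished:
                    continue  # completed in an earlier run
                runner(event_id, step)
                finished.append(step)
                # Persist after every step so a crash loses at most one step.
                state_path.write_text(json.dumps(done, indent=2))
        except Exception as exc:
            # Record the failure and move on to the next earthquake.
            failed.append((event_id, step, str(exc)))
    return failed
```

The list of event ids is generated outside this script (step 1), and table export (step 3) would run once after the loop finishes. Adding more earthquakes later only requires calling `process_events` again with the longer list.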

@mhearne-usgs What tools do you have in mind for managing processing in the cloud? celery+RabbitMQ?

emthompson-usgs commented 2 years ago

Plan to address this: