baagaard-usgs (issue closed 2 years ago)
I think we want to keep command line usage for a small project simple, with no extra dependencies. The workflow for large datasets that involve databases and cloud computing or clusters can be provided in a separate repo. However, it would be nice if the processing workflows for the two cases were consistent.
The workflow that makes the most sense to me is to run all of the processing steps (download, assemble, process waveforms, compute metrics, generate reports and maps) for one event at a time.
This would suggest removing the loop over earthquakes from each subcommand and providing a high-level script that loops over the earthquakes and calls the desired subcommands for each event; a sketch of such a driver follows.
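For concreteness, here is a minimal Python sketch of that kind of high-level driver (not part of gmprocess). It assumes gmrecords is on the PATH and that each subcommand accepts an --eventid option; the event IDs are placeholders, and the flag name and placement should be checked against the actual CLI.

```python
# Hypothetical driver: process one event end to end before moving on.
# The "--eventid" flag and its placement are assumptions; check the
# actual gmrecords CLI before using this.
import subprocess
import sys

EVENT_IDS = ["us1000abcd", "us2000efgh"]  # placeholder ComCat event IDs

STEPS = [
    "download",
    "assemble",
    "process_waveforms",
    "compute_station_metrics",
    "compute_waveform_metrics",
    "generate_report",
    "generate_station_maps",
]

for event_id in EVENT_IDS:
    for step in STEPS:
        result = subprocess.run(["gmrecords", step, "--eventid", event_id])
        if result.returncode != 0:
            # Stop this event at the failing step so a rerun can resume
            # here after the problem is resolved.
            print(f"{event_id}: step '{step}' failed; skipping remaining steps",
                  file=sys.stderr)
            break
```

Because each event is handled independently, resolving a problem for one event and rerunning does not require touching the others.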
@mhearne-usgs What tools do you have in mind for managing processing in the cloud? celery+RabbitMQ?
Plan to address this: add an autoprocess subcommand to the gmrecords command that combines the download, assemble, process_waveforms, compute_station_metrics, compute_waveform_metrics, generate_report, and generate_station_maps steps.
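As a rough illustration only (not the actual gmrecords implementation, which has its own subcommand framework), an autoprocess subcommand could be a thin wrapper that dispatches to the existing steps in order. The step functions below are stand-ins that just report what would run, and the -e/--eventid option is assumed for the sketch.

```python
import argparse

# Stand-ins for the existing subcommand implementations; in gmrecords each
# of these is its own subcommand. Here they only print what would run.
def make_step(name):
    def step(args):
        print(f"[{args.eventid}] running {name}")
    return step

STEP_NAMES = [
    "download",
    "assemble",
    "process_waveforms",
    "compute_station_metrics",
    "compute_waveform_metrics",
    "generate_report",
    "generate_station_maps",
]
STEPS = [make_step(name) for name in STEP_NAMES]

def autoprocess(args):
    """Run every processing step, in order, for the requested event."""
    for step in STEPS:
        step(args)

def main():
    parser = argparse.ArgumentParser(prog="gmrecords-sketch")
    parser.add_argument("-e", "--eventid", required=True,
                        help="ComCat event ID to process")
    subparsers = parser.add_subparsers(dest="command", required=True)
    auto = subparsers.add_parser(
        "autoprocess", help="Run all processing steps in sequence")
    auto.set_defaults(func=autoprocess)
    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```

Invoked as, e.g., `python sketch.py -e us1000abcd autoprocess`, this runs the steps sequentially for a single event, which is the behavior the combined subcommand is meant to provide.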
Currently, gmrecords scans the data directory for all earthquakes and runs a subcommand on every one of them. This workflow becomes difficult for large datasets: if something goes wrong for one earthquake, it is hard to pick up where one left off after resolving the issue, and it is not easy to add more earthquakes and process only the new records.
A more efficient and extensible workflow would be to run all of the processing steps (fetch, assemble, process, compute metrics, ...) for one event. This can be done using the current code, but requires a high-level script to keep track of the earthquakes to process. I think we should reconsider the default behavior and whether we should include a high-level script to manage processing large datasets using a queue or scheduler. This seems to be consistent with some of Mike's ideas for automated processing in the cloud. It might also permit massively parallel processing of data (a scheduler farms out events to lots of compute nodes that each process an event using multiple processes/threads).
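To make the last point concrete, here is a minimal local sketch of farming events out to workers, each of which runs the full per-event pipeline. Everything in it (event IDs, the --eventid flag, the worker count) is a placeholder; a cloud deployment would swap the local process pool for a real queue or scheduler such as Celery + RabbitMQ or a cluster job scheduler.

```python
# Local stand-in for the "scheduler farms out events" idea: each worker
# process runs the full per-event pipeline for one event. The --eventid
# flag is an assumption; check the actual gmrecords CLI.
import subprocess
from concurrent.futures import ProcessPoolExecutor, as_completed

STEPS = ["download", "assemble", "process_waveforms",
         "compute_station_metrics", "compute_waveform_metrics",
         "generate_report", "generate_station_maps"]

def run_event(event_id):
    """Run all processing steps for one event; return (event_id, success)."""
    for step in STEPS:
        result = subprocess.run(["gmrecords", step, "--eventid", event_id])
        if result.returncode != 0:
            return event_id, False
    return event_id, True

if __name__ == "__main__":
    event_ids = ["us1000abcd", "us2000efgh", "us3000ijkl"]  # placeholders
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(run_event, eid) for eid in event_ids]
        for future in as_completed(futures):
            event_id, ok = future.result()
            print(f"{event_id}: {'completed' if ok else 'failed'}")
```

The per-event function is the unit of work, so the same function could be handed to whatever queue or scheduler ends up managing the large-dataset workflow.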