usgs / groundmotion-processing

Parsing and processing ground motion data

Duplicate code for computing station metrics and using parallel processing #1017

Open baagaard-usgs opened 1 year ago

baagaard-usgs commented 1 year ago

Computing station metrics is one of the slowest steps when processing datasets with multiple processes, because it runs on a single process. While looking into how we might make it run in parallel, I believe I found duplicate code for computing station metrics.

We have the subcommand code subcommands.compute_station_metrics._event_station_metrics. This code would need refactoring to run in parallel.

There is also station_summary.StationSummary.compute_station_metrics. This one looks like it might be set up so that it can be called in parallel, as the code layout appears to match compute_waveform_metrics.

@emthompson-usgs Can you look at these routines to see if there are significant differences? The subcommand code looks like it is newer, so it may do things the StationSummary code does not.

emthompson-usgs commented 1 year ago

Yes, I think the subcommand code is the newest. I think that when I refactored it I forgot to delete the stuff in the StationSummary class. I thought that I had determined that the parallelization stuff wasn't helping and that this step was actually quite fast. If that is not the case, then we should definitely revisit this issue.

baagaard-usgs commented 1 year ago

I have lots of cores on my machines, so using 8-20 processes speeds everything else up a lot. Computing the station metrics is definitely one of the slowest steps, so I think it is worth making it parallel.

baagaard-usgs commented 1 year ago

@emthompson-usgs Should we update StationSummary with the code in the compute_station_metrics subcommand to match the code structure for computing the waveform metrics, or just refactor the subcommand so we can run it in parallel?
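
For reference, here is a minimal sketch of what the second option (refactoring the per-station work so it can run in parallel) could look like. It is not the gmprocess implementation: StationInput, station_metrics, and compute_all are hypothetical names, the haversine distance only stands in for the real station metrics, and it uses the standard library's concurrent.futures, which may not be the parallel backend the project uses. The point is only the layout: a module-level worker that takes picklable input for one station, plus a driver that maps it over all stations.

```python
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class StationInput:
    """Hypothetical, picklable input for one station (not a gmprocess type)."""

    station_id: str
    latitude: float
    longitude: float


def epicentral_distance_km(event_lat, event_lon, sta):
    """Haversine distance in km; a stand-in for the real station metrics."""
    radius_km = 6371.0
    phi1 = math.radians(event_lat)
    phi2 = math.radians(sta.latitude)
    dphi = phi2 - phi1
    dlam = math.radians(sta.longitude - event_lon)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    )
    return 2 * radius_km * math.asin(math.sqrt(a))


def station_metrics(task):
    """Worker for one station.

    Defined at module level so it can be pickled and sent to a child process.
    """
    event_lat, event_lon, sta = task
    distance = epicentral_distance_km(event_lat, event_lon, sta)
    return sta.station_id, {"epicentral_km": distance}


def compute_all(event_lat, event_lon, stations, num_processes=1):
    """Compute metrics for every station, serially or with a process pool."""
    tasks = [(event_lat, event_lon, sta) for sta in stations]
    if num_processes <= 1:
        # Serial path: same worker, no pool overhead.
        return dict(map(station_metrics, tasks))
    with ProcessPoolExecutor(max_workers=num_processes) as pool:
        return dict(pool.map(station_metrics, tasks))
```

Calling compute_all(lat, lon, stations, num_processes=8) from a script (under an if __name__ == "__main__": guard on platforms that spawn processes) would distribute the stations over 8 workers. Whether that actually pays off depends on how expensive the per-station work is relative to the cost of pickling the inputs and results.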