usgs / groundmotion-processing

Parsing and processing ground motion data

Use iterators when accessing the workspace #710

Closed arkottke closed 2 years ago

arkottke commented 3 years ago

Is your feature request related to a problem? Please describe. During the metric processing, lists are used to load all of the waveforms and store all of the processed results. For some events, storing all of the data requires a lot of RAM. The maximum that I saw was about 3 GB, but this would increase for metrics with more points. (3 GB isn't very much on its own, but it is a lot per process/core.)

Describe the solution you'd like Use iterators to loop over the waveforms and save the results as they are computed.
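To illustrate the memory difference, here is a minimal sketch of the list-vs-iterator pattern. All names (`load_waveform`, `load_all`, `iter_waveforms`) are illustrative, not the actual gmprocess API:

```python
def load_waveform(wid):
    # Stand-in for reading one waveform from the workspace.
    return [0.0] * 4

def load_all(wids):
    # Current pattern: every waveform is resident in memory at once.
    return [load_waveform(w) for w in wids]

def iter_waveforms(wids):
    # Proposed pattern: yield one waveform at a time, so peak memory is
    # roughly one waveform plus whatever the caller retains.
    for w in wids:
        yield load_waveform(w)

# Callers loop identically over either form, but only the generator
# keeps memory flat while results are saved as they are computed:
peaks = [max(w) for w in iter_waveforms(range(3))]
```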

Describe alternatives you've considered If dask is more fully embraced this would also be resolved, but this would require a more extensive effort.

emthompson-usgs commented 3 years ago

Sorry, I don't really follow this. Can you point me to the part of the code where you are suggesting we replace lists with iterators?

arkottke commented 3 years ago

Sorry for the slow reply. Part of my delay in responding is that it is complicated. The general idea is to use iterators (e.g., generators) rather than lists, or if you really want to embrace dask, there are dask-specific containers (e.g., bag).

For example, here: https://github.com/usgs/groundmotion-processing/blob/5aa9b8778d37e78d8b7140ff2c7578f0cf933e18/gmprocess/subcommands/compute_waveform_metrics.py#L77-L85 all of the streams are loaded before the processing starts. For really well-recorded events this might be on the order of 1000 time series. I would start by renaming getStreams(...) https://github.com/usgs/groundmotion-processing/blob/5aa9b8778d37e78d8b7140ff2c7578f0cf933e18/gmprocess/io/asdf/stream_workspace.py#L409 to iterStreams(...), and instead of returning a list: https://github.com/usgs/groundmotion-processing/blob/5aa9b8778d37e78d8b7140ff2c7578f0cf933e18/gmprocess/io/asdf/stream_workspace.py#L524-L530 it would yield each stream.
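A rough sketch of the getStreams/iterStreams change being proposed (greatly simplified; the real StreamWorkspace reads StationStream objects out of an ASDF/HDF5 file and its methods take more parameters than shown here):

```python
class Workspace:
    def __init__(self, tags):
        self._tags = tags  # e.g., {eventid: [stream tags]}

    def _read_stream(self, tag):
        # Stand-in for deserializing one stream from the file.
        return {"tag": tag}

    def getStreams(self, eventid):
        # Current shape: build the whole list, then return it.
        streams = []
        for tag in self._tags[eventid]:
            streams.append(self._read_stream(tag))
        return streams

    def iterStreams(self, eventid):
        # Proposed shape: yield each stream as it is read, so callers
        # can process and discard one stream at a time.
        for tag in self._tags[eventid]:
            yield self._read_stream(tag)
```

Existing call sites that only loop over the result would keep working unchanged, since `for stream in ws.iterStreams(...)` iterates the same way a list does.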

As you can see, this would be a significant refactoring, but I think it would be a good improvement, especially if this tool is used on denser collections of data.

emthompson-usgs commented 3 years ago

I think I see now, and clearly it is not useful to load all the streams if we are going to hand them off to separate processes. As a first step, I'll factor out the part where we construct/load the full list of streams. I'll have it iterate over the waveforms, and load only the stream for that waveform within the loop.
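That first step might look like the following hypothetical sketch: iterate over waveform identifiers and load, process, and save one stream per iteration instead of constructing the full list up front (the workspace methods here are stand-ins, not actual gmprocess names):

```python
def compute_waveform_metrics(workspace, eventid, waveform_ids, compute):
    for wid in waveform_ids:
        stream = workspace.load_stream(eventid, wid)  # load just this one
        result = compute(stream)                      # compute its metrics
        workspace.save_metrics(eventid, wid, result)  # persist immediately
        # stream and result go out of scope each iteration, so peak
        # memory stays bounded by a single waveform.
```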

emthompson-usgs commented 2 years ago

While I haven't done exactly what this issue requests, I think the changes in #765 get pretty close.