Closed arkottke closed 2 years ago
Sorry, I don't really follow this. Can you point me to the part of the code where you are suggesting we replace lists with iterators?
Sorry for the slow reply. Part of my delay in responding is that it is complicated. The general idea is to use iterators (also called generators) rather than lists, or if you really want to embrace dash there are dash specific containers (e.g., bag).
For example, here
https://github.com/usgs/groundmotion-processing/blob/5aa9b8778d37e78d8b7140ff2c7578f0cf933e18/gmprocess/subcommands/compute_waveform_metrics.py#L77-L85
All of the streams are loaded before the processing starts. For really well recorded events this might be like 1000? time series. I would start by changing getStreams(...)
https://github.com/usgs/groundmotion-processing/blob/5aa9b8778d37e78d8b7140ff2c7578f0cf933e18/gmprocess/io/asdf/stream_workspace.py#L409
to be named iterStreams(...)
and instead of returning a list:
https://github.com/usgs/groundmotion-processing/blob/5aa9b8778d37e78d8b7140ff2c7578f0cf933e18/gmprocess/io/asdf/stream_workspace.py#L524-L530
it would yield
each stream.
As you can see this would be a significant refactoring, but I think it would be a good improvement especially if this tool is use on denser collections of data.
I think I see now, and clearly it is not useful to load all the streams if we are going to hand them off to separate processes. As a first step, I'll factor out the part where we construct/load the full list of streams. I'll have it iterate over the waveforms, and load only the stream for that waveform within the loop.
While I haven't done exactly what this issue requests, I think the changes in #765 get pretty close.
Is your feature request related to a problem? Please describe. During the metric processing, lists are used to load all of the waveforms and store all of the processed results. For some events, storing all of the data requires a lot of RAM. The maximum that I saw was about 3 GB, but would increase for metrics with more points. (3 GB isn't very much, but that's a lot per process/core).
Describe the solution you'd like Use iterators to loop over the wave forms and save the results as they are computed.
Describe alternatives you've considered If
dask
is more fully embraced this would also be resolved, but this would require a more extensive effort.