uafgeotools / waveform_collection

Collect seismic/infrasound waveforms and metadata from IRIS/WATC/AVO servers, local miniSEED files, etc.
https://uaf-waveform-collection.readthedocs.io/
MIT License

Implemented parallel option when gathering waveforms. #37

ltscamfer closed this 1 month ago

ltscamfer commented 2 months ago

I have added two arguments to `gather_waveforms` and `gather_waveforms_bulk`: `parallel` and `cores`. The first specifies whether to gather the waveforms in parallel, and the second sets how many cores to allocate to the parallelization. The parallel processing uses the multiprocess package and will parallelize by either station or time window, depending on how many cores you select and how many stations are in your request. The new functions are `parallel_gather` and `get_increment_waveforms`, the latter of which is required when calling `pool.map()`.
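For reference, calling the function with the new kwargs looks something like the sketch below (the network/station codes and time span are placeholders, not from this PR):

```python
from obspy import UTCDateTime
from waveform_collection import gather_waveforms

# Placeholder request parameters; only `parallel` and `cores` are new here
st = gather_waveforms(
    source='IRIS',
    network='AV',    # placeholder network
    station='ILSW',  # placeholder station
    location='*',
    channel='BDF',
    starttime=UTCDateTime(2019, 1, 1),
    endtime=UTCDateTime(2019, 1, 3),
    parallel=True,   # new: gather waveforms in parallel
    cores=4,         # new: number of cores to allocate
)
```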

From my testing it should be set up to run as fast as possible, parallelizing by station when the station count exceeds the number of cores requested, and using time windows otherwise. I experimented a bit with parallelizing by location code (for arrays), but it didn't seem to improve on using time windows, and I think it adds unnecessary complexity. I also added a warning: if you request less than a day of data, it will tell you that running in parallel may be slower. A sketch of the dispatch logic follows.
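Roughly, the dispatch works like this (a minimal sketch of the behavior described above, not the PR's actual implementation; `fetch_one` is a hypothetical stand-in for the real worker):

```python
# Sketch of the station-vs-time-window dispatch; illustrative only
import multiprocess  # the third-party multiprocess package, now a dependency


def fetch_one(task):
    """Hypothetical worker; the real one requests waveforms for one task."""
    station, t0, t1 = task
    return f'{station}: {t0}-{t1}'


def parallel_gather(stations, starttime, endtime, cores):
    if len(stations) >= cores:
        # More stations than cores: one task per station, full time span
        tasks = [(sta, starttime, endtime) for sta in stations]
    else:
        # Fewer stations than cores: split the time span into increments
        increment = (endtime - starttime) / cores
        tasks = [
            (sta, starttime + i * increment, starttime + (i + 1) * increment)
            for sta in stations
            for i in range(cores)
        ]
    with multiprocess.Pool(cores) as pool:
        return pool.map(fetch_one, tasks)


if __name__ == '__main__':
    # Two stations, four cores: parallelizes by time window
    print(parallel_gather(['ILSW', 'ADKI'], 0, 86400, cores=4))
```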

One small, unrelated change is on line 179 of `gather_waveforms`, which prints the full stream. An issue arises when using `gather_waveforms_bulk`: the bulk gather prints the unmerged stream, since merging and trimming are saved for the end and are not done when `gather_waveforms` is called. This can be confusing, as it can sometimes print over 100 traces when your data only contains 30 or so stations. @darren-tpk and I thought about the best way to deal with this and decided that removing the print statement altogether could be the best option, as users can always check the contents of their stream after using either function.

Also, now that multiprocess is a new dependency, I changed a few lines in files that I think(?) are related to Read the Docs. Please feel free to remove those changes, as I'm not positive they are correct!

davidfee5 commented 2 months ago

I'm getting the error `TypeError: cannot pickle 'SSLContext' object` and a bunch of other stuff. Any ideas? Looks like it is occurring during the `pool.map` call.

liamtoney commented 1 month ago

Nice! Without testing this, I have just some feedback on the kwarg structure. How about a single argument, `n_jobs`, with default `n_jobs=1`? Then you can use `if n_jobs != 1:` instead of `if parallel:`, and also simplify invocation of parallel processing for the user, since they can turn on parallelism with a single kwarg.

Related to this, you could implement something similar to `joblib.Parallel`'s setup for specifying cores:

> If -1 is given, joblib tries to use all CPUs. The number of CPUs `n_cpus` is obtained with `cpu_count()`. For `n_jobs` below -1, (`n_cpus + 1 + n_jobs`) are used. For instance, using `n_jobs=-2` will result in all CPUs but one being used.
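Translated into code, that convention could look like this (a hypothetical helper, assuming the multiprocess package supplies the core count; `resolve_n_jobs` is not from this PR):

```python
# Hypothetical helper implementing the joblib-style n_jobs convention
import multiprocess


def resolve_n_jobs(n_jobs):
    """Map a joblib-style n_jobs value to an actual worker count."""
    n_cpus = multiprocess.cpu_count()
    if n_jobs < 0:
        # -1 means all CPUs, -2 all but one, and so on
        n_jobs = n_cpus + 1 + n_jobs
    return max(n_jobs, 1)


print(resolve_n_jobs(-2))  # all CPUs but one, e.g. 7 on an 8-core machine
```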

I've found this really useful, since you can set `n_jobs=-2` for running on your local system, which is usually a pretty good call, and you don't have to remember how many cores you have. (Wouldn't want to do that on tamarack though, ha!) Just an optional idea here.

Note the change from `cores` to `n_jobs`, since I believe the latter is more common across the Python ecosystem (see, e.g., joblib and this 3rd party example).

ltscamfer commented 1 month ago

> I'm getting the error `TypeError: cannot pickle 'SSLContext' object` and a bunch of other stuff. Any ideas? Looks like it is occurring during the `pool.map` call.

I think I have an idea of what caused this: `pool.map()` doesn't like being passed the `obspy.fdsn.Client` object, so I have it re-establish a client within the function called by `pool.map()`. I'm really not sure why it was working fine on my local machine, as I hadn't seen that error before. Could you try again with the new push? 🤞
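For anyone following along, the fix is roughly this shape (a sketch, not the PR's exact code; the network/channel codes are placeholders):

```python
# Build the FDSN client inside the worker so the unpicklable SSLContext
# never has to cross the process boundary via pool.map()
from obspy.clients.fdsn import Client


def get_increment_waveforms(task):
    station, t0, t1 = task   # t0/t1 are obspy UTCDateTime objects
    client = Client('IRIS')  # re-established per call, not passed in
    return client.get_waveforms('AV', station, '*', '*', t0, t1)
```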

> Nice! Without testing this, I have just some feedback on the kwarg structure. How about a single argument, `n_jobs`, with default `n_jobs=1`? Then you can use `if n_jobs != 1:` instead of `if parallel:`, and also simplify invocation of parallel processing for the user, since they can turn on parallelism with a single kwarg.

@liamtoney, thanks for this! I agree this is a more streamlined approach, and just the single kwarg is much nicer! I think I've implemented something similar to `joblib.Parallel`'s setup.