pavlis opened this issue 2 years ago
Update. When I ran the second version of this workflow (the one calling only read_common_source_gather) on a small subset of the data, it ran for a long time and then died with a similar "killed worker" exception in the notebook. However, the dask log file now shows a ton of warnings like the following:
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 7.33 GB -- Worker memory limit: 10.45 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 7.33 GB -- Worker memory limit: 10.45 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 7.34 GB -- Worker memory limit: 10.45 GB
distributed.worker - WARNING - Memory use is high but worker has no data to store to disk. Perhaps some other process is leaking memory? Process memory: 7.35 GB -- Worker memory limit: 10.45 GB
This shows the serialization issue with cursors is something I certainly don't understand. It also shows the memory problem in dealing with large ensembles. I can now calculate what the issue is. There are about 3500 channels in the ensembles being read. Each is roughly an hour long (3600 s) at a nominal sample rate of 40 sps. We store all sample data as doubles, so the size per ensemble is 3500 × 3600 × 40 × 8 bytes ≈ 4 GB. The warning log suggests we have a memory copy problem: a handful of copies of a 4 GB ensemble (of order 24 GB) far exceeds the posted process memory of 7.34 GB and the 10.45 GB worker limit.
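For reference, that back-of-the-envelope estimate is easy to script (the numbers are the ones quoted above):

```python
# Back-of-the-envelope size of one common-source gather using the numbers above
nchan = 3500          # channels per ensemble
duration_s = 3600.0   # about one hour of data per channel
sps = 40.0            # nominal sample rate in samples/s
bytes_per_sample = 8  # sample data are stored as doubles

ensemble_gb = nchan * duration_s * sps * bytes_per_sample / 1e9
print(f"approximate ensemble size: {ensemble_gb:.2f} GB")   # ~4.03 GB
```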
There is a simple workaround for this particular workflow: run an atomic demean, filter, window, and save workflow on this dataset, then run bundle on the wf_TimeSeries data. That way I shouldn't have a memory problem.
So, I think it is clear we have some things we need to deal with here:
The read_common_source_gather example simply combines two steps to avoid using a cursor as the intermediate object. Foldby will not solve this issue; only the last workaround above can be simplified by foldby. So, I think if we want to support a workflow that starts with ensembles, we should support an optional query argument in read_ensemble_data (see the sketch below). read_ensemble_data is pretty much the "variant of the read_common_source_gather function".
I think we have two problems we need to deal with for reading large ensembles in parallel. This should perhaps be split into two issues pages, but they are closely enough related that I elected to put them together. The two issues are:

1. MongoDB cursors do not pass cleanly through the dask scheduler, so a parallel construct that uses a cursor as an intermediate object dies with "killed worker" errors.
2. Memory management when ensembles are large compared to the per-worker memory limit.
For item 1, the evidence is this. Running the test we discussed on the huge lower 48 broadband array from 2012, I used this parallel construct:
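In outline the construct has this form (an illustrative sketch, not the exact notebook code; the database name, query keys, and commented-out processing calls shown are assumptions):

```python
import dask.bag as bag
from mspasspy.db.client import DBClient
from mspasspy.db.database import Database

db = Database(DBClient(), "usarray2012")   # database name is hypothetical

# One common-source gather per source document
source_ids = [doc["_id"] for doc in db.source.find({})]

mybag = bag.from_sequence(source_ids)
# Step 1: only the MongoDB find; each bag element is now a pymongo cursor
mybag = mybag.map(lambda srcid: db.wf_miniseed.find({"source_id": srcid}))
# Real processing commented out to isolate the reader problem
# mybag = mybag.map(db.read_ensemble_data)
# mybag = mybag.map(window_ensemble, stime, etime)
# mybag = mybag.map(bundle_seed_data)
result = mybag.compute()
```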
where I commented out the lines that do real processing to limit the workflow to only the call to MongoDB find. (A number of things here are defined earlier in the notebook.) Note that I know the syntax is right and the data are intact, because a serial test on a few ensembles runs. When I run the above I get this error:
Note this leaves this message in the dask log:
The memory problem surfaces when I use this variant, which came from the examples in the parallel processing section of our user's manual:
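That variant follows the manual's pattern of doing the query and the read inside one mapped function, roughly like this (again a sketch; the database name and the read_ensemble_data call signature are assumptions):

```python
import dask.bag as bag
from mspasspy.db.client import DBClient
from mspasspy.db.database import Database

db = Database(DBClient(), "usarray2012")   # database name is hypothetical

def read_common_source_gather(db, collection, srcid):
    # Combine the query and the read in one step so nothing but the source id
    # (a plain, picklable value) crosses the scheduler/worker boundary
    cursor = db[collection].find({"source_id": srcid})
    return db.read_ensemble_data(cursor, collection=collection)  # signature assumed

source_ids = [doc["_id"] for doc in db.source.find({})][0:3]   # small test subset
mybag = bag.from_sequence(source_ids)
mybag = mybag.map(lambda srcid: read_common_source_gather(db, "wf_miniseed", srcid))
# mybag = mybag.map(window_ensemble, stime, etime)
# mybag = mybag.map(bundle_seed_data)
result = mybag.compute()
```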
That job with the processing steps commented out seems to run ok. It has been running for about 30 minutes without problems anyway, so I think it is ok. I am going to kill the job and restore the 0:3 limits to see if it will actually finish. I'll do an update to this page when that finishes.
The memory problem surfaced when I ran this workflow with the map calls to window_ensemble and bundle_seed_data enabled (note they are commented out above). Note window_ensemble is a custom function that is the following:
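Its exact body is not important for the memory argument; in outline it does something like this (a sketch assuming MsPASS's WindowData and a window given as absolute start and end times):

```python
from mspasspy.algorithms.window import WindowData

def window_ensemble(ensemble, stime, etime):
    """Cut every live member of the ensemble down to the window [stime, etime].

    Reducing ~4000 s records to a few hundred seconds is what shrinks the
    gather by roughly a factor of 10.
    """
    for i in range(len(ensemble.member)):
        if ensemble.member[i].live:
            ensemble.member[i] = WindowData(ensemble.member[i], stime, etime)
    return ensemble
```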
So that function reduces the ensemble size by about a factor of 10, as the original data files I'm working with are about 4000 s long. I don't know at this point which of these functions is causing the memory fault, but that is actually beside the point. This experience shows we have a serious issue with how MsPASS handles memory, and our users will run into this kind of mysterious error. At a minimum I think we need to develop some simple command line tools that compute and report ensemble sizes, which would help users know whether they need to worry about memory usage. To do this right, such a tool would need a way to know how many workers the scheduler would spawn. Is there a way to ask dask and spark how many workers the scheduler would use? If so, such a tool would not be hard to write (a rough sketch follows). The rest I could do easily, but I wouldn't know how to get that number (the number of workers).
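For dask, at least, the scheduler can be asked directly through the distributed Client. A rough sketch of the kind of helper I have in mind (the function name and report format are just placeholders):

```python
from dask.distributed import Client

def report_ensemble_memory(nchan, duration_s, sps, scheduler_address=None):
    """Compare an estimated ensemble size with what the dask scheduler reports
    about its workers.  Samples are assumed to be 8-byte doubles."""
    ensemble_gb = nchan * duration_s * sps * 8 / 1e9
    # Connect to a running scheduler if an address is given; otherwise this
    # starts a throwaway local cluster just to demonstrate the call.
    client = Client(scheduler_address) if scheduler_address else Client()
    workers = client.scheduler_info()["workers"]
    print(f"estimated ensemble size: {ensemble_gb:.2f} GB")
    print(f"scheduler reports {len(workers)} workers")
    for addr, info in workers.items():
        print(f"  {addr}: memory_limit {info['memory_limit'] / 1e9:.2f} GB")
    return ensemble_gb, len(workers)

# For spark the closest equivalents I know of are SparkContext.defaultParallelism
# and the spark.executor.instances configuration value, both only rough proxies.
```

Called with the numbers above, report_ensemble_memory(3500, 3600, 40) would print the ~4 GB per-ensemble figure next to each worker's memory limit, which is exactly the comparison a user needs to decide whether ensembles will fit.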