Multiprocessing crash - Githubissues

larnoldgithub commented 1 year ago

apero_processing.py run_PPonly.ini --test=True crashed as shown in the screenshot below (third one).

I've seen unexpected behavior from the DRS in the [PROC] output displayed in the terminal: it went up to Run 93880, stopped for a few minutes, restarted from a RunID 'close to 1' up to 93881, stopped again, restarted from the beginning, and so on up to 93885 (approximately the seventh loop), and finally crashed.

I'm running apikipiha with 27 cores.

I saw this 'looping' behavior with the minidataset too: instead of PROC going straight up to RunID 1150 (I don't remember the exact value, but close to 1150), it first went to 1144, then restarted from the beginning up to 1145, etc. In that case it didn't crash and did the processing as expected.

larnoldgithub commented 1 year ago

from the terminal: [...]

22:11:23.021- |PROC| --shortname=PP --parallel=True]
22:11:23.078- |PROC| Run ID88432 validated [apero_preprocess_spirou.py 19BQ10-Nov07 2455748o.fits
22:11:23.078- |PROC| --crunfile=run_PPonly.ini --program=PP[88432] --recipe_kind=pre-all
22:11:23.078- |PROC| --shortname=PP --parallel=True]
22:11:23.178- |PROC| Run ID88459 validated [apero_preprocess_spirou.py 19BQ10-Nov06 2455498o.fits
22:11:23.178- |PROC| --crunfile=run_PPonly.ini --program=PP[88459] --recipe_kind=pre-all
22:11:23.178- |PROC| --shortname=PP --parallel=True]
22:11:23.273- |PROC| Run ID88486 validated [apero_preprocess_spirou.py 19BQ10-Nov04 2455149o.fits
22:11:23.273- |PROC| --crunfile=run_PPonly.ini --program=PP[88486] --recipe_kind=pre-all
22:11:23.273- |PROC| --shortname=PP --parallel=True]
22:11:23.580- |PROC| Run ID88513 validated [apero_preprocess_spirou.py 19BQ10-Nov03 2455012o.fits
22:11:23.580- |PROC| --crunfile=run_PPonly.ini --program=PP[88513] --recipe_kind=pre-all
22:11:23.580- |PROC| --shortname=PP --parallel=True]
22:11:24.735- |PROC| Run ID88540 validated [apero_preprocess_spirou.py 19BQ10-Nov03 2454985o.fits
22:11:24.735- |PROC| --crunfile=run_PPonly.ini --program=PP[88540] --recipe_kind=pre-all
22:11:24.735- |PROC| --shortname=PP --parallel=True]
Process Process-16:
Traceback (most recent call last):
  File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/apero/apero-drs/apero/tools/module/processing/drs_processing.py", line 1473, in _multi_generate_id
    return_dict[key] = results[key]
  File "<string>", line 2, in __setitem__
  File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/managers.py", line 810, in _callmethod
    kind, result = conn.recv()
  File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
22:11:30.747-!!|PROC| E[01-010-00001]: Unhandled error has occurred: Error <class 'ConnectionRefusedError'>
22:11:30.750-!!|PROC|
22:11:30.750-!!|PROC| Traceback (most recent call last):
22:11:30.750-!!|PROC|   File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/managers.py", line 802, in _callmethod
22:11:30.750-!!|PROC|     conn = self._tls.connection
22:11:30.751-!!|PROC| AttributeError: 'ForkAwareLocal' object has no attribute 'connection'
22:11:30.751-!!|PROC|
22:11:30.751-!!|PROC| During handling of the above exception, another exception occurred:
22:11:30.751-!!|PROC|
22:11:30.751-!!|PROC| Traceback (most recent call last):
22:11:30.752-!!|PROC|   File "/apero/apero-drs/apero/core/utils/drs_startup.py", line 432, in run
22:11:30.752-!!|PROC|     llmain = func(recipe, params)
22:11:30.752-!!|PROC|   File "/apero/apero-drs/tools/bin/apero_processing.py", line 208, in main
22:11:30.752-!!|PROC|     raise e
22:11:30.752-!!|PROC|   File "/apero/apero-drs/tools/bin/apero_processing.py", line 166, in main
22:11:30.753-!!|PROC|     rlist = drs_processing.generate_run_list(params, findexdbm, runtable,
22:11:30.753-!!|PROC|   File "/apero/apero-drs/apero/tools/module/processing/drs_processing.py", line 894, in generate_run_list
22:11:30.753-!!|PROC|     return generate_ids(params, findexdbm, runtable, skiptable, rlist)
22:11:30.753-!!|PROC|   File "/apero/apero-drs/apero/tools/module/processing/drs_processing.py", line 1354, in generate_ids
22:11:30.753-!!|PROC|     results = _multi_process_gen_ids_process(params, groups[group],
22:11:30.754-!!|PROC|   File "/apero/apero-drs/apero/tools/module/processing/drs_processing.py", line 1639, in _multi_process_gen_ids_process
22:11:30.754-!!|PROC|     for key in return_dict.keys():
22:11:30.754-!!|PROC|   File "<string>", line 2, in keys
22:11:30.754-!!|PROC|   File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/managers.py", line 806, in _callmethod
22:11:30.754-!!|PROC|     self._connect()
22:11:30.755-!!|PROC|   File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/managers.py", line 793, in _connect
22:11:30.755-!!|PROC|     conn = self._Client(self._token.address, authkey=self._authkey)
22:11:30.755-!!|PROC|   File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/connection.py", line 502, in Client
22:11:30.755-!!|PROC|     c = SocketClient(address)
22:11:30.755-!!|PROC|   File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/connection.py", line 630, in SocketClient
22:11:30.756-!!|PROC|     s.connect(address)
22:11:30.756-!!|PROC| ConnectionRefusedError: [Errno 111] Connection refused
22:11:30.756-!!|PROC|
22:11:31.142-|PROC| *
22:11:31.174-@!|PROC| W[40-003-00005]: Recipe apero_processing has NOT been successfully completed
22:11:31.231-*|PROC|
offline Mon Aug 21 12:13:21 spdrs@apikipiha: /data/spirou4/apero-data/offline/runs

njcuk9999 commented 1 year ago

So it's not a "bug" that the ID numbers are processed in the "wrong order". It is parallelised. Each recipe is split into N groups. For you N = 25, i.e. if you have 95000 raw files you get 95000 preprocessing recipes to run. These are split into groups with IDs as follows:

group1: 0 to 95000/25
group2: 95000/25 to 2 * 95000/25
...
group25: 24 * 95000/25 to 25 * 95000/25

It processes the first one of each group, then the second, and so on, so it isn't weird at all to see the numbers jump around. It's completely safe, as nothing that can't be run together is parallelised (i.e. we only parallelise runs of the same recipe).
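To make the ordering concrete, here is a minimal sketch of the grouping described above. It is not APERO's actual code: the function names `split_into_groups` and `processing_order` are hypothetical, and the contiguous split with ceiling division is an assumption based on the group list above.

```python
def split_into_groups(n_runs, n_groups):
    """Split run IDs 0..n_runs-1 into at most n_groups contiguous blocks."""
    size = -(-n_runs // n_groups)  # ceiling division (assumed block size)
    return [list(range(start, min(start + size, n_runs)))
            for start in range(0, n_runs, size)]


def processing_order(groups):
    """Take the first ID of every group, then the second, and so on."""
    order = []
    longest = max(len(group) for group in groups)
    for position in range(longest):
        for group in groups:
            if position < len(group):
                order.append(group[position])
    return order


groups = split_into_groups(n_runs=12, n_groups=3)
print(groups)                    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(processing_order(groups))  # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```

With 3 groups of 4, the IDs appear as 0, 4, 8, 1, 5, 9, ...: every ID is handled exactly once, just not in ascending order.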

As for the crash, it looks like a MySQL problem. You either need to reduce the number of cores you are using, raise the default number of connections, or run the validation in "linear" mode instead of in parallel (it will be N times slower but it won't crash as easily). Note this is only for the validation: the actual recipes can still run as normal.

You can turn off this parallelisation with the following lines added to user_constants.ini:

# Define whether to use multiprocess "pool" or "process" or use "linear"
#     mode when validating recipes
REPROCESS_MP_TYPE_VAL = REPROCESS_MP_TYPE_VAL.copy(__NAME__)
REPROCESS_MP_TYPE_VAL.value = 'linear'

Do not confuse this with REPROCESS_MP_TYPE, which, if changed, will stop the recipes themselves from running in parallel.
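As a rough illustration of what switching the validation from "process" to "linear" means, here is a minimal sketch. It is not APERO's implementation: `validate`, `_worker` and `validate_all` are hypothetical names, and the shared manager dictionary is only an assumption suggested by the `return_dict[key] = results[key]` line in the traceback earlier in this thread.

```python
import multiprocessing as mp


def validate(run_id):
    """Hypothetical stand-in for validating a single run."""
    return f"run {run_id} validated"


def _worker(run_ids, return_dict):
    # Each worker writes its results into the shared manager dict.
    for run_id in run_ids:
        return_dict[run_id] = validate(run_id)


def validate_all(run_ids, mode="linear", cores=2):
    """Validate runs one by one ("linear") or across processes ("process")."""
    if mode == "linear":
        # No extra processes and no manager: slower, but fewer moving parts.
        return {run_id: validate(run_id) for run_id in run_ids}
    if mode != "process":
        raise ValueError(f"unknown mode: {mode!r}")
    with mp.Manager() as manager:
        return_dict = manager.dict()
        # Simple striding split, chosen for brevity in this sketch.
        chunks = [run_ids[i::cores] for i in range(cores)]
        jobs = [mp.Process(target=_worker, args=(chunk, return_dict))
                for chunk in chunks]
        for job in jobs:
            job.start()
        for job in jobs:
            job.join()
        # Copy out of the proxy before the manager shuts down.
        return dict(return_dict)


if __name__ == "__main__":
    ids = list(range(6))
    # Both modes should produce the same mapping of run ID to result.
    assert validate_all(ids, "linear") == validate_all(ids, "process")
```

In "process" mode every worker talks to a separate manager process; if that process goes away, the workers can no longer reach the shared dictionary. The "linear" branch has no such dependency, at the cost of running on a single core.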

larnoldgithub commented 1 year ago

Thanks Neil, I'm trying with 20 cores now. I didn't change the user_constants.ini file for now.

larnoldgithub commented 1 year ago

It crashed also with CORES = 20.

Here is a graph showing the memory usage of today's 2 tests; max memory is reached.

[Image: PROC-before-crash-memory-usage, screenshot 2023-08-21 at 15:21:45]

larnoldgithub commented 1 year ago

I have now modified the user_constants.ini file and set CORES back to 27.

Neil, if I add the lines you give above, source /apero/config/offline/offline.bash.setup crashes, so I just changed REPROCESS_MP_TYPE_VAL = process to REPROCESS_MP_TYPE_VAL = linear.

This works: the PROC step does look much slower, but it seems to work, with no significant impact on RAM.

njcuk9999 commented 1 year ago

Sorry, yes: if the constant is already in there, you just need to change it, not add it.

njcuk9999 / apero-drs

Multiprocessing crash #711