Closed: larnoldgithub closed this issue 1 year ago
from the terminal: [...]
22:11:23.021- |PROC| --shortname=PP --parallel=True]
22:11:23.078- |PROC| Run ID88432 validated [apero_preprocess_spirou.py 19BQ10-Nov07 2455748o.fits
22:11:23.078- |PROC| --crunfile=run_PPonly.ini --program=PP[88432] --recipe_kind=pre-all
22:11:23.078- |PROC| --shortname=PP --parallel=True]
22:11:23.178- |PROC| Run ID88459 validated [apero_preprocess_spirou.py 19BQ10-Nov06 2455498o.fits
22:11:23.178- |PROC| --crunfile=run_PPonly.ini --program=PP[88459] --recipe_kind=pre-all
22:11:23.178- |PROC| --shortname=PP --parallel=True]
22:11:23.273- |PROC| Run ID88486 validated [apero_preprocess_spirou.py 19BQ10-Nov04 2455149o.fits
22:11:23.273- |PROC| --crunfile=run_PPonly.ini --program=PP[88486] --recipe_kind=pre-all
22:11:23.273- |PROC| --shortname=PP --parallel=True]
22:11:23.580- |PROC| Run ID88513 validated [apero_preprocess_spirou.py 19BQ10-Nov03 2455012o.fits
22:11:23.580- |PROC| --crunfile=run_PPonly.ini --program=PP[88513] --recipe_kind=pre-all
22:11:23.580- |PROC| --shortname=PP --parallel=True]
22:11:24.735- |PROC| Run ID88540 validated [apero_preprocess_spirou.py 19BQ10-Nov03 2454985o.fits
22:11:24.735- |PROC| --crunfile=run_PPonly.ini --program=PP[88540] --recipe_kind=pre-all
22:11:24.735- |PROC| --shortname=PP --parallel=True]
Process Process-16:
Traceback (most recent call last):
File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/conda/miniconda3/envs/apero-env/lib/python3.9/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/apero/apero-drs/apero/tools/module/processing/drs_processing.py", line 1473, in _multi_generate_id
return_dict[key] = results[key]
File "
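For context on the traceback: the failing line (`return_dict[key] = results[key]`) is copying per-process results into a shared dictionary. A minimal sketch of that common multiprocessing pattern follows; all names here are hypothetical illustrations, not APERO's actual code:

```python
# Sketch of collecting worker results in a shared dict via
# multiprocessing.Manager (hypothetical names, not APERO's code).
from multiprocessing import Manager, Process


def _worker(return_dict, key):
    # Each process writes its result under its own key.
    return_dict[key] = f"validated-{key}"


def collect_results(n_workers):
    """Run n_workers processes and gather their results in a shared dict."""
    with Manager() as manager:
        return_dict = manager.dict()
        procs = [Process(target=_worker, args=(return_dict, k))
                 for k in range(n_workers)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Copy out before the manager shuts down.
        return dict(return_dict)


if __name__ == "__main__":
    print(collect_results(3))
```

Each worker process needs its own database connection in a setup like APERO's, which is why the number of parallel processes interacts with the MySQL connection limit discussed below.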
So it's not a "bug" that the ID numbers are processed in the "wrong order": it is parallelised. Each recipe is split into N groups; for you, N = 25. I.e. if you have 95000 raw files you get 95000 preprocessing recipes to run. These are split into groups with ids as follows:
It processes the first one of each group, then the second, so it isn't weird at all to see the numbers jump around. It's completely safe, as nothing that can't be run together is parallelised (i.e. we only parallelise runs of the same recipe).
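The grouping and ordering described above can be sketched as follows. This is an illustration of the idea, not APERO's actual implementation, and the group count and IDs are made up:

```python
# Illustrative sketch (not APERO's code): split run IDs into N groups
# and visit them round-robin -- first of each group, then second, etc. --
# which is why consecutive log lines show IDs that "jump" around.
from itertools import zip_longest


def round_robin_order(run_ids, n_groups):
    """Yield run IDs in the order a round-robin scheduler would visit them."""
    # Split the IDs into n_groups roughly equal contiguous chunks.
    size = -(-len(run_ids) // n_groups)  # ceiling division
    groups = [run_ids[i * size:(i + 1) * size] for i in range(n_groups)]
    # Take the first item of every group, then the second, and so on.
    for batch in zip_longest(*groups):
        yield from (rid for rid in batch if rid is not None)


# 12 runs in 4 groups: visited as 1, 4, 7, 10, 2, 5, 8, 11, 3, 6, 9, 12
print(list(round_robin_order(list(range(1, 13)), 4)))
```

In the real pipeline the groups are processed by parallel workers, so the logged order also depends on which worker finishes first.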
As for the crash, it looks like a MySQL problem. You either need to reduce the number of cores you are using, raise the default number of connections, or run the validation in "linear" mode instead of in parallel (it will be N times slower, but it won't crash as easily). Note this is only for the validation; the actual recipes can still run as normal.
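On the "raise the number of default connections" option: if the crash really is MySQL running out of connections, the relevant server-side setting is the standard MySQL `max_connections` variable (default 151). A sketch of the change in the server config (my.cnf / mysqld.cnf; the value 500 is only an example and should be sized to cores times connections per worker):

```ini
[mysqld]
# Raise the connection cap so parallel validation workers don't exhaust it.
max_connections = 500
```

This requires restarting the MySQL server (or, with sufficient privileges, `SET GLOBAL max_connections = 500;` for the running instance).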
You can turn off this parallelisation with the following lines added to user_constants.ini:
# Define whether to use multiprocess "pool" or "process" or use "linear"
# mode when validating recipes
REPROCESS_MP_TYPE_VAL = REPROCESS_MP_TYPE_VAL.copy(__NAME__)
REPROCESS_MP_TYPE_VAL.value = 'linear'
Do not confuse this with REPROCESS_MP_TYPE, which, if changed, will stop the recipes themselves from running in parallel.
Thanks Neil, I'm trying with 20 cores now. I didn't change the user_constants.ini file for now.
It crashed also with CORES = 20.
Here is a graph showing the memory usage of today's 2 tests; max memory is reached.
I have now modified the user_constants.ini file and set CORES back to 27.
Neil, if I add the lines you gave above, source /apero/config/offline/offline.bash.setup
crashes, so I just changed
REPROCESS_MP_TYPE_VAL = process
to
REPROCESS_MP_TYPE_VAL = linear
This works: the PROC step is indeed much slower, but it runs, with no significant impact on RAM.
Sorry, yes: if the constant is already in there you just need to change it, not add it.
apero_processing.py run_PPonly.ini --test=True crashed as shown in the screenshot below (third one).
I've seen unexpected behavior of the DRS looking at the |PROC| data displayed in the terminal: it went up to Run 93880, then stopped for a few minutes, then restarted from a RunID close to 1 up to 93881, then stopped again, restarted from the beginning, etc., up to 93885 (approximately the seventh loop), and finally crashed.
I'm running apikipiha with 27 cores.
I saw this 'looping' behavior with the minidataset too: instead of having PROC go up to RunID 1150 (I don't remember the exact value, but close to 1150), it went first to 1144, then restarted from the beginning up to 1145, etc., but it eventually didn't crash and did the processing as expected.