ratt-ru / CubiCal

A fast radio interferometric calibration suite.
GNU General Public License v2.0

bomb out with lots of complaints if I/O worker dies #439

Open o-smirnov opened 3 years ago

o-smirnov commented 3 years ago

If the I/O worker dies, this is a little hard for the end user to diagnose, as the solver workers carry on and fill up the log with messages. The error message is then buried somewhere mid-log and the whole process hangs waiting on I/O, instead of exiting with an error.

Surely a subprocess error is catchable at the main process level. https://github.com/ratt-ru/CubiCal/issues/319 is related.
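Not CubiCal's actual code, but a minimal sketch of the fail-fast behaviour being asked for, assuming a concurrent.futures-based driver loop like the one in workers.py; run_chunks and chunk_jobs are hypothetical names:

```python
import sys
from concurrent.futures import FIRST_EXCEPTION, wait
from concurrent.futures.process import BrokenProcessPool

def run_chunks(pool, chunk_jobs):
    # chunk_jobs: hypothetical list of (callable, args) pairs for the workers.
    futures = [pool.submit(fn, *args) for fn, args in chunk_jobs]
    # Return as soon as any future fails, rather than waiting for all of them.
    done, not_done = wait(futures, return_when=FIRST_EXCEPTION)
    for fut in done:
        exc = fut.exception()
        if isinstance(exc, BrokenProcessPool):
            # A worker (e.g. the I/O worker) was killed: cancel what we can
            # and bomb out with an unmissable error instead of hanging.
            for pending in not_done:
                pending.cancel()
            sys.exit("FATAL: a worker process died: {}".format(exc))
    return [fut.result() for fut in futures]
```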

Mulan-94 commented 3 years ago

@o-smirnov Any fix/workaround for this yet? It's gotten me twice this weekend. I tried reducing --dist-ncpu and --dist-min-chunks from 7 to 4, to no avail.

INFO      19:42:07 - main               [4.0/85.0 18.2/131.8 247.6Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
 Traceback (most recent call last):
  File "/home/CubiCal/cubical/main.py", line 582, in main
    stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
  File "/home/CubiCal/cubical/workers.py", line 226, in run_process_loop
    return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
  File "/home/CubiCal/cubical/workers.py", line 312, in _run_multi_process_loop
    stats = future.result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
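For what it's worth, this is standard concurrent.futures behaviour rather than anything CubiCal-specific: if any process in a ProcessPoolExecutor is killed abruptly (which is what the OOM killer does), every outstanding future fails with exactly this BrokenProcessPool. A small standalone demonstration (POSIX only; the worker names are made up):

```python
import os
import signal
import time
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def killed_worker():
    # Simulate the OOM killer terminating a worker process abruptly.
    os.kill(os.getpid(), signal.SIGKILL)

def slow_worker():
    time.sleep(5)
    return "done"

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        pool.submit(killed_worker)
        survivor = pool.submit(slow_worker)
        try:
            survivor.result()
        except BrokenProcessPool as exc:
            print("Exiting with exception:", exc)
```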
bennahugo commented 3 years ago

Decrease the chunk size and set --dist-max-chunks instead of --dist-min-chunks. It will run in serial but it will reduce the footprint.

kendak333 commented 2 years ago

I'm running into this BrokenProcessPool error with an oom-kill notice at the end of the log file. I take that to mean the system thinks I'll run out of RAM at some point, so it kills the job. What I can't understand is that earlier in the log, when it calculates all the memory requirements, it says my maximum memory requirement will be ~57 GB. The system I'm running on has at most 62 GB available, so I don't know why things are being killed.

I'm using --data-freq-chunk=256 (reduced down from 1024), --data-time-chunk=36, --dist-max-chunks=2, and ncpus=20 (the maximum available on the node). What other memory-related knobs can I twiddle to try to solve this? It's only 2 hours of data, and I'm running into the same issue with even smaller MSs as well.

JSKenyon commented 2 years ago

The memory estimation is just that - a guess based on some empirical experiments I did. So take it with a pinch of salt. If it is an option, I would really suggest taking a look at QuartiCal. It is much less memory hungry, and has fewer knobs to boot. I am only too happy to help you out on that front.

That said, could you please post your log and config? That will help identify what is going wrong.

kendak333 commented 2 years ago

@JSKenyon I'm running it as part of oxkat, so I guess we can have a chat about incorporating QuartiCal on an ad hoc basis. I'll take a look at it. But for now, here's the log and the parset: CL2GC912_cubical.zip

and the command run was gocubical /data/knowles/mkatot/reruns/data/cubical/2GC_delaycal.parset --data-ms=1563148862_sdp_l0_1024ch_J0046.4-3912.ms --out-dir /data/knowles/mkatot/reruns/GAINTABLES/delaycal_J0046.4-3912_2022-03-01-10-17-13.cc/ --out-name delaycal_J0046.4-3912_2022-03-01-10-17-13 --k-save-to delaycal_J0046.4-3912.parmdb --data-freq-chunk=256

JSKenyon commented 2 years ago

OK, in this instance I suspect it is simply that the memory footprint is underestimated. I think the easiest solution is to set --dist-ncpu=3. Simply put, the memory footprint of each worker is just too large to use all the cores (or even 5 + 1 for I/O, as in the log you sent). This is unfortunate and will make things slower. On a positive note, hopefully people will start onboarding QuartiCal, which does much better in this regard. Apologies for not having a better solution for you.
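To spell out the arithmetic (the per-worker figures below are assumptions for illustration, not CubiCal's actual estimator): the solver workers' combined footprint plus the I/O worker's footprint has to fit in the node's RAM, which caps how many cores --dist-ncpu can usefully use.

```python
# Back-of-the-envelope check; per-worker footprints are hypothetical.
import math

total_ram_gb = 62      # RAM on the node, as reported above
per_solver_gb = 19     # assumed footprint of one solver worker
io_worker_gb = 5       # assumed footprint of the I/O worker

max_solvers = math.floor((total_ram_gb - io_worker_gb) / per_solver_gb)
print("at most", max_solvers, "solver workers + 1 I/O worker")  # -> 3
```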

kendak333 commented 2 years ago

Ok thanks, I'll give that a go.

IanHeywood commented 2 years ago

The oxkat defaults are tuned so that they work on standard worker nodes at IDIA and CHPC for standard MeerKAT continuum processing (assuming 1024-channel data). The settings should actually leave a fair bit of overhead to account for things like differing numbers of antennas, and the Slurm / PBS controllers being quite trigger-happy when jobs step out of line in terms of memory usage. But if you have a node with 64 GB of RAM then the defaults will certainly be too ambitious.

Is this running on hippo?

Also, I'm not sure whether moving from a single solution for the entire band (--data-freq-chunk=1024) to four solutions across the band (--data-freq-chunk=256) will reduce the quality of your delay solutions, particularly for those quarter-band chunks that have high RFI occupancy. You might want to check whether reverting to a 1024-channel solution gives better results. You could drop --dist-ncpu further, and/or reduce --data-time-chunk, to accommodate this. Note that the latter is 36 by default, but that encompasses 9 individual solution intervals (--k-time-int 4).
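Put as plain arithmetic (flag names and defaults as quoted above):

```python
data_time_chunk = 36   # timeslots read per chunk (--data-time-chunk)
k_time_int = 4         # timeslots per K solution interval (--k-time-int)
print(data_time_chunk // k_time_int, "solution intervals per time chunk")  # -> 9
```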

Cheers.

PS: @JSKenyon swapping to QuartiCal remains on my to-do list!