ratt-ru / CubiCal

A fast radio interferometric calibration suite.
GNU General Public License v2.0
18 stars 13 forks source link

Cryptic BrokenProcessPool exception #441

Closed SpheMakh closed 3 years ago

SpheMakh commented 3 years ago

I'm not sure how to describe this, except for saying it happened. Has anyone seen it before

# INFO      23:09:58 - main               [io] [6.9/7.1 9.7/12.9 40.4Gb] loading tile 1/4
# INFO      23:09:58 - ms_tile            [io] [6.9/7.1 9.7/12.9 40.4Gb] tile 1/4: reading MS rows 109800~204959
# INFO      23:09:58 - main               [0.2/7.1 3.1/12.9 40.4Gb] submitting solver jobs for tile 0/4
# INFO      23:09:58 - data_handler       [io] [6.9/7.1 9.7/12.9 40.4Gb] reading DATA
# INFO      23:11:43 - data_handler       [io] [30.1/111.8 35.9/135.4 39.4Gb] reading BITFLAG
# INFO      23:13:45 - ms_tile            [io] [43.2/134.4 46.0/153.5 32.6Gb]   74.30% input visibilities flagged and/or deselected
# INFO      23:17:25 - solver             [x04] [21.9/89.3 25.0/105.7 41.8Gb] D0T0F3 has 2244 (0.00%) new data flags: MadMax took out 2244 visibilities (0 in final round)
# INFO      23:18:00 - solver             [x03] [21.9/73.2 25.0/104.5 44.6Gb] D0T0F2 has 2902 (0.00%) new data flags: MadMax took out 2902 visibilities (0 in final round)
# INFO      23:18:29 - solver             [x01] [21.9/75.3 25.0/141.6 47.7Gb] D0T0F0 has 396 (0.00%) new data flags: MadMax took out 396 visibilities (0 in final round)
# INFO      23:19:21 - solver             [x02] [21.9/76.4 25.1/119.7 51.9Gb] D0T0F1 has 33056 (0.03%) new data flags: MadMax took out 33056 visibilities (0 in final round)
# INFO      23:28:13 - solver             [x04] [22.0/94.0 25.1/109.5 55.7Gb] D0T0F4 has 478680 (0.43%) new data flags: MadMax took out 478680 visibilities (0 in final round)
# INFO      23:28:41 - solver             [x03] [22.0/69.7 25.1/85.3 57.4Gb] D0T0F5 has 476000 (0.43%) new data flags: MadMax took out 476000 visibilities (0 in final round)
# INFO      23:31:41 - solver             [x01] [22.0/53.9 25.1/69.7 59.2Gb] D0T0F6 has 152152 (0.14%) new data flags: MadMax took out 152152 visibilities (0 in final round)
# INFO      23:34:48 - solver             [x02] [22.0/32.7 25.1/48.2 61.0Gb] D0T0F7 has 293768 (0.27%) new data flags: MadMax took out 293768 visibilities (0 in final round)
# INFO      23:34:54 - main               [0.1/14.2 3.3/29.7 62.7Gb] finished processing tile 0/4
# INFO      23:34:54 - main               [0.1/14.2 3.3/29.7 62.7Gb] waiting for I/O on tile 1/4
# INFO      23:34:55 - main               [0.1/14.2 3.3/29.7 62.7Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
#  Traceback (most recent call last):
#   File "/usr/local/lib/python3.6/dist-packages/cubical/main.py", line 578, in main
#     stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
#   File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 226, in run_process_loop
#     return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
#   File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 286, in _run_multi_process_loop
#     if not io_futures[itile].result():
#   File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
#     return self.__get_result()
#   File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
#     raise self._exception
# concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
o-smirnov commented 3 years ago

Any errors above that? (Maybe put the full log on pastebin?)

BrokenProcessPool is just fancy speak for a subprocess (probably the I/O one, looking at the context) segfaulting or running out of memory (how much you got?) or otherwise blowing up. It may be cryptic, but it's already an improvement on just hanging there (see e.g. https://github.com/ratt-ru/CubiCal/issues/439), which older versions of concurrent.futures did. We should add an exception handler to print a prettier message though.

SpheMakh commented 3 years ago

I had memory on the previous run, then tweaked a few but got this instead. The machine has 188Gb total memory

Full log is here https://pastebin.com/Mhiu6M9s

SpheMakh commented 3 years ago

It goes away after I tweaked some of the dist-<foo> options. Indeed, a prettier error message telling the user to adjust dist-<foo> options will do

o-smirnov commented 3 years ago

I'll work on a better error as part of https://github.com/ratt-ru/CubiCal/issues/439. It's a bit tricky to catch when subprocesses just die on you with no warning, as they do for out-of-memory...

SpheMakh commented 3 years ago

Should I close this?

o-smirnov commented 3 years ago

Yeah it's a duplicate of #439 effectively.