ratt-ru / CubiCal

A fast radio interferometric calibration suite.
GNU General Public License v2.0

Another BrokenProcessPool when using a tigger LSM #444

Open SpheMakh opened 3 years ago

SpheMakh commented 3 years ago

This only happens when I use a tigger LSM, suggesting the issue may be in Montblanc. When I used MeqTrees to predict and specified a visibility model instead, the problem went away with all other parameters unchanged.

# INFO      12:35:08 - ms_tile            [io] [1.1/1.3 3.4/5.7 4.0Gb]   computing visibilities for /stimela_mount/output/meerkat-hydra-selfcal-0-skymodel.lsm.html
# INFO      12:35:25 - main               [0.2 2.3 4.0Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
#  Traceback (most recent call last):
#   File "/usr/local/lib/python3.6/dist-packages/cubical/main.py", line 578, in main
#     stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
#   File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 226, in run_process_loop
#     return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
#   File "/usr/local/lib/python3.6/dist-packages/cubical/workers.py", line 286, in _run_multi_process_loop
#     if not io_futures[itile].result():
#   File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
#     return self.__get_result()
#   File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
#     raise self._exception
# concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

The full log is at https://pastebin.com/QknXvYAF

o-smirnov commented 3 years ago

Could you re-run in serial mode (--dist-ncpu 1), on the odd chance that we get something more informative out of it before it dies?

SpheMakh commented 3 years ago

More info when running in serial.

# 2021-03-10 19:38:37.962644: W tensorflow/core/framework/allocator.cc:101] Allocation of 86999040 exceeds 10% of system memory.
# 2021-03-10 19:38:38.064214: W tensorflow/core/framework/allocator.cc:101] Allocation of 173998080 exceeds 10% of system memory.
# 2021-03-10 19:38:38.252604: W tensorflow/core/framework/allocator.cc:101] Allocation of 173998080 exceeds 10% of system memory.
# 2021-03-10 19:38:38.461049: W tensorflow/core/framework/allocator.cc:101] Allocation of 173998080 exceeds 10% of system memory.

SpheMakh commented 3 years ago

@sjperkins any ideas on how to fix this?

o-smirnov commented 3 years ago

How many sources are there in your LSM?

o-smirnov commented 3 years ago

Drop the time chunk down to 10 btw (your time-int is 10 anyway), maybe Montblanc is simply running out of memory?

SpheMakh commented 3 years ago

> How many sources are there in your LSM?

Only two.

sjperkins commented 3 years ago

> More info when running in serial.

# 2021-03-10 19:38:37.962644: W tensorflow/core/framework/allocator.cc:101] Allocation of 86999040 exceeds 10% of system memory.
# 2021-03-10 19:38:38.064214: W tensorflow/core/framework/allocator.cc:101] Allocation of 173998080 exceeds 10% of system memory.
# 2021-03-10 19:38:38.252604: W tensorflow/core/framework/allocator.cc:101] Allocation of 173998080 exceeds 10% of system memory.
# 2021-03-10 19:38:38.461049: W tensorflow/core/framework/allocator.cc:101] Allocation of 173998080 exceeds 10% of system memory.

Yes, this seems to be a case of tensorflow warning about memory allocations. I don't think the above are errors in their own right: 173998080 is 0.174GB which doesn't seem excessive to me.

https://stackoverflow.com/questions/53639067/tensorflow-cpu-memory-problem-allocation-exceeds-10-of-system-memory

The standard solution is to reduce the batch (chunk) size, as @o-smirnov has suggested: https://stackoverflow.com/a/56851691. Have you had any success with modifying the CubiCal chunking parameters? Otherwise, CubiCal may need to subdivide the data into finer chunks before feeding it into Montblanc.
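
For reference, a minimal sketch of the relevant parset settings (assuming the standard CubiCal convention that command-line flags like `--data-time-chunk` and `--dist-ncpu` map to sections and keys in the parset; the values here are illustrative only, not a recommendation for this dataset):

```ini
# Smaller chunks mean smaller Montblanc batches and lower peak memory.
[data]
time-chunk = 10    ; timeslots per chunk (matches --data-time-chunk 10 above)

[dist]
ncpu = 1           ; serial mode, as suggested earlier in the thread
```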

Cutting and pasting from the Stack Overflow question:

2018-12-05 19:20:44.932780: W tensorflow/core/framework/allocator.cc:122] Allocation of 3359939800 exceeds 10% of system memory.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Abandon (core dumped)

Do you see any log messages mentioning that the process was terminated due to std::bad_alloc?

Kincaidr commented 3 years ago

I am getting an error related to this issue:

INFO      11:53:06 - data_handler       [x01] [10.3/11.2 23.5/36.2 9.3Gb] reading FLAG
INFO      11:53:11 - data_handler       [x01] [12.1/13.0 24.1/36.8 9.3Gb] reading BITFLAG
INFO      11:53:31 - ms_tile            [x01] [16.3/17.2 28.2/41.0 9.3Gb]   83.96% input visibilities flagged and/or deselected
INFO      11:53:37 - data_handler       [x01] [17.5/18.4 53.3/66.0 17.7Gb] reading MODEL_DATA
INFO      11:53:49 - main               [0.9 12.7 22.5Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
 Traceback (most recent call last):
  File "/home/kincaid/Software/CubiCal/cubical/main.py", line 578, in main
    stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
  File "/home/kincaid/Software/CubiCal/cubical/workers.py", line 226, in run_process_loop
    return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
  File "/home/kincaid/Software/CubiCal/cubical/workers.py", line 286, in _run_multi_process_loop
    if not io_futures[itile].result():
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

I am not using a tigger LSM, only a regions file and a DicoModel: --model-list MODEL_DATA+-image_DI_beam_2poly_beam_zwcl.DicoModel@A2631_4sources.reg:image_DI_beam_2poly_beam_zwcl.DicoModel@A2631_4sources.reg

I have also tried running in serial mode (--dist-ncpu 1) and lowering the time chunk (--data-time-chunk 10), but I still get the same error.

full log file: ddcal_0.log

o-smirnov commented 3 years ago

Hmmm, from the log it doesn't seem to go fully serial. Try --dist-ncpu 1 --dist-nworker 0 --dist-nthread 0 for true serial, and post the log please.

Kincaidr commented 3 years ago

INFO      14:57:12 - data_handler       [0.9 13.7 9.3Gb]   will save output flags into BITFLAG 'cubical' (2), and into FLAG/FLAG_ROW
INFO      14:57:12 - ms_tile            [0.9 13.7 9.3Gb] tile 0/22: reading MS rows 0~195299
INFO      14:57:12 - data_handler       [0.9 13.7 9.3Gb] reading CORRECTED_DATA
INFO      14:57:21 - data_handler       [10.4 24.4 9.3Gb] reading FLAG
INFO      14:57:26 - data_handler       [12.2 25.0 9.3Gb] reading BITFLAG
INFO      14:57:46 - ms_tile            [16.4 29.2 9.3Gb]   83.96% input visibilities flagged and/or deselected
INFO      14:57:52 - data_handler       [17.6 54.2 17.7Gb] reading MODEL_DATA
semaphore initilization: Permission denied

Full log: ddcal_0.log

o-smirnov commented 3 years ago

Woah, that's a new one. Smells like a shared memory issue to me though. Which machine are you on?

Kincaidr commented 3 years ago

crosby

o-smirnov commented 3 years ago

Try now?

Kincaidr commented 3 years ago

It is running, and still has not terminated. What did you change?

o-smirnov commented 3 years ago

I cleaned up a bunch of old semaphores under /dev/shm.
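
For anyone hitting this on their own machine, a minimal sketch of that cleanup (assuming a Linux host, where POSIX named semaphores are backed by `sem.<name>` files under /dev/shm; the semaphore name in the comment is a placeholder):

```shell
# POSIX named semaphores appear as files named sem.<name> under /dev/shm.
# List them with owner and timestamp to spot stale ones left by dead runs:
ls -l /dev/shm/sem.* 2>/dev/null || echo "no named semaphores found"

# Remove only entries you are certain belong to dead processes, e.g.
# (placeholder name):
#   rm /dev/shm/sem.some_stale_semaphore
```

Semaphores owned by another user produce exactly the "Permission denied" symptom above, so an admin may need to do the removal.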

Sinah-astro commented 3 years ago

> I cleaned up a bunch of old semaphores under /dev/shm.

I'm also getting an error just like Robert's. I'm looking at the /dev/shm directory on my machine (Hall) and I'm clueless about what could be old here.

JSKenyon commented 3 years ago

@Sinah-astro Unfortunately I do not have an account on Oates. I did ask Oleg to check /dev/shm and he didn't see anything too alarming. Could you please post the full error message along with your parset/command?

Sinah-astro commented 3 years ago

> @Sinah-astro Unfortunately I do not have an account on Oates. I did ask Oleg to check /dev/shm and he didn't see anything too alarming. Could you please post the full error message along with your parset/command?

I actually meant 'Hall' but here is the error message. I'll also attach the logfile:

INFO      18:08:57 - data_handler       [x01] [9.3/10.6 33.8/47.6 70.8Gb] reading MODEL_DATA
semaphore initilization: Permission denied
INFO      18:09:00 - main               [1.2 13.8 73.2Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
 Traceback (most recent call last):
  File "/home/manaka/software/ddfacet/ddfenv/lib/python3.6/site-packages/cubical/main.py", line 578, in main
    stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
  File "/home/manaka/software/ddfacet/ddfenv/lib/python3.6/site-packages/cubical/workers.py", line 226, in run_process_loop
    return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
  File "/home/manaka/software/ddfacet/ddfenv/lib/python3.6/site-packages/cubical/workers.py", line 286, in _run_multi_process_loop
    if not io_futures[itile].result():
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

cc.log cubic.txt

I also saw this error:

ERROR     18:08:43 - casa_db_adaptor    [1.1 12.0 66.6Gb] Gaintables cannot be written in Python 3 mode due to current casacore implementation issues

Is it alarming?

JSKenyon commented 3 years ago

The gaintable error is expected. It won't affect your results at all - it just means there won't be casa-style gain tables.

Unfortunately, I don't have an account on Hall either. @o-smirnov could you please check when you have a moment? This is likely zombie process related.

bennahugo commented 3 years ago

Last: no, gain table writing is deprecated for the moment. Ignore it.

I notice your memory usage is rather high (the solver itself is sitting at over 10%), and you have 5 directions (which blows up the model cube). Try decreasing max-chunks to 1 and see if the memory consumption goes down - it usually does. Unfortunately, this means it will effectively run in serial.

Cheers,

bennahugo commented 3 years ago

Also, please label your flags uniquely - if they were called 'cubical' during 2GC, subsequent runs will unflag previously flagged data where no model exists. I would suggest that once you have this working, you go back, restore your 2GC flags, and repredict your DicoModel into MODEL_DATA.

INFO 18:07:41 - main [0.2 11.1 66.6Gb] - apply ............................................. = -cubical
INFO 18:07:41 - main [0.2 11.1 66.6Gb] - auto-init ......................................... = legacy
INFO 18:07:41 - main [0.2 11.1 66.6Gb] - save .............................................. = cubical
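
A sketch of what uniquely labelled flagsets might look like in the parset (assuming these log lines correspond to the flags apply/save options; the flagset name below is a placeholder chosen per run):

```ini
# Give each run its own flagset so a rerun cannot clobber an earlier one.
[flags]
apply = -ddcal_round1   ; apply all existing flags except this run's own
save = ddcal_round1     ; save this run's flags under a unique name
```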

Sinah-astro commented 3 years ago

Should I use CASA to restore the flags, and then set the parameters --Predict-ColName=COLUMN and --Predict-InitDicoModel=FILENAME to MODEL_DATA and dde_mask.DicoModel (the DicoModel from the previous DDF run) respectively? Will this label them uniquely?
