SpheMakh opened this issue 3 years ago
Could you re-run in serial mode (--dist-ncpu 1), on the odd chance that we get something more informative out of it before it dies?
More info when running in serial.
# 2021-03-10 19:38:37.962644: W tensorflow/core/framework/allocator.cc:101] Allocation of 86999040 exceeds 10% of system memory.
# 2021-03-10 19:38:38.064214: W tensorflow/core/framework/allocator.cc:101] Allocation of 173998080 exceeds 10% of system memory.
# 2021-03-10 19:38:38.252604: W tensorflow/core/framework/allocator.cc:101] Allocation of 173998080 exceeds 10% of system memory.
# 2021-03-10 19:38:38.461049: W tensorflow/core/framework/allocator.cc:101] Allocation of 173998080 exceeds 10% of system memory.
@sjperkins any ideas on how to fix this?
How many sources are there in your LSM?
Drop the time chunk down to 10 btw (your time-int is 10 anyway), maybe Montblanc is simply running out of memory?
Yes, this seems to be a case of TensorFlow warning about memory allocations. I don't think the above are errors in their own right: 173998080 bytes is about 0.17 GB, which doesn't seem excessive to me.
The standard solution is to reduce the batch (chunk) size, as @o-smirnov has suggested: https://stackoverflow.com/a/56851691. Have you had any success with this by modifying the CubiCal chunking parameters? Otherwise, CubiCal may need to subdivide the data into finer chunks before feeding it into Montblanc.
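For reference, a sketch of the chunking knobs under discussion, using the flag names that appear elsewhere in this thread (the parset name is a placeholder):

```shell
# Shrink the chunks CubiCal feeds to Montblanc; smaller time/freq chunks
# mean smaller per-allocation TensorFlow buffers.
gocubical ddcal.parset --data-time-chunk 10 --data-freq-chunk 64
```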
Cutting and pasting from the Stack Overflow question:
2018-12-05 19:20:44.932780: W tensorflow/core/framework/allocator.cc:122] Allocation of 3359939800 exceeds 10% of system memory.
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Abandon (core dumped)
Do you see any log messages mentioning that the process was terminated due to std::bad_alloc?
I am getting an error related to this issue:
INFO 11:53:06 - data_handler [x01] [10.3/11.2 23.5/36.2 9.3Gb] reading FLAG
INFO 11:53:11 - data_handler [x01] [12.1/13.0 24.1/36.8 9.3Gb] reading BITFLAG
INFO 11:53:31 - ms_tile [x01] [16.3/17.2 28.2/41.0 9.3Gb] 83.96% input visibilities flagged and/or deselected
INFO 11:53:37 - data_handler [x01] [17.5/18.4 53.3/66.0 17.7Gb] reading MODEL_DATA
INFO 11:53:49 - main [0.9 12.7 22.5Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
Traceback (most recent call last):
File "/home/kincaid/Software/CubiCal/cubical/main.py", line 578, in main
stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
File "/home/kincaid/Software/CubiCal/cubical/workers.py", line 226, in run_process_loop
return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
File "/home/kincaid/Software/CubiCal/cubical/workers.py", line 286, in _run_multi_process_loop
if not io_futures[itile].result():
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I am not using a tigger LSM, only a regions file and DicoModel: --model-list MODEL_DATA+-image_DI_beam_2poly_beam_zwcl.DicoModel@A2631_4sources.reg:image_DI_beam_2poly_beam_zwcl.DicoModel@A2631_4sources.reg
I have also tried running in serial mode (--dist-ncpu 1) and lowering the time chunk (--data-time-chunk 10), but I still get the same error.
full log file: ddcal_0.log
Hmmm, from the log it doesn't seem to go fully serial. Try --dist-ncpu 1 --dist-nworker 0 --dist-nthread 0 for true serial, and post the log please.
INFO 14:57:12 - data_handler [0.9 13.7 9.3Gb] will save output flags into BITFLAG 'cubical' (2), and into FLAG/FLAG_ROW
INFO 14:57:12 - ms_tile [0.9 13.7 9.3Gb] tile 0/22: reading MS rows 0~195299
INFO 14:57:12 - data_handler [0.9 13.7 9.3Gb] reading CORRECTED_DATA
INFO 14:57:21 - data_handler [10.4 24.4 9.3Gb] reading FLAG
INFO 14:57:26 - data_handler [12.2 25.0 9.3Gb] reading BITFLAG
INFO 14:57:46 - ms_tile [16.4 29.2 9.3Gb] 83.96% input visibilities flagged and/or deselected
INFO 14:57:52 - data_handler [17.6 54.2 17.7Gb] reading MODEL_DATA
semaphore initilization: Permission denied
Full log: ddcal_0.log
Woah, that's a new one. Smells like a shared memory issue to me though. Which machine are you on?
crosby
Try now?
It is running, and still has not terminated. What did you change?
I cleaned up a bunch of old semaphores under /dev/shm.
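For anyone hitting the same "semaphore initilization: Permission denied" error, a sketch of how to inspect for leftover named semaphores (the cleanup command is shown commented out so you can check ownership first; adjust the pattern to what your crashed runs actually left behind):

```shell
# Named POSIX semaphores live as sem.* files in /dev/shm; entries left
# behind by crashed runs can block other users, since /dev/shm is shared
# system-wide. List them, oldest first, with owners:
ls -lt /dev/shm/sem.* 2>/dev/null || echo "no named semaphores found"

# Then remove only your own stale entries, e.g.:
#   find /dev/shm -maxdepth 1 -user "$USER" -name 'sem.*' -delete
```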
I'm also getting an error just like Robert's. I'm looking at the directory on my machine (hall) and I'm clueless on what could be old here
@Sinah-astro Unfortunately I do not have an account on Oates. I did ask Oleg to check /dev/shm and he didn't see anything too alarming. Could you please post the full error message along with your parset/command?
I actually meant 'Hall' but here is the error message. I'll also attach the logfile:
INFO 18:08:57 - data_handler [x01] [9.3/10.6 33.8/47.6 70.8Gb] reading MODEL_DATA
semaphore initilization: Permission denied
INFO 18:09:00 - main [1.2 13.8 73.2Gb] Exiting with exception: BrokenProcessPool(A process in the process pool was terminated abruptly while the future was running or pending.)
Traceback (most recent call last):
File "/home/manaka/software/ddfacet/ddfenv/lib/python3.6/site-packages/cubical/main.py", line 578, in main
stats_dict = workers.run_process_loop(ms, tile_list, load_model, single_chunk, solver_type, solver_opts, debug_opts, out_opts)
File "/home/manaka/software/ddfacet/ddfenv/lib/python3.6/site-packages/cubical/workers.py", line 226, in run_process_loop
return _run_multi_process_loop(ms, load_model, solver_type, solver_opts, debug_opts, out_opts)
File "/home/manaka/software/ddfacet/ddfenv/lib/python3.6/site-packages/cubical/workers.py", line 286, in _run_multi_process_loop
if not io_futures[itile].result():
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
I also saw this error:
ERROR 18:08:43 - casa_db_adaptor [1.1 12.0 66.6Gb] Gaintables cannot be written in Python 3 mode due to current casacore implementation issues
Is it alarming?
The gaintable error is expected. It won't affect your results at all - it just means there won't be CASA-style gain tables.
Unfortunately, I don't have an account on Hall either. @o-smirnov could you please check when you have a moment? This is likely zombie process related.
Last: no the gain tables writing is deprecated for the moment. Ignore
I notice your memory usage is rather high (the solver itself is sitting at over 10%), and you have 5 directions, which blows up the model cube. Try decreasing max-chunks to 1 and see if the memory consumption goes down - it usually does. Unfortunately, this means it will effectively run in serial.
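To see why the direction count blows up the model cube: memory scales linearly with the number of directions. A back-of-envelope sketch (the array shape below is illustrative, not CubiCal's literal internal layout):

```python
# Rough model-cube size: a complex64 model term is held per direction,
# so memory grows linearly with n_dir. Illustrative shape:
# (dir, time, freq, ant, ant, 2x2 correlations).
def model_cube_gb(n_dir, n_time, n_freq, n_ant, bytes_per_vis=8):
    """Approximate per-chunk model-cube size in GB (complex64 = 8 bytes)."""
    n_elems = n_dir * n_time * n_freq * n_ant * n_ant * 2 * 2
    return n_elems * bytes_per_vis / 1e9

# 5 directions costs 5x a single-direction model for the same chunk,
# which is why reducing the chunk size (or max-chunks) helps.
one = model_cube_gb(1, n_time=10, n_freq=1024, n_ant=64)
five = model_cube_gb(5, n_time=10, n_freq=1024, n_ant=64)
```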
Also, you need to uniquely label your flags please - if they were called 'cubical' during 2GC, subsequent runs will unflag previously flagged data where no model exists. I would suggest that once you have this working, you go back, restore your 2GC flags, and repredict your DicoModel into MODEL_DATA.
INFO 18:07:41 - main [0.2 11.1 66.6Gb] - apply ............................................. = -cubical
INFO 18:07:41 - main [0.2 11.1 66.6Gb] - auto-init ......................................... = legacy
INFO 18:07:41 - main [0.2 11.1 66.6Gb] - save .............................................. = cubical
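A sketch of the relabelling being suggested (the flagset name cubical_dd and parset name are placeholders; --flags-save/--flags-apply correspond to the save/apply options shown in the log excerpt above):

```shell
# Save this DD run's flags under a distinct name, so it cannot
# overwrite or unflag the 2GC flagset saved under the default "cubical":
gocubical ddcal.parset --flags-save cubical_dd --flags-apply -cubical_dd
```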
Will I use CASA to restore, then set the parameters --Predict-ColName=COLUMN and --Predict-InitDicoModel=FILENAME to MODEL_DATA and dde_mask.DicoModel (the DicoModel from the previous DDF run) respectively? Will this uniquely label them?
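For concreteness, the repredict step being asked about might look like this - a sketch only, using the two options named above (the parset name and remaining arguments are placeholders):

```shell
# Repredict the DicoModel from the previous DDF run into MODEL_DATA:
DDF.py ddf.parset --Output-Mode Predict \
    --Predict-ColName MODEL_DATA \
    --Predict-InitDicoModel dde_mask.DicoModel
```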
This only happens when I use a tigger LSM, suggesting the issue may be in Montblanc. When I used MeqTrees to predict and specified a visibility model, the problem went away with all other parameters unchanged.
The full log is at https://pastebin.com/QknXvYAF