lofar-astron / factor

Facet calibration for LOFAR
http://www.astron.nl/citt/facet-doc
GNU General Public License v2.0
19 stars 12 forks source link

crash factor/calibrate-standalone #48

Closed rvweeren closed 8 years ago

rvweeren commented 8 years ago

Not idea why this error occurred or what it means, see below the log.

DEBUG - No. of patches used in the model: 1 DEBUG - Building expression tree... done. DEBUG - timer: avg = 5.73 ms, total = 5.73 ms, count = 1 INFO - Selection: Baselines: 1485/1485 Correlations: 4/4 Parameters: 110/110 INFO - No. of coefficients to estimate: 110 DEBUG - chunk: 1/3 cells: 20 iterations: 101 status: 0/20/0/0 converged/stopped/noreduction/singular DEBUG - counters: all: 143985600 flagged: 5157868 zero weight: 0 invalid residual: 0 invalid derivative: 0 invalid weight: 0 outlier: 0 DEBUG - timers: evaluate: 20.1213/149985/0.000134156 coeff map: 0.257675/599940/4.29501e-07 process cell: 26.3867/11998800/2.19911e-06 modify sample: 4.39511/138827732/3.16588e-08 condition eq: 13.8299/138827732/9.96193e-08 total/count/average DEBUG - timers: all: 56.2889/1/56.2889 equate: 47.9561/101/0.474813 iterate: 8.27592/101/0.0819398 total/count/average DEBUG - chunk: 2/3 cells: 20 iterations: 101 status: 0/20/0/0 converged/stopped/noreduction/singular DEBUG - counters: all: 143985600 flagged: 5260888 zero weight: 0 invalid residual: 0 invalid derivative: 0 invalid weight: 0 outlier: 0 DEBUG - timers: evaluate: 20.1791/149985/0.000134541 coeff map: 0.261212/599940/4.35396e-07 process cell: 25.794/11998800/2.14971e-06 modify sample: 3.85953/138724712/2.78215e-08 condition eq: 13.8343/138724712/9.97246e-08 total/count/average DEBUG - timers: all: 55.8183/1/55.8183 equate: 47.4425/101/0.469728 iterate: 8.32563/101/0.082432 total/count/average DEBUG - chunk: 3/3 cells: 12 iterations: 101 status: 0/12/0/0 converged/stopped/noreduction/singular DEBUG - counters: all: 86391360 flagged: 3105548 zero weight: 0 invalid residual: 0 invalid derivative: 0 invalid weight: 0 outlier: 0 DEBUG - timers: evaluate: 13.8485/149985/9.23329e-05 coeff map: 0.23516/599940/3.91972e-07 process cell: 15.3144/7199280/2.12721e-06 modify sample: 2.20826/83285812/2.65143e-08 condition eq: 8.24952/83285812/9.90507e-08 total/count/average DEBUG - timers: all: 35.5057/1/35.5057 equate: 30.4596/101/0.301581 iterate: 5.01388/101/0.0496423 total/count/average DEBUG - Estimate statistics: TIMER s ESTIMATE ALL total 147.615 count 1 avg 147.615 Timing of expression tree nodes not supported, please recompile with -DLOFAR_BBSKERNEL_EXPR_TIMER to enable this feature.

DEBUG - Executing a Finalize command: Command: Finalize INFO - bbs-reducer terminated successfully. INFO - clone CA : avg = 532 ns, total = 161 s, count = 303536310 INFO - new CA : avg = 76.9 ns, total = 23.4 s, count = 304252410

2016-04-08 05:51:09 WARNING node.lofar.calibrate_stand_alone.L343226_SBgr025-10_uv.dppp.pre-cal_chunk9_12656E813t_0g.mssort_into_Groups: /home/rvweeren/software/lofarapr2016/opt/LofIm/bin/bbs-reducer stderr: log4cplus:WARN Property configuration file "bbs-reducer.log_prop" not found. log4cplus:WARN Using basic logging configuration.

2016-04-08 05:51:12 DEBUG facetselfcal_facet_patch_349.executable_args: Results for job 9 submitted by ('127.0.0.1', 45207) 2016-04-08 05:51:12 INFO node.lofar.calibrate_stand_alone: Total time 813.8626s; user time: 810.0842s; system time: 1.8192s 2016-04-08 05:51:12 DEBUG node.lofar.calibrate_stand_alone: Start time was 1460108258.4141s; end time was 1460109072.2770s 2016-04-08 05:51:14 DEBUG facetselfcal_facet_patch_349.executable_args: Remote command stdout: Command: create tablename="/p3600/rinto_Groups/instrument"6_SBgr025-10_uv.dppp.pre-cal_chunk9_12656E813t0g.mssort Command: adddef Gain:0:0:Ampl values=1.0 Gain:0:0:Ampl type=scalar values: 1 shape=[1, 1] pert=1e-06 pert_rel=y Wrote new defaultvalue record for parameter Gain:0:0:Ampl Command: adddef Gain:1:1:Ampl values=1.0 Gain:1:1:Ampl type=scalar values: 1 shape=[1, 1] pert=1e-06 pert_rel=y Wrote new defaultvalue record for parameter Gain:1:1:Ampl Command: adddef Gain:0:0:Real values=1.0 Gain:0:0:Real type=scalar values: 1 shape=[1, 1] pert=1e-06 pert_rel=y Wrote new defaultvalue record for parameter Gain:0:0:Real Command: adddef Gain:1:1:Real values=1.0 Gain:1:1:Real type=scalar values: 1 shape=[1, 1] pert=1e-06 pert_rel=y Wrote new defaultvalue record for parameter Gain:1:1:Real Command: adddef DirectionalGain:0:0:Ampl values=1.0 DirectionalGain:0:0:Ampl type=scalar values: 1 shape=[1, 1] pert=1e-06 pert_rel=y Wrote new defaultvalue record for parameter DirectionalGain:0:0:Ampl Command: adddef DirectionalGain:1:1:Ampl values=1.0 DirectionalGain:1:1:Ampl type=scalar values: 1 shape=[1, 1] pert=1e-06 pert_rel=y Wrote new defaultvalue record for parameter DirectionalGain:1:1:Ampl Command: adddef DirectionalGain:0:0:Real values=1.0 DirectionalGain:0:0:Real type=scalar values: 1 shape=[1, 1] pert=1e-06 pert_rel=y Wrote new defaultvalue record for parameter DirectionalGain:0:0:Real Command: adddef DirectionalGain:1:1:Real values=1.0 DirectionalGain:1:1:Real type=scalar values: 1 shape=[1, 1] pert=1e-06 pert_rel=y Wrote new defaultvalue record for parameter DirectionalGain:1:1:Real Command: adddef Clock values=0.0, pert=1e-15 Clock type=scalar values: 0 shape=[1, 1] pert=1e-15 pert_rel=y Wrote new defaultvalue record for parameter Clock Command: quit

2016-04-08 05:51:14 WARNING facetselfcal_facet_patch_349.executable_args: Remote command stderr: /home/rvweeren/software/lofarapr2016/opt/LofIm/lib/python2.7/site-packages/lofarpipe/support/utilities.pyc : Using default subprocess module! log4cplus:WARN Property configuration file "parmdbm.log_prop" not found. log4cplus:WARN Using basic logging configuration.

2016-04-08 05:51:15 DEBUG facetselfcal_facet_patch_349.executable_args: Adding node_logging_information 2016-04-08 05:51:15 ERROR facetselfcal_facet_patch_349.executable_args: A job has failed and error_tolerance is not set. Bailing out! 2016-04-08 05:51:15 WARNING facetselfcal_facet_patch_349.executable_args: Note: recipe outputs are not complete 2016-04-08 05:51:15 WARNING facetselfcal_facet_patch_349.executable_args: recipe executable_args completed with errors 2016-04-08 05:51:15 WARNING facetselfcal_facet_patch_349: calibrate-stand-alone_new reports failure (using executable_args recipe) 2016-04-08 05:51:15 ERROR facetselfcal_facet_patch_349: 2016-04-08 05:51:15 ERROR facetselfcal_facet_patch_349: Failed pipeline run: facet_patch_349 2016-04-08 05:51:15 ERROR facetselfcal_facet_patch_349: Detailed exception information: 2016-04-08 05:51:15 ERROR facetselfcal_facet_patch_349: <class 'lofarpipe.support.lofarexceptions.PipelineRecipeFailed'> 2016-04-08 05:51:15 ERROR facetselfcal_facet_patch_349: calibrate-stand-alone_new failed 2016-04-08 05:51:15 ERROR facetselfcal_facet_patch_349: 2016-04-08 05:51:15 ERROR facetselfcal_facet_patch_349: LOFAR Pipeline finished unsuccesfully. 2016-04-08 05:51:15 WARNING facetselfcal_facet_patch_349: recipe facetselfcal_facet_patch_349 completed with errors

AHorneffer commented 8 years ago

One of the calls to BBS failed, and judging from the log-snipped apparently not the one to finish last. (Because Factor cannot deal well with some of the processes failing, it just stops there.)

Can you search in the log for the BBS call that actually failed? Or if you prefer: my guess is that the machine ran out of memory, so starting fewer BBS runs (reducing ncpu in the factor parset) should help.

rvweeren commented 8 years ago

The log is huge, I tried to find the bbs error but it is not clear what to me what string to search for ("error", "failure", "fault", did not turn up something).

I restarted and now it runs fine (would be surprised it is a memory issue since BBS uses only a few percent (in total) at most on my machine, but maybe it is indeed a "machine" related issue, interfering with some other process)

rvweeren commented 8 years ago

I will close this for now, since I'm not yet convinced it is a factor/bbs issue.