ratt-ru / meqtrees

A library for implementing radio astronomical Measurement Equations
http://meqtrees.net

Meqpipeliner hangs when connection is not established #879

Open bennahugo opened 6 years ago

bennahugo commented 6 years ago

The meqtree-pipeliner script hangs when it cannot establish a connection to meqserver, which in turn causes pipelines to hang indefinitely.

running: docker start -a calibrator_Gjones_subtract_lsm0-140267809901120151517920471
running: /usr/bin/meqtree-pipeliner.py --mt 16 -c /code/tdlconf.profiles [stefcal] ms_sel.ms_read_flags=1 ms_sel.input_column=DATA ms_sel.field_index=0 ms_sel.msname=/home/jenkins/msdir/12A-405.sb7601493.eb10633016.56086.127048738424-corr.ms stefcal_gain.table=/home/jenkins/output/12A-405.sb7601493.eb10633016.56086.127048738424-corr.gain.cp tiggerlsm.lsm_subset=all ms_wfl.write_bitflag=stefcal do_output=CORR_DATA stefcal_gain.enabled=1 stefcal_gain.flag_chisq=0 ms_sel.ms_fill_legacy_flags=1 stefcal_gain.flag_ampl=1 ms_sel.ddid_index=0 ms_sel.tile_size=512 ms_sel.ms_write_flag_policy="'replace set'" ms_rfl.read_flagsets=-stefcal stefcal_gain.reset=1 stefcal_gain.freqint=64 stefcal_gain.implementation=GainDiagPhase stefcal_gain.flag_chisq_threshold=10 ms_rfl.read_legacy_flags=1 stefcal_gain.flag_ampl_low=0.15 ms_sel.ms_corr_sel='2x2' stefcal_gain.flag_ampl_high=2.0 ms_sel.ms_write_flags=1 stefcal_gain.mode=solve-save tiggerlsm.filename=/home/jenkins/output/vla_NGC417_LBand-LSM0.lsm.html stefcal_gain.timeint=20 ms_sel.output_column=CORRECTED_DATA /usr/local/lib/python2.7/dist-packages/Cattery/Calico/calico-stefcal.py =stefcal 
### Starting meqserver
Traceback (most recent call last):
  File "/usr/bin/meqtree-pipeliner.py", line 77, in <module>
    mqs = meqserver.default_mqs(wait_init=10,extra=["-mt",str(options.mt)]+(["-python_memprof"] if options.memprof else []));
  File "/usr/lib/python2.7/dist-packages/Timba/Apps/meqserver.py", line 262, in default_mqs
    mqs = meqserver(extra=extra,**args);
  File "/usr/lib/python2.7/dist-packages/Timba/Apps/meqserver.py", line 94, in __init__
    multiapp_proxy.__init__(self,appid,client_id,spawn=spawn,**kwargs);
  File "/usr/lib/python2.7/dist-packages/Timba/Apps/multiapp_proxy.py", line 211, in __init__
    self.ensure_connection(wait_init);
  File "/usr/lib/python2.7/dist-packages/Timba/Apps/multiapp_proxy.py", line 438, in ensure_connection
    raise RuntimeError,"timeout waiting for connection";
RuntimeError: timeout waiting for connection

o-smirnov commented 6 years ago

Can you point me at an easily reproducible case? I agree the pipeliner should bomb out with an error rather than hang, but the fact that it can't establish a connection in the first place is indicative of some other problem.
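
Until the pipeliner itself exits on this error, a calling pipeline could guard the step with a hard deadline so that a hang becomes a failure. The following is only a minimal sketch in Python 3 using the standard library; `run_with_deadline` and `deadline_s` are hypothetical names, not part of meqtrees or stimela, and the sleeping child process merely stands in for a hung pipeliner.

```python
import subprocess
import sys


def run_with_deadline(cmd, deadline_s):
    """Run cmd and return its exit code; raise RuntimeError if it outlives deadline_s.

    Hypothetical helper for illustration, not part of meqtrees or stimela.
    """
    try:
        # On timeout, subprocess.run kills the child, waits for it,
        # and then raises TimeoutExpired.
        result = subprocess.run(cmd, timeout=deadline_s)
    except subprocess.TimeoutExpired:
        raise RuntimeError(
            "command exceeded %.0f s deadline: %r" % (deadline_s, cmd)
        )
    return result.returncode


if __name__ == "__main__":
    # Stand-in for a hung pipeliner: a child that sleeps far past the deadline.
    hung = [sys.executable, "-c", "import time; time.sleep(60)"]
    try:
        run_with_deadline(hung, deadline_s=1)
    except RuntimeError as exc:
        print("failed fast:", exc)
```

Note that this only kills the direct child. Any process the child has spawned itself (a meqserver instance, for example) would need separate cleanup, and for containerised steps the deadline would have to be enforced on the container rather than on a local process.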

bennahugo commented 6 years ago

The failure occurs when running this stimela script: https://jenkins.meqtrees.net/job/ddfacet-generate-refims/ws/stable/stimela-test-ngc417.py/*view*/
on this dataset: https://jenkins.meqtrees.net/job/ddfacet-generate-refims/ws/stable/12A-405.sb7601493.eb10633016.56086.127048738424.tgz

It is sporadic, though: the same job works most of the time.