ratt-ru / meqtrees

A library for implementing radio astronomical Measurement Equations
http://meqtrees.net

Occasional failure of meqserver/python in batch mode #793

Open gijzelaerr opened 10 years ago

gijzelaerr commented 10 years ago
at 2011-02-21 23:31:48 Tony Willis reported:

Occasional failure of meqserver/python in batch mode

gijzelaerr commented 10 years ago

Original comment thread migrated from Bugzilla

at 2011-02-21 23:31:48 Tony Willis replied:

Occasionally, when running in batch mode, the meqserver can die or halt ungracefully. The resulting Python traceback is given at the end of this report. A complete simulation is available for testing in the directory ~twillis/ASKAP_demo on birch.

Output from a failed run ...

meqserver(meqserver.py:289:stop_default_mqs): meqserver not exiting cleanly, killing it
Python: =================== stopping OCTOPUSSY ========================
254 0.00037146 -0.00032291 0.0 10.642 1400000000.0
========== Running batch example sim
Stopping meqserver
Bye!
Traceback (most recent call last):
  File "batch_sim_two.py", line 22, in <module>
    mod._tdl_job_1_simulate_MS(mqs,None,wait=True);
  File "example-sim.py", line 186, in _tdl_job_1_simulate_MS
    mqs.execute('VisDataMux',mssel.create_io_request(),wait=wait);
  File "/home/gmims/twillis/Timba/install/symlinked-release/libexec/python/Timba/Apps/meqserver.py", line 173, in execute
    return self.meq('Node.Execute',rec,wait=wait);
  File "/home/gmims/twillis/Timba/install/symlinked-release/libexec/python/Timba/Apps/meqserver.py", line 126, in meq
    msg = self.await(replyname,resume=True,timeout=wait);
  File "/home/gmims/twillis/Timba/install/symlinked-release/libexec/python/Timba/Apps/multiapp_proxy.py", line 518, in await
    res = self._pwp.await(self._rcv_prefix + what,timeout=await_timeout,resume=resume);
  File "/home/gmims/twillis/Timba/install/symlinked-release/libexec/python/Timba/octopussy.py", line 433, in await
    self.resume_events();
  File "/home/gmims/twillis/Timba/install/symlinked-release/libexec/python/Timba/octopussy.py", line 413, in resume_events
    self._lock.release();
  File "/home/gmims/twillis/local/lib/python2.6/threading.py", line 136, in release
    raise RuntimeError("cannot release un-aquired lock")
RuntimeError: cannot release un-aquired lock
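For context, the final exception in this traceback is easy to reproduce in isolation: releasing a threading.RLock that the calling thread does not hold raises exactly this RuntimeError. A minimal sketch (Python 3 shown; the failing run used Python 2.6, whose error message carried the "un-aquired" typo):

```python
import threading

lock = threading.RLock()

# octopussy.py's resume_events() calls self._lock.release(); if a racing
# thread has already released the lock (or this thread never acquired it),
# release() raises instead of returning quietly.
try:
    lock.release()
except RuntimeError as exc:
    print(exc)  # e.g. "cannot release un-acquired lock"
```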

at 2011-02-22 22:13:25 Oleg Smirnov replied:

Looks like it's related to (or the same thing as) bug 573, which ought to tell you how old this thing is!

As already discussed by e-mail: have you got a script that produces it consistently on, say, birch?

at 2011-02-22 23:06:53 Tony Willis replied:

Yep, go to /home/twillis/ASKAP_demo on birch. There's a README there that gives you the command to run things in batch mode. You may want to change my sky model to something tigger-compatible. This failure happens often enough that it's annoying, especially when it's near the end of a 6-hour job!

at 2011-04-07 21:43:50 Tony Willis replied:

Created an attachment (id=73): a dump out of a batch processing failure

This random failure during, or at the end of, batch processing keeps cropping up!

at 2011-04-11 11:52:07 Oleg Smirnov replied:

That's interesting, because it resolutely refuses to happen with the sims I'm doing over here. I'm copying your data over to Cape Town to give it a try (I presume the same directory on birch is still good?), to see if it's perhaps also dependent on Linux variant. If not, then it must be related to some MeqTrees feature in your simulation that I'm not using in mine, which is at least a data point. I shall keep looking, anyway.

at 2011-06-30 13:30:35 Oleg Smirnov replied:

* Bug 764 has been marked as a duplicate of this bug. *

at 2011-06-30 15:25:20 Oleg Smirnov replied:

* Bug 573 has been marked as a duplicate of this bug. *

at 2011-07-15 12:58:46 Oleg Smirnov replied:

OK, I think I have this licked in the current version (r8286). Serious stress-testing of Tony's code has yet to yield a crash. Tony: please do some testing.

There is still an underlying problem (bug 576) that I have at best worked around, not properly fixed. A real fix is too complicated at this stage, so we'll have to leave it until the next release cycle. I'm therefore downgrading this bug and taking it off the release milestone.
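The actual r8286 change is not shown in this thread. As a rough sketch of the kind of workaround described here (assumed: tolerating a mismatched release rather than fixing the underlying race of bug 576), a lock wrapper could swallow the spurious RuntimeError so a long batch run is not killed near the finish line. TolerantLock is a hypothetical name, not MeqTrees code:

```python
import threading

class TolerantLock:
    """Hypothetical sketch, not the actual r8286 change: wraps an RLock
    so a mismatched release() is ignored instead of raising RuntimeError
    and aborting the batch job."""

    def __init__(self):
        self._lock = threading.RLock()

    def acquire(self, blocking=True):
        return self._lock.acquire(blocking)

    def release(self):
        try:
            self._lock.release()
        except RuntimeError:
            # The calling thread did not hold the lock: a racing
            # event-handling path released it first. Ignore the error;
            # the underlying race (bug 576) still needs a proper fix.
            pass
```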