Closed AymenFJA closed 7 months ago
Session is attached here: archive.zip
Fails due to this code (after I moved the `while` loop under `_manager`):
https://github.com/radical-cybertools/radical.pilot/blob/8e50e5269a0ee499d3d8585e56a04ebb91b067af/src/radical/pilot/raptor/worker.py#L110-L132
Every rank should have that `while` loop, but only the manager should send the re-register command, for example:
    # the manager (rank 0) registers the worker with the master
    if self._manager:
        self._log.debug('register: %s / %s', self._uid, self._raptor_id)
        self._ctrl_pub.put(rpc.CONTROL_PUBSUB, reg_msg)

    # wait for raptor response
    self._log.debug('wait for registration to complete')
    count = 0
    while not self._reg_event.wait(timeout=5):
        if count < self._hb_register_count:
            count += 1
            if self._manager:
                self._log.debug('re-register: %s / %s',
                                self._uid, self._raptor_id)
                self._ctrl_pub.put(rpc.CONTROL_PUBSUB, reg_msg)
        else:
            self.stop()
            self.join()
            self._log.error('registration with master timed out')
            raise RuntimeError('registration with master timed out')
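To illustrate the pattern outside of RP, here is a minimal standalone sketch. It is not the RP implementation: `wait_for_registration`, `simulate`, and the toy "master" that confirms all ranks after a fixed number of requests are hypothetical names made up for this example, and ranks are modeled as plain threads with one `threading.Event` each.

```python
import threading


def wait_for_registration(reg_event, is_manager, send_register,
                          retries=5, timeout=0.05):
    """Block until reg_event is set; only the manager re-sends the request.

    Returns the number of re-register messages this rank sent and raises
    RuntimeError if no confirmation arrives within the retry budget.
    """
    resent = 0
    count = 0
    while not reg_event.wait(timeout=timeout):
        if count < retries:
            count += 1
            if is_manager:
                send_register()   # non-manager ranks only keep waiting
                resent += 1
        else:
            raise RuntimeError('registration with master timed out')
    return resent


def simulate(n_ranks=3, confirm_after=2):
    """Run n_ranks threads against a toy master (an assumption of this sketch)
    that confirms every rank once it has seen confirm_after requests."""
    events = [threading.Event() for _ in range(n_ranks)]
    received = []
    lock = threading.Lock()

    def send_register():
        with lock:
            received.append('register')
            if len(received) >= confirm_after:
                for ev in events:
                    ev.set()

    results = {}

    def run(rank):
        try:
            results[rank] = wait_for_registration(
                events[rank], rank == 0, send_register)
        except RuntimeError as exc:
            results[rank] = exc

    send_register()   # initial registration, sent by the manager (rank 0)
    threads = [threading.Thread(target=run, args=(r,)) for r in range(n_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, len(received)


if __name__ == '__main__':
    per_rank, n_msgs = simulate()
    print('re-registers per rank:', per_rank)
    print('messages seen by master:', n_msgs)
```

In this sketch every rank runs the same wait loop, so each one notices a timeout on its own, while the number of registration messages stays independent of the number of ranks.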
Oh, this seems to be a side effect of the changed worker registration maybe? Only the manager rank (rank 0) waits for the registry message now in the base class c'tor. Thanks, we should have a fix for this shortly - sorry for not catching it during review...
@andre-merzky quick comment - what I showed is still in devel, that part hasn't been released yet
Ack. Let's see what @AymenFJA tests yield...
@andre-merzky @mtitov, my test passed, and it works as expected. Thank you both. This ticket can be closed once the PR is merged.
This issue is blocking a Parsl-RP PR:
RCT Stack: