radical-cybertools / radical.pilot

RADICAL-Pilot
http://radical-cybertools.github.io/radical-pilot/index.html

RAPTOR Default worker fails: 'DefaultWorker' object has no attribute '_res_addr_put' #3124

Closed AymenFJA closed 7 months ago

AymenFJA commented 7 months ago

This issue is blocking a Parsl-RP PR:

aymen@surfacebook:~/radical.pilot.sandbox/rpex.session.surfacebook.aymen.019759.0004/pilot.0000/rpex.worker.000000$ cat rpex.worker.000000.err
Traceback (most recent call last):
  File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 52, in <module>
    run(sys.argv[1], sys.argv[2], sys.argv[3])
  File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 30, in run
    worker = cls(raptor_id)
  File "/home/aymen/ve/test_rpex_final/lib/python3.8/site-packages/radical/pilot/raptor/worker_default.py", line 45, in __init__
    self._res_put = ru.zmq.Putter('result',  self._res_addr_put)
AttributeError: 'DefaultWorker' object has no attribute '_res_addr_put'
Traceback (most recent call last):
  File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 52, in <module>
    run(sys.argv[1], sys.argv[2], sys.argv[3])
  File "/home/aymen/ve/test_rpex_final/bin/radical-pilot-raptor-worker", line 30, in run
    worker = cls(raptor_id)
  File "/home/aymen/ve/test_rpex_final/lib/python3.8/site-packages/radical/pilot/raptor/worker_default.py", line 45, in __init__
    self._res_put = ru.zmq.Putter('result',  self._res_addr_put)
AttributeError: 'DefaultWorker' object has no attribute '_res_addr_put'
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[196,1],3]
  Exit code:    1
--------------------------------------------------------------------------

RCT Stack:

(test_rpex_final) aymen@surfacebook:~/RPEX/RUN_RPEX$ radical-stack

  python               : /home/aymen/ve/test_rpex_final/bin/python3
  pythonpath           :
  version              : 3.8.10
  virtualenv           : /home/aymen/ve/test_rpex_final

  radical.gtod         : 1.41.0
  radical.pilot        : 1.46.2
  radical.saga         : 1.46.0
  radical.utils        : 1.46.0

AymenFJA commented 7 months ago

Session is attached here: archive.zip

mtitov commented 7 months ago

It fails due to this code (after I moved the while loop under _manager): https://github.com/radical-cybertools/radical.pilot/blob/8e50e5269a0ee499d3d8585e56a04ebb91b067af/src/radical/pilot/raptor/worker.py#L110-L132

Every rank should run that while loop, but only the manager should send the re-register command, for example:

        # the manager (rank 0) registers the worker with the master
        if self._manager:
            self._log.debug('register: %s / %s', self._uid, self._raptor_id)
            self._ctrl_pub.put(rpc.CONTROL_PUBSUB, reg_msg)

        # wait for raptor response
        self._log.debug('wait for registration to complete')
        count = 0
        while not self._reg_event.wait(timeout=5):
            if count < self._hb_register_count:
                count += 1
                if self._manager:
                    self._log.debug('re-register: %s / %s', 
                                    self._uid, self._raptor_id)
                    self._ctrl_pub.put(rpc.CONTROL_PUBSUB, reg_msg)
            else:
                self.stop()
                self.join()
                self._log.error('registration with master timed out')
                raise RuntimeError('registration with master timed out')
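
The same pattern can be tried outside RADICAL-Pilot with a small standalone sketch. Everything in it is a hypothetical stand-in: `wait_for_registration`, `send_register` and the retry and timeout values are made up for illustration and are not part of the RP API. The point is only that every rank blocks on the event, while the (re-)registration message is sent by the manager alone.

```python
import threading


def wait_for_registration(reg_event, is_manager, send_register,
                          max_retries=3, timeout=0.05):
    """Block until reg_event is set; return the number of retries used.

    Hypothetical helper, not RADICAL-Pilot code: `send_register` stands in
    for publishing the registration message on the control channel.
    """
    # only the manager (rank 0) sends the initial registration request
    if is_manager:
        send_register()

    retries = 0
    # every rank waits here until the registration event is set
    while not reg_event.wait(timeout=timeout):
        if retries >= max_retries:
            raise RuntimeError('registration with master timed out')
        retries += 1
        # only the manager re-sends the request on a timeout
        if is_manager:
            send_register()

    return retries
```

A non-manager rank calls this with `is_manager=False` and never sends anything, but it still does not return before the event is set, so whatever the registration reply provides is available once the call returns.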
andre-merzky commented 7 months ago

Oh, this seems to be a side effect of the changed worker registration, maybe? Only the manager rank (rank 0) now waits for the registry message in the base class constructor. Thanks, we should have a fix for this shortly. Sorry for not catching it during review...
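
One plausible reading of that failure mode, as a minimal sketch: it assumes that `_res_addr_put` is only populated once the registration reply arrives, which this thread does not confirm. `BaseWorker`, the `wait_on_all_ranks` switch, the timer and the address are hypothetical stand-ins, not the actual RP classes.

```python
import threading


class BaseWorker:
    def __init__(self, is_manager, wait_on_all_ranks):
        self._manager = is_manager
        self._reg_event = threading.Event()

        # stand-in for the master's reply arriving asynchronously
        threading.Timer(0.05, self._on_registered).start()

        # suspected bug: with wait_on_all_ranks=False only rank 0 waits
        if self._manager or wait_on_all_ranks:
            self._reg_event.wait(timeout=1.0)

    def _on_registered(self):
        # assumption: the result address is learned from the reply
        self._res_addr_put = 'tcp://127.0.0.1:10000'  # made-up address
        self._reg_event.set()


class DefaultWorker(BaseWorker):
    def __init__(self, is_manager, wait_on_all_ranks):
        super().__init__(is_manager, wait_on_all_ranks)
        # needs the attribute that is set by the registration reply
        self._res_put = ('result', self._res_addr_put)
```

Under these assumptions, `DefaultWorker(is_manager=False, wait_on_all_ranks=False)` raises the same kind of `AttributeError` as in the traceback, because the non-manager rank reaches the subclass constructor before the reply has been processed. With `wait_on_all_ranks=True` the constructor completes on every rank.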

mtitov commented 7 months ago

@andre-merzky quick comment: what I showed is still in devel, that part hasn't been released yet

andre-merzky commented 7 months ago

> @andre-merzky quick comment: what I showed is still in devel, that part hasn't been released yet

Ack. Let's see what @AymenFJA tests yield...

AymenFJA commented 7 months ago

@andre-merzky @mtitov, my test passed and it works as expected. Thank you both. This ticket can be closed once the PR is merged.