radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

EnTK errored on Cheyenne #144

Closed by Weiming-Hu 2 years ago

Weiming-Hu commented 3 years ago

I have recently encountered the following error on Cheyenne.

Process task-manager:                                                                                                                                                                                                                                                                     
Traceback (most recent call last):                                                                                                                                                                                                                                                        
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 197, in _tmgr                                                                                                                                                  
    self._update_resource(body['body'])                                                                                                                                                                                                                                                   
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 254, in _update_resource                                                                                                                                       
    self._rp_tmgr.add_pilots(pilot)                                                                                                                                                                                                                                                       
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/pilot/task_manager.py", line 702, in add_pilots                                                                                                                                                       
    pid = pilot.uid                                                                                                                                                                                                                                                                       
AttributeError: 'dict' object has no attribute 'uid'                                                                                                                                                                                                                                      

The above exception was the direct cause of the following exception:                                                                                                                                                                                                                      

Traceback (most recent call last):               
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 208, in _tmgr                                              
    raise EnTKError(ex) from ex                              
radical.entk.exceptions.EnTKError: 'dict' object has no attribute 'uid'         

The above exception was the direct cause of the following exception:

Traceback (most recent call last):     
  File "/glade/u/apps/ch/opt/python/3.7.9/gnu/9.1.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap                             
    self.run()                     
  File "/glade/u/apps/ch/opt/python/3.7.9/gnu/9.1.0/lib/python3.7/multiprocessing/process.py", line 99, in run                     
    self._target(*self._args, **self._kwargs)
  File "/glade/u/home/wuh20/venv_Predictability/lib/python3.7/site-packages/radical/entk/execman/rp/task_manager.py", line 221, in _tmgr
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: 'dict' object has no attribute 'uid'

My radical-stack output shows the following:

(venv_Predictability) wuh20@cheyenne3:~/github/pv-workflow/03_MultiConfigSimulation> radical-stack 

  python               : /glade/u/home/wuh20/venv_Predictability/bin/python3
  pythonpath           : 
  version              : 3.7.9
  virtualenv           : /glade/u/home/wuh20/venv_Predictability

  radical.analytics    : 1.5.0
  radical.entk         : 1.6.5
  radical.gtod         : 1.5.0
  radical.pilot        : 1.6.0
  radical.saga         : 1.5.9
  radical.utils        : 1.5.12

Any ideas? Thanks in advance.

Weiming-Hu commented 3 years ago

I have updated radical.entk with pip install radical.entk --upgrade. Is that sufficient for the update? Thanks.
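One quick way to check whether a single-package upgrade left the stack in a mixed state is to group the installed radical.* packages by their major.minor version. The sketch below is only a heuristic: it assumes that matching major.minor versions are what matters, which this thread does not state, and the helper names (group_by_minor, installed_radical_versions) are made up for illustration. The version numbers are copied from the radical-stack output above. Note that importlib.metadata needs Python 3.8 or newer, so on the 3.7.9 environment shown here the importlib_metadata backport would be needed for the second helper.

```python
from collections import defaultdict


def group_by_minor(versions):
    """Group package names by the 'major.minor' prefix of their version."""
    groups = defaultdict(list)
    for name, version in versions.items():
        major_minor = ".".join(version.split(".")[:2])
        groups[major_minor].append(name)
    return {key: sorted(names) for key, names in groups.items()}


def installed_radical_versions():
    """Collect versions of installed radical.* distributions (Python 3.8+)."""
    from importlib import metadata

    found = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name and name.lower().startswith("radical."):
            found[name.lower()] = dist.version
    return found


# Versions copied from the radical-stack output reported above.
reported = {
    "radical.analytics": "1.5.0",
    "radical.entk": "1.6.5",
    "radical.gtod": "1.5.0",
    "radical.pilot": "1.6.0",
    "radical.saga": "1.5.9",
    "radical.utils": "1.5.12",
}

groups = group_by_minor(reported)
if len(groups) > 1:
    # More than one release series is installed side by side.
    print("mixed release series:", groups)
else:
    print("single release series:", groups)
```

To run the same check against the live environment instead of the copied numbers, pass installed_radical_versions() to group_by_minor().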

Weiming-Hu commented 3 years ago

Actually, once I upgraded all radical.* modules, I was able to get past that error. But I then encountered a different error.

Process Process-1:
Traceback (most recent call last):
  File "/glade/u/apps/ch/opt/python/3.7.9/gnu/9.1.0/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/glade/u/apps/ch/opt/python/3.7.9/gnu/9.1.0/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/glade/scratch/wuh20/tmp/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018816.0014/pilot.0000/rp_install/lib/python3.7/site-packages/radical/pilot/agent/scheduler/base.py", line 601, in _schedule_tasks
    r_wait, a = self._schedule_waitpool()
  File "/glade/scratch/wuh20/tmp/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018816.0014/pilot.0000/rp_install/lib/python3.7/site-packages/radical/pilot/agent/scheduler/base.py", line 679, in _schedule_waitpool
    log=self._log)
ValueError: too many values to unpack (expected 2)

My current stack is as follows:

(venv_Predictability) wuh20@cheyenne3:~/scratch/tmp/radical.pilot.sandbox/re.session.cheyenne3.wuh20.018816.0014/pilot.0000> radical-stack

  python               : /glade/u/home/wuh20/venv_Predictability/bin/python3
  pythonpath           :
  version              : 3.7.9
  virtualenv           : /glade/u/home/wuh20/venv_Predictability

  radical.analytics    : 1.6.7
  radical.entk         : 1.6.5
  radical.gtod         : 1.5.0
  radical.pilot        : 1.6.6
  radical.saga         : 1.6.6
  radical.utils        : 1.6.7

andre-merzky commented 3 years ago

Hi @Weiming-Hu - we pushed a round of releases for RU, RS and RP. Would you mind updating your stack once more and checking whether the error persists? Specifically, the last error (a mismatch on the return values from schedule_waitpool) should not happen on that stack, AFAICT. Thanks!
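For context, "too many values to unpack (expected 2)" is the generic Python symptom of a caller and a callee disagreeing on how many values are returned. The following is a minimal sketch of that error class only; the functions are hypothetical stand-ins and are not RADICAL-Pilot code.

```python
def schedule_two_values():
    """Hypothetical callee that returns two values."""
    return ["task.0000"], True


def schedule_three_values():
    """Hypothetical callee from a different version that returns three values."""
    return ["task.0000"], True, {"extra": "info"}


def caller(schedule):
    """Caller written against the two-value contract."""
    remaining, active = schedule()
    return remaining, active


def try_call(schedule):
    """Return 'ok' on success, or the ValueError message on a mismatch."""
    try:
        caller(schedule)
        return "ok"
    except ValueError as exc:
        return str(exc)


print(try_call(schedule_two_values))
print(try_call(schedule_three_values))
```

The first call succeeds; the second fails with the same message as in the traceback above, because the caller and the callee no longer agree on the return signature.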

Weiming-Hu commented 3 years ago

Thank you! I have upgraded to the following stack:

(venv_Predictability) wuh20@cheyenne1:~/github/pv-workflow/03_MultiConfigSimulation/re.session.cheyenne1.wuh20.018820.0002> radical-stack

  python               : /glade/u/home/wuh20/venv_Predictability/bin/python3
  pythonpath           :
  version              : 3.7.9
  virtualenv           : /glade/u/home/wuh20/venv_Predictability

  radical.analytics    : 1.6.7
  radical.entk         : 1.6.5
  radical.gtod         : 1.6.7
  radical.pilot        : 1.6.7
  radical.saga         : 1.6.8
  radical.utils        : 1.6.7

However, the process now seems to get stuck at the submitting step. Please see the log output below:

Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.cheyenne1.wuh20.018820.0002]                          \
database   : [mongodb://hpcw-psu:****@129.114.17.185:27017/hpcw-psu]          ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ncar.cheyenne_mpt       216 cores       0 gpus           ok
All components created
create task managerUpdate: pipeline.0000 state: SCHEDULING
Update: pipeline.0000.stage.0000 state: SCHEDULING
Update: pipeline.0000.stage.0000.Regime000Chunk000Analogs state: SCHEDULING
Update: pipeline.0000.stage.0000.Regime000Chunk000Analogs state: SCHEDULED
Update: pipeline.0000.stage.0000 state: SCHEDULED
MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: https://pymongo.readthedocs.io/en/stable/faq.html#is-pymongo-fork-safe
                                                           ok
submit: ########################################################################
Update: pipeline.0000.stage.0000.Regime000Chunk000Analogs state: SUBMITTING

Weiming-Hu commented 3 years ago

I have attached my client side log, re.session.cheyenne1.wuh20.018820.0002.tar.gz, and the server side log, re.session.cheyenne1.wuh20.018820.0002.server.tar.gz. I didn't find any errors though. I used grep Error *. Maybe you have a different way to spot any errors?
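One caveat about that search: grep Error * is case-sensitive and only looks at files directly in the current directory, so it would miss lines that say ERROR or error as well as anything in subdirectories such as a pilot sandbox. A broader scan could look like the sketch below. It assumes the archives have already been extracted, and the pattern list and the function name find_error_lines are illustrative choices rather than anything prescribed by the RADICAL tools.

```python
import re
from pathlib import Path

# Case-insensitive match on a few common failure markers.
PATTERN = re.compile(r"error|exception|traceback", re.IGNORECASE)


def find_error_lines(root):
    """Recursively scan files under root; return (path, line number, line) hits."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file is not valid text.
        with open(path, errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if PATTERN.search(line):
                    hits.append((str(path), lineno, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Scan the current directory, e.g. an extracted session folder.
    for path, lineno, line in find_error_lines("."):
        print(f"{path}:{lineno}: {line}")
```

The shell equivalent is roughly grep -rniE "error|exception|traceback" . run from the extracted session directory.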

mturilli commented 3 years ago

@Weiming-Hu can you confirm that this is fixed now?

Weiming-Hu commented 2 years ago

I'm sorry for not responding to this ticket earlier. This was fixed a long time ago, and the paper has already been submitted to Data in Brief, as discussed earlier by email. Please let me know if you have any questions. Closing...