radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

AnEn MPI with EnTK resource/pipeline specification details #115

Closed Weiming-Hu closed 3 years ago

Weiming-Hu commented 4 years ago

Hi,

I'm trying to run my updated, MPI-enabled AnEn executable through EnTK.

First of all, do you already have some demos/examples of related use cases that I can refer to?

I have coded up my own trial version of the resource and pipeline specification, but it seems to hang during or after the task submission stage (I'm not sure exactly where).

Please see the following standard output.

EnTK session: re.session.geogadmins-Air.wuh20.018340.0000
Creating AppManagerSetting up RabbitMQ system                                 ok                                                                                                                                                                                                                                              
                                                                              ok                                                                                                                                                                                                                                              
Validating and assigning resource manager                                     ok                                                                                                                                                                                                                                              
Adding task 1: task-anen-gen-00000                                                                                                                                                                                                                                                                                            
Adding task 2: task-anen-gen-00001                                                                                                                                                                                                                                                                                            
Adding task 3: task-anen-gen-00002                                                                                                                                                                                                                                                                                            
Adding task 4: task-anen-gen-00003                                                                                                                                                                                                                                                                                            
Adding task 5: task-anen-gen-00004                                                                                                                                                                                                                                                                                            
Adding task 6: task-anen-gen-00005                                                                                                                                                                                                                                                                                            
Adding stage stage-analogs.                                                                                                                                                                                                                                                                                                   
Setting up RabbitMQ system                                                   n/a                                                                                                                                                                                                                                              
new session: [re.session.geogadmins-Air.wuh20.018340.0000]                     \                                                                                                                                                                                                                                              
database   : [mongodb://hpcworkflows:hpcw0rkf70w@two.radical-project.org:27017/hpcworkflows]                                                                                                                                                                                                                                  
        ok                                                                     
create pilot manager                                                          ok                                                                               
submit 1 pilot(s)                                                              
        [xsede.stampede2_srun:136]                                             
                                                                              ok                                                                               
All components created                                                         
create unit managerUpdate: pipeline.0000 state: SCHEDULING                     
Update: pipeline.0000.stage-analogs state: SCHEDULING                          
Update: pipeline.0000.stage-analogs.task-anen-gen-00002 state: SCHEDULING                                                                                      
Update: pipeline.0000.stage-analogs.task-anen-gen-00005 state: SCHEDULING                                                                                      
Update: pipeline.0000.stage-analogs.task-anen-gen-00004 state: SCHEDULING                                                                                      
Update: pipeline.0000.stage-analogs.task-anen-gen-00003 state: SCHEDULING                                                                                      
Update: pipeline.0000.stage-analogs.task-anen-gen-00001 state: SCHEDULING                                                                                      
Update: pipeline.0000.stage-analogs.task-anen-gen-00000 state: SCHEDULING                                                                                      
Update: pipeline.0000.stage-analogs.task-anen-gen-00002 state: SCHEDULED                                                                                       
Update: pipeline.0000.stage-analogs.task-anen-gen-00005 state: SCHEDULED                                                                                       
Update: pipeline.0000.stage-analogs.task-anen-gen-00004 state: SCHEDULED                                                                                       
Update: pipeline.0000.stage-analogs.task-anen-gen-00003 state: SCHEDULED                                                                                       
Update: pipeline.0000.stage-analogs.task-anen-gen-00001 state: SCHEDULED                                                                                       
Update: pipeline.0000.stage-analogs.task-anen-gen-00000 state: SCHEDULED                                                                                       
Update: pipeline.0000.stage-analogs state: SCHEDULED                           
/Users/wuh20/venv/lib/python3.7/site-packages/pymongo/topology.py:155: UserWarning: MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
  "MongoClient opened before fork. Create MongoClient only "                   
                                                           ok                  
Update: pipeline.0000.stage-analogs.task-anen-gen-00002 state: SUBMITTING                                                                                      
Update: pipeline.0000.stage-analogs.task-anen-gen-00005 state: SUBMITTING                                                                                      
Update: pipeline.0000.stage-analogs.task-anen-gen-00004 state: SUBMITTING                                                                                      
Update: pipeline.0000.stage-analogs.task-anen-gen-00003 state: SUBMITTING                                                                                      
Update: pipeline.0000.stage-analogs.task-anen-gen-00001 state: SUBMITTING                                                                                      
Update: pipeline.0000.stage-analogs.task-anen-gen-00000 state: SUBMITTING                                                                                      
submit: ########################################################################

At this point, the unit folders already exist. But inside the folder I only find the configuration file; the other scripts are missing.

login2.stampede2(1009)$ cd re.session.geogadmins-Air.wuh20.018340.0000/pilot.0000/unit.000000/
login2.stampede2(1010)$ ls
anen_shared_config.cfg

I have attached the client box and sandboxes for further reference.

Thank you

client-box_re.session.geogadmins-Air.wuh20.018340.0000.tar.gz

sandbox_re.session.geogadmins-Air.wuh20.018340.0000.tar.gz

lee212 commented 4 years ago

How many CPUs (nodes) does the job require, and how many does each task need?

Looking at workflow_cfg.yml (the per-task CPU requirement) and resource_cfg_stampede.yml (the job CPU requirement), the numbers are identical. If you have n concurrent tasks (you seem to have 6), the total CPU count you request needs to be n times the per-task requirement. For example, if each task uses 68 cores (what a KNL node provides), resource_cfg_stampede.yml needs to request 408 = 68 * 6 CPUs. This will ask for 2 nodes on Stampede2, since 272 processes (68 cores x 4 hardware threads) are available on a single node.
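
The sizing rule above can be sketched in a few lines of Python. The function name `pilot_request` and its parameters are hypothetical (this is not an EnTK call), and the 272 CPUs per node is the Stampede2 KNL figure quoted above; adjust it for other machines.

```python
import math


def pilot_request(n_tasks, cpus_per_task, cpus_per_node):
    """Return (total_cpus, nodes) needed to run all tasks concurrently.

    Illustrative helper only; not part of EnTK or RADICAL-Pilot.
    """
    total_cpus = n_tasks * cpus_per_task
    # A partially used node still has to be allocated, so round up.
    nodes = math.ceil(total_cpus / cpus_per_node)
    return total_cpus, nodes


# Six concurrent tasks at 68 CPUs each, 272 schedulable CPUs per node.
total, nodes = pilot_request(n_tasks=6, cpus_per_task=68, cpus_per_node=272)
print(total, nodes)
```

The first value is what would go into the cpus field of the resource file; the second is only a sanity check on how many nodes the scheduler is likely to allocate.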

Weiming-Hu commented 4 years ago

Here is an overview of my task.

I hope this is helpful. Thank you

mturilli commented 4 years ago

Blocked #116

mturilli commented 4 years ago

This is now unblocked.

Weiming-Hu commented 3 years ago

Tests with the default branch of EnTK show different errors on different platforms.

  1. Test on Cheyenne head node: Could not resolve hostname

[Screenshot: cheyenne_head_node_error]

  2. Test on local desktop: A call-out command failed. In the re.session log files, I found the following error:
pmgr_launching.0000.log:RuntimeError: callout failed: cd /tmp/rp_agent_tar_dirnxxq8rr6 && tar zchf /tmp/rp_agent_tar_dirnxxq8rr6/re.session.sapphire.geog.psu.edu.wuh20.018516.0001.pmgr_launching.0000.tgz *
  3. Test on local laptop: Below are the messages printed by EnTK:
(venv) wuh20@geogadmins-MacBook-Air use-entk % python python_me.py --workflow workflow.yaml --resource resource.yaml
EnTK session: re.session.geogadmins-MacBook-Air.local.wuh20.018516.0002
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.geogadmins-MacBook-Air.local.wuh20.018516.0002]       \
database   : [mongodb://hpcw-psu:F9gT2XU2X8oLacxg@129.114.17.185:27017/hpcw-psu]
        ok
create pilot manager                                                          ok
submit 1 pilot(s)
        Traceback (most recent call last):
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 416, in run
    self._rmgr._submit_resource_request()
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 171, in _submit_resource_request
    self._pilot = self._pmgr.submit_pilots(pdesc)
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/pilot/pilot_manager.py", line 565, in submit_pilots
    pilot = ComputePilot(pmgr=self, descr=pd)
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 100, in __init__
    self._resource_sandbox = self._session._get_resource_sandbox(pilot)
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/pilot/session.py", line 772, in _get_resource_sandbox
    shell = self.get_js_shell(resource, schema)
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/pilot/session.py", line 818, in get_js_shell
    shell = rsup.PTYShell(js_url, self)
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/saga/utils/pty_shell.py", line 245, in __init__
    interactive=self.interactive)
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 207, in initialize
    self._initialize_pty(info['pty'], info)
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 428, in _initialize_pty
    raise ptye.translate_exception (e)
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py", line 301, in _initialize_pty
    raise rse.NoSuccess("Could not detect shell prompt (timeout)")
radical.saga.exceptions.NoSuccess: Could not detect shell prompt (timeout) (/Users/wuh20/venv/lib/python3.7/site-packages/radical/saga/utils/pty_shell_factory.py +301 (_initialize_pty)  :  raise rse.NoSuccess("Could not detect shell prompt (timeout)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python_me.py", line 113, in <module>
    amgr.run()
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 446, in run
    self.terminate()
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 485, in terminate
    write_session_description(self)
  File "/Users/wuh20/venv/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 145, in write_session_description
    tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'

It is complaining about the shell connection to Cheyenne, but when I ssh to Cheyenne from the console, it works:

(venv) wuh20@geogadmins-MacBook-Air re.session.geogadmins-MacBook-Air.local.wuh20.018516.0002 % ssh cheyenne.ucar.edu
Last login: Fri Sep 11 08:58:42 2020 from 68.232.113.234

******************************************************************************
*                 Welcome to Cheyenne - September 11, 2020
******************************************************************************
                 Today in the Daily Bulletin (dailyb.cisl.ucar.edu)

    - How to avoid failures when copying HPSS data
    - Duo support for older model phones ending on December 1

Quick Start:          www2.cisl.ucar.edu/resources/cheyenne/quick-start-cheyenne
User environment:     www2.cisl.ucar.edu/resources/cheyenne/user-environment
Key module commands:  module list, module avail, module spider, module help
CISL Help:            support.ucar.edu -- 303-497-2400
--------------------------------------------------------------------------------

wuh20@cheyenne3:~> 

I hope this is helpful. Thank you.

Weiming-Hu commented 3 years ago

Hi folks, unfortunately, even when I set schema in the resource file, I still had some issues.

(venv) wuh20@cheyenne2:~/github/pv-workflow/02_simulate/use-entk> python python_me.py --workflow workflow.yaml --resource resource.yaml 
EnTK session: re.session.cheyenne2.wuh20.018530.0002
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.cheyenne2.wuh20.018530.0002]                          \
database   : [mongodb://hpcw-psu:****@129.114.17.185:27017/hpcw-psu]          ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ncar.cheyenne           204 cores       0 gpus           ok
closing session re.session.cheyenne2.wuh20.018530.0002                         \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 10.7s                                                       ok
wait for 1 pilot(s)
              0                                                          timeout
All components terminated
Traceback (most recent call last):
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 179, in _submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 536, in wait
    time.sleep(0.1)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 417, in run
    self._rmgr._submit_resource_request()
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 192, in _submit_resource_request
    raise KeyboardInterrupt
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "python_me.py", line 113, in <module>
    amgr.run()
  File "/glade/u/home/wuh20/venv/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 442, in run
    raise KeyboardInterrupt
KeyboardInterrupt

I grepped the following errors from the log files:

(venv) wuh20@cheyenne2:~/github/pv-workflow/02_simulate/use-entk> grep -R ERROR re.session.cheyenne2.wuh20.018530.0002/
re.session.cheyenne2.wuh20.018530.0002/pmgr_launching.0000.log:1601046473.000 : pmgr_launching.0000  : 51984 : 140196924974848 : ERROR    : bulk launch failed
re.session.cheyenne2.wuh20.018530.0002/radical.entk.resource_manager.0000.log:1601046474.919 : radical.entk.resource_manager.0000 : 51494 : 140256784967424 : ERROR    : Execution interrupted (probably by Ctrl+C) exit callback thread gracefully...
re.session.cheyenne2.wuh20.018530.0002/radical.entk.appmanager.0000.log:1601046474.919 : radical.entk.appmanager.0000 : 51494 : 140256784967424 : ERROR    : Execution interrupted by user (you probably hit Ctrl+C), trying to cancel enqueuer thread gracefully...
re.session.cheyenne2.wuh20.018530.0002/pmgr.0000.log:1601046473.002 : pmgr.0000            : 51494 : 140243634222848 : ERROR    : [Callback]: pilot 'pilot.0000' failed (exit)
re.session.cheyenne2.wuh20.018530.0002/pmgr.0000.log:1601046473.003 : pmgr.0000            : 51494 : 140243634222848 : ERROR    : listener died
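
For longer sessions, the same search can be done in Python and grouped by component. This is a minimal sketch that assumes each log line follows the `timestamp : component : pid : thread : LEVEL : message` layout seen in the grep output above; the names `parse_line` and `collect_errors` are hypothetical and not part of the RADICAL tools.

```python
from collections import defaultdict
from pathlib import Path


def parse_line(line):
    """Split one log line into (component, level, message), or None.

    Assumes six fields separated by ' : ', as in the grep output above.
    """
    parts = [p.strip() for p in line.split(" : ", 5)]
    if len(parts) != 6:
        return None
    _timestamp, component, _pid, _thread, level, message = parts
    return component, level, message


def collect_errors(session_dir):
    """Map component name -> list of ERROR messages found in *.log files."""
    found = defaultdict(list)
    for log_file in sorted(Path(session_dir).rglob("*.log")):
        text = log_file.read_text(errors="replace")
        for line in text.splitlines():
            parsed = parse_line(line)
            if parsed and parsed[1] == "ERROR":
                found[parsed[0]].append(parsed[2])
    return dict(found)
```

Calling `collect_errors("re.session.cheyenne2.wuh20.018530.0002")` from the directory that holds the session folder would return a dictionary keyed by component, which makes it easier to see which component reported a failure first.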

This is what my resource file looks like:

(venv) wuh20@cheyenne2:~/github/pv-workflow/02_simulate/use-entk> cat resource.yaml 
# Resource configuration on NCAR Cheyenne

rabbitmq:
  hostname: '129.114.17.185'
  port: 5672
  username: 'hpcw-psu'
  password: [not shown here]

resource-desc:
  name: 'ncar.cheyenne'
  walltime: 300
  cpus: 204
  gpus: 0
  queue: 'regular'
  project: 'URTG0014'
  schema: 'local'

Weiming-Hu commented 3 years ago

This is my stack info:

(venv) wuh20@cheyenne2:~/github/pv-workflow/02_simulate/use-entk> radical-stack 

  python               : 3.7.5
  pythonpath           : 
  virtualenv           : /glade/u/home/wuh20/venv

  radical.entk         : 1.5.1
  radical.gtod         : 1.5.0
  radical.pilot        : 1.5.2
  radical.saga         : 1.5.2
  radical.utils        : 1.5.3

The session folder would be /glade/u/home/wuh20/github/pv-workflow/02_simulate/use-entk/re.session.cheyenne2.wuh20.018530.0002.

The sandbox is at /glade/scratch/wuh20/tmp/rp_agent_tar_dirr21yxevo.

Weiming-Hu commented 3 years ago

Documented here