radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

ENTK seems to break after Summit OS Update #150

Closed by wjlei1990 2 years ago

wjlei1990 commented 3 years ago

Hi,

After the Summit OS update this week, I always get this error after submitting jobs:

EnTK session: re.session.login3.lei.018866.0003
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.login3.lei.018866.0003]                               \
database   : [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr]            ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit             336 cores      12 gpus           ok
closing session re.session.login3.lei.018866.0003                              \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 176.7s                                                      ok
wait for 1 pilot(s)
              0                                                          timeout
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
    self._rmgr.submit_resource_request()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 199, in submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 558, in wait
    time.sleep(0.1)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "entk.hrlee.py", line 184, in main
    appman.run()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 462, in run
    self.terminate()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 507, in terminate
    write_session_description(self)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
    tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'

I re-installed Erlang (Erlang/OTP 25 [DEVELOPMENT] [erts-12.0.3]) and rabbitmq_server-3.9.4, because the old versions installed on Summit broke with the system update and no longer work.

Any thoughts on the failure? Is it an issue on my side or an EnTK issue?

(summit-entk) lei@login3 /gpfs/alpine/world-shared/geo111/lei/entk.small $ 
radical-stack

  python               : /ccs/home/lei/.conda/envs/summit-entk/bin/python3
  pythonpath           : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
  version              : 3.7.6
  virtualenv           : summit-entk

  radical.analytics    : 1.6.7
  radical.entk         : 1.6.7
  radical.gtod         : 1.6.7
  radical.pilot        : 1.6.7
  radical.saga         : 1.6.10
  radical.utils        : 1.6.7
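
For readers following the traceback above: the final AttributeError is raised in the cleanup path (terminate() calling write_session_description), not in the pilot wait that timed out. The sketch below reproduces that pattern with a hypothetical stand-in class, not EnTK's actual code. It assumes a component that is only created after the resource request succeeds, and shows how dereferencing it during cleanup hides the first failure, while a None check keeps the original error visible.

```python
class FakeAppManager:
    """Hypothetical stand-in for an application manager; not EnTK's real class."""

    def __init__(self):
        self._uid = "appmanager.0000"
        # Assumption for this sketch: this component is created only after
        # the resource request has succeeded.
        self._wfp = None

    def run(self):
        # Simulate the resource request failing before _wfp is created.
        raise TimeoutError("pilot did not become active")

    def describe_unguarded(self):
        # Mirrors the failing pattern: _wfp is None, so ._uid raises AttributeError.
        return {self._uid: {"children": [self._wfp._uid]}}

    def describe_guarded(self):
        # Only list components that were actually created.
        children = [c._uid for c in (self._wfp,) if c is not None]
        return {self._uid: {"children": children}}


def run_with_cleanup(amgr, describe):
    """Run, always attempt the description step, and return the name of the error that surfaces."""
    try:
        try:
            amgr.run()
        finally:
            describe()
    except Exception as exc:
        return type(exc).__name__


amgr = FakeAppManager()
print(run_with_cleanup(amgr, amgr.describe_unguarded))  # AttributeError
print(run_with_cleanup(amgr, amgr.describe_guarded))    # TimeoutError
```

In the unguarded case the reported error says nothing about the pilot, which matches the symptom in this thread: the message about '_uid' is a secondary failure, and the pilot timeout is the part worth investigating.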

wjlei1990 commented 3 years ago

A new error occurred!

EnTK session: re.session.login3.lei.018871.0000                                                                                                                                                                                  
Creating AppManagerSetting up RabbitMQ system                                 ok                                                                                                                                                 
                                                                              ok                                                                                                                                                 
Validating and assigning resource manager                                     ok                                                                                                                                                 
Setting up RabbitMQ system                                                   n/a                                                                                                                                                 
new session: [re.session.login3.lei.018871.0000]                               \                                                                                                                                                 
database   : [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr]           err                                                                                                                                                 
Execution failed, error: 'NoneType' object has no attribute '_uid'                                                                                                                                                               
Traceback (most recent call last):                                                                                                                                                                                               
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 183, in _initialize_primary                                                                                            
    cfg=self._cfg, log=self._log)                                                                                                                                                                                                
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/db/database.py", line 49, in __init__
    self._mongo, self._db, _, _, _ = ru.mongodb_connect(str(dburl))                                              
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/utils/misc.py", line 135, in mongodb_connect
    db.authenticate(user, pwd)                                                                                   
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/pymongo/database.py", line 1471, in authenticate
    connect=True)                                                                                                
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/pymongo/mongo_client.py", line 750, in _cache_credentials
    writable_preferred_server_selector)                                                                          
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/pymongo/topology.py", line 235, in select_server
    address))                                                                                                    
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/pymongo/topology.py", line 193, in select_servers
    selector, server_timeout, address)                                                                           
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/pymongo/topology.py", line 209, in _select_servers_loop
    self._error_message(selector))                                                                               
pymongo.errors.ServerSelectionTimeoutError: 129.114.17.185:27017: [Errno 111] Connection refused                 

The above exception was the direct cause of the following exception:                                                                                                                                                             

Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 157, in submit_resource_request
    self._session = rp.Session(uid=self._sid)                                                                    
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 153, in __init__
    self._initialize_primary(dburl)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/session.py", line 198, in _initialize_primary
    dburl_no_passwd) from e                             
RuntimeError: session create failed [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr]                                                                                                                                        

The above exception was the direct cause of the following exception:                                                                                                                                                             

Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
    self._rmgr.submit_resource_request()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 216, in submit_resource_request
    raise EnTKError(ex) from ex
radical.entk.exceptions.EnTKError: session create failed [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr]                                                                                                                   

During handling of the above exception, another exception occurred:    

Does this mean the MongoDB server is down?

database   : [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr]           err                   
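
A "Connection refused" error like the one above can be checked independently of EnTK by testing whether the database host accepts TCP connections at all. The following is a minimal sketch using only the Python standard library; the helper name can_connect is hypothetical, and the check says nothing about authentication, only about reachability.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Calling can_connect("129.114.17.185", 27017) from a login node, with the host and port taken from the database URL in the log, would distinguish an unreachable server from a problem in the client stack.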

mturilli commented 3 years ago

Hi @wjlei1990, unfortunately 129.114.17.185 is temporarily offline. I am working on bringing it back up ASAP. Meanwhile, could you try to use our services deployed at ORNL? @lee212, could you give Wenjie the details of the endpoints and how to use them?

lee212 commented 3 years ago

@wjlei1990, I sent the information over Slack. Let me know if you didn't receive the notification or have an issue with it.

wjlei1990 commented 3 years ago

@lee212 @mturilli

Thanks for the MongoDB update. This part now works.

However, EnTK still has some issues running the job on Summit:

EnTK session: re.session.login5.lei.018874.0003
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.login5.lei.018874.0003]                               \
database   : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test]     ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit             336 cores      12 gpus           ok
closing session re.session.login5.lei.018874.0003                              \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 156.5s                                                      ok
wait for 1 pilot(s)
              0                                                          timeout
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
    self._rmgr.submit_resource_request()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 199, in submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 558, in wait
    time.sleep(0.1)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "entk.hrlee.py", line 187, in main
    appman.run()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 462, in run
    self.terminate()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 507, in terminate
    write_session_description(self)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
    tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'

This is what I got from the EnTK sandbox:

ls re.session.login5.lei.018874.0003/pilot.0000/
agent.0.cfg  bootstrap_0.err  bootstrap_0.out  bootstrap_0.sh  deactivate  env.orig

The job seems to be running in the queue on Summit. However, EnTK seems to fail when launching tasks.

lee212 commented 2 years ago

Update: a release has been made, so the PyPI installation now includes this fix. You can get RP with:

pip install --upgrade radical.pilot

This issue has been addressed and merged to the devel branch, and the fix will be included in the next release. In the meantime, would you be able to install radical.pilot from the GitHub repo?

To install the devel branch, first remove your current installation and then install from the git repo:

pip uninstall radical.pilot
pip install git+https://github.com/radical-cybertools/radical.pilot@devel

The specific PR that was merged to devel: https://github.com/radical-cybertools/radical.pilot/pull/2439
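
After either installation route, it can help to confirm which versions are actually active in the environment. The sketch below is one way to do that from Python; it assumes Python 3.8 or newer, where importlib.metadata is in the standard library, and the helper name installed_version is hypothetical. On the Python 3.7 environment shown earlier in this thread, the radical-stack command reports the same information.

```python
from importlib import metadata  # standard library from Python 3.8 onward


def installed_version(package):
    """Return the installed version string of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("radical.pilot", "radical.entk", "radical.utils"):
    print(name, installed_version(name))
```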

wjlei1990 commented 2 years ago

After testing it on Summit, the task directory is now generated from pilot.0000. However, tasks are still not running properly. In the task.0000.err file, I found this:

Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument

This is my script for creating a task:

from radical.entk import Task

def create_task(task_dir):

    # mpi_per_task is a module-level variable not shown in this excerpt.
    t1 = Task()
    t1.pre_exec = [
        'cd {}'.format(task_dir),
        'unset CUDA_VISIBLE_DEVICES',
        'export OMP_NUM_THREADS=1'
    ]
    t1.executable = './bin/xspecfem3D'

    # CPU requirements: MPI ranks per task, each with 4 OpenMP threads.
    t1.cpu_reqs = {
        'cpu_processes': mpi_per_task,
        'cpu_process_type': 'MPI',
        'cpu_threads': 4,
        'cpu_thread_type': 'OpenMP'}

    # GPU requirements: one GPU process per rank.
    t1.gpu_reqs = {
        'gpu_processes': 1,
        'gpu_process_type': None,
        'gpu_threads': 1,
        'gpu_thread_type': 'CUDA'}

    return t1

Is anything wrong with the task config?

andre-merzky commented 2 years ago

Hi @wjlei1990 - another user informed us that the ERF format is broken on Summit at the moment. Tickets have been opened with IBM, but I am afraid we have no ETA for a fix so far.

wjlei1990 commented 2 years ago

@andre-merzky thanks for the update!

wjlei1990 commented 2 years ago

My entk script is located here:

/gpfs/alpine/world-shared/geo111/lei/entk.small/entk.hrlee.py

lee212 commented 2 years ago

@wjlei1990, please try to use or replicate this script; it should let you avoid the ERF error:

/gpfs/alpine/world-shared/geo111/hrlee/entk.hrlee.py

It uses jsrun arguments to specify resource sets as a temporary fix. We will eventually try PMIx/PRRTE.
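
The referenced script is not reproduced in this thread, so the following is only a hypothetical sketch of what describing resource sets through jsrun command-line flags, rather than an ERF file, can look like. The helper name jsrun_command and all numeric values are illustrative assumptions; only the flags -n, -a, -c, and -g are jsrun's own options.

```python
def jsrun_command(executable, n_resource_sets, tasks_per_rs=1, cpus_per_rs=1, gpus_per_rs=0):
    """Build a jsrun argument list that describes resource sets with flags, not an ERF file."""
    return [
        "jsrun",
        "-n", str(n_resource_sets),  # number of resource sets
        "-a", str(tasks_per_rs),     # MPI tasks per resource set
        "-c", str(cpus_per_rs),      # CPU cores per resource set
        "-g", str(gpus_per_rs),      # GPUs per resource set
        executable,
    ]


# Illustrative values only: 6 resource sets, each with 1 MPI rank, 4 cores, and 1 GPU.
cmd = jsrun_command("./bin/xspecfem3D", n_resource_sets=6,
                    tasks_per_rs=1, cpus_per_rs=4, gpus_per_rs=1)
print(" ".join(cmd))
```

How such a command line is wired into an EnTK task in the actual script is not shown here and is left out of the sketch.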

wjlei1990 commented 2 years ago

Thanks~

Issue resolved! Ticket can be closed in today's meeting.