radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

some entk jobs errored out (task not running) #152

Open wjlei1990 opened 2 years ago

wjlei1990 commented 2 years ago

Hi Entk team,

Recently (over the last two weeks), EnTK has errored out on me multiple times:

EnTK session: re.session.login1.lei.019081.0000                                                                             
Creating AppManagerSetting up RabbitMQ system                                 ok                                            
                                                                              ok                                            
Validating and assigning resource manager                                     ok                                            
Setting up RabbitMQ system                                                   n/a                                            
new session: [re.session.login1.lei.019081.0000]                               \                                            
database   : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test]     ok                                            
create pilot manager                                                          ok                                            
submit 1 pilot(s)                                                                                                                                                                                                                                      
        pilot.0000   ornl.summit           86016 cores    3072 gpus           ok                                            
closing session re.session.login1.lei.019081.0000                              \                                                                                                                                                                       
close pilot manager                                                            \                                                                                                                                                                       
wait for 1 pilot(s)                                           
              0                                                               ok                                                                                                                                                                       
                                                                              ok                                            
session lifetime: 61431.8s                                                    ok                                                                                                                                                                       
wait for 1 pilot(s)                                                                                                                                                                                                                                    
              0                                                          timeout                                            
Execution failed, error: 'NoneType' object has no attribute '_uid'                                                                                                                                                                                     
Traceback (most recent call last):                                                                                                                                                                                                                     
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 443, in run
    self._rmgr.submit_resource_request()                     
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 199, in submit_resource_request                                                                                           
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])                                                     
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 558, in wait                                                                                                                                   
    time.sleep(0.1)                                          
KeyboardInterrupt                                            

During handling of the above exception, another exception occurred:                                                        

Traceback (most recent call last):                           
  File "entk.hrlee.py", line 190, in main                    
    appman.run()                                             
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 467, in run
    self.terminate()                                         
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 512, in terminate
    write_session_description(self)                          
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
    tree[amgr._uid]['children'].append(wfp._uid)             
AttributeError: 'NoneType' object has no attribute '_uid'                                                                  

Do you have any idea what caused this issue?

andre-merzky commented 2 years ago

Hi @wjlei1990 - Alas I don't have a quick answer based on that message as that can have multiple causes. Could you please attach the pilot sandbox to this ticket? On what machine is that run executed? Thank you!

wjlei1990 commented 2 years ago

entk_log.zip

Hi @andre-merzky - The job is running on Summit. I attached both the client and sandbox log in the zip file. Let's talk about it in today's meeting.

wjlei1990 commented 2 years ago

Here is the radical-stack:

radical-stack 
WARNING: You are using radical.entk version 1.8.0, however version 1.9.0 is available.

  python               : /ccs/home/lei/.conda/envs/summit-entk/bin/python3
  pythonpath        : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
  version              : 3.7.6
  virtualenv          : summit-entk

  radical.analytics  : 1.6.7
  radical.entk         : 1.8.0
  radical.gtod         : 1.6.7
  radical.pilot         : 1.8.0
  radical.saga        : 1.8.0
  radical.utils         : 1.8.2

This version worked on Summit about a month ago. That's how I generated the previous benchmark test results.

wjlei1990 commented 2 years ago

@andre-merzky I removed the Python-related content in the sandbox. It seems the same issue has occurred again:

EnTK session: re.session.login3.lei.019087.0000
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.login3.lei.019087.0000]                               \
database   : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test]     ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit           10752 cores     384 gpus           ok

wait for 1 pilot(s)
              0                                                               ok
closing session re.session.login3.lei.019087.0000                              \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
+ re.session.login3.lei.019087.0000 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 919.2s                                                      ok
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 448, in run
    'ended in state %s' % res_alloc_state)
radical.entk.exceptions.EnTKError: Cannot proceed. Resource ended in state DONE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "entk.hrlee.py", line 190, in main
    appman.run()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 473, in run
    self.terminate()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 512, in terminate
    write_session_description(self)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
    tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'

Here is the sandbox output. re.session.login3.lei.019087.0000.zip
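
Both tracebacks above end in the same AttributeError, but each contains the line "During handling of the above exception, another exception occurred", which means the AttributeError was raised during cleanup and the first exception is the one to look at. The following is a toy sketch of that pattern in plain Python, not EnTK code: the `Manager` class and its `wfp` attribute are hypothetical stand-ins, and `first_error` is an illustrative helper that walks the exception chain back to the original failure.

```python
class Manager:
    """Hypothetical stand-in for an application manager (not the EnTK class)."""

    def __init__(self):
        self.wfp = None  # component that was never created because startup failed

    def run(self):
        try:
            # The original failure, e.g. the pilot ending before any task ran.
            raise RuntimeError('Cannot proceed. Resource ended in state DONE')
        except Exception:
            self.terminate()  # cleanup fails and hides the error above
            raise

    def terminate(self):
        # Raises AttributeError: 'NoneType' object has no attribute '_uid'
        return self.wfp._uid


def first_error(exc):
    """Follow __context__ links back to the exception that started the chain."""
    while exc.__context__ is not None:
        exc = exc.__context__
    return exc


try:
    Manager().run()
except AttributeError as err:
    root = first_error(err)
    print('reported:', type(err).__name__)
    print('original:', type(root).__name__, '-', root)
```

Read this way, the first log failed on a KeyboardInterrupt after the pilot wait timed out, and the second on "Resource ended in state DONE"; the `_uid` message is a side effect in both.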

andre-merzky commented 2 years ago

Well, it actually kind of worked. Alas, the version you are using pulls in the latest pymongo by default. After that stack (1.8) was released, pymongo saw an API update which broke backward compatibility. We now specify pymongo<4 as a dependency. Please remove the virtualenv once more, update your local stack (all versions should be 1.13 or later from PyPI), and try again.

Thanks, Andre.
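
A quick way to see which side of the `pymongo<4` constraint an environment is on is to look at the installed version string. This is a minimal sketch: `pymongo_is_pre_4` is a hypothetical helper written for this thread, and the printed wording is illustrative only.

```python
def pymongo_is_pre_4(version):
    """Return True if a PyMongo version string such as '3.12.3' is below 4."""
    major = int(version.split('.')[0])
    return major < 4


if __name__ == '__main__':
    try:
        import pymongo
    except ImportError:
        print('pymongo is not installed in this environment')
    else:
        ok = pymongo_is_pre_4(pymongo.version)
        status = 'satisfies pymongo<4' if ok else 'is 4 or newer'
        print('pymongo', pymongo.version, status)
```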

wjlei1990 commented 2 years ago

Super! Thanks @andre-merzky. Let me try it and I will update you shortly!

Here is the updated stack on Summit:

radical-stack 
1649179225.722 : radical.analytics    : 3798474 : 140736356307968 : INFO     : radical.analytics    version: 1.13.0

  python               : /ccs/home/lei/.conda/envs/summit-entk/bin/python3
  pythonpath           : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
  version              : 3.7.6
  virtualenv           : summit-entk

  radical.analytics    : 1.13.0
  radical.entk         : 1.13.0
  radical.gtod         : 1.13.0
  radical.pilot        : 1.13.0
  radical.saga         : 1.13.0
  radical.utils        : 1.13.0

wjlei1990 commented 2 years ago

@andre-merzky It seems to work up to a certain point, but it is not fully functional yet. Tasks got scheduled in EnTK but failed to execute.

EnTK session: re.session.login3.lei.019087.0001                                                                                                                                                                                                               
Creating AppManagerSetting up RabbitMQ system                                 ok                                                                                                                                                                              
                                                                              ok                                                                                                                                                                              
Validating and assigning resource manager                                     ok                                                                                                                                                                              
Setting up RabbitMQ system                                                   n/a                                                                                                                                                                              
new session: [re.session.login3.lei.019087.0001]                               \                                                                                                                                                                              
database   : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test]     ok                                                                                                                                                                              
create pilot manager                                                          ok                                                                                                                                                                              
submit 1 pilot(s)                                                                                                                                                                                                                                             
        pilot.0000   ornl.summit           10752 cores     384 gpus           ok                                                                                                                                                                              
All components created                                                                                                                                                                                                                                        
create task managerUpdate: pipeline.0000 state: SCHEDULING                                                                                                                                                                                                    
Update: pipeline.0000.stage.0000 state: SCHEDULING                                                                                                                                                                                                            
Update: pipeline.0000.stage.0000.task.0000 state: SCHEDULING                                                                                                                                                                                                  
Update: pipeline.0000.stage.0000.task.0001 state: SCHEDULING                                                                                                                                                                                                  
Update: pipeline.0000.stage.0000.task.0000 state: SCHEDULED                                                                                                                                                                                                   
Update: pipeline.0000.stage.0000.task.0001 state: SCHEDULED                                                                                                                                                                                                   
Update: pipeline.0000.stage.0000 state: SCHEDULED                                                                                                                                                                                                             
MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe                                                                           
                                                           ok                                                                                                                                                                                                 
submit: ########################################################################                                                                                                                                                                              
Update: pipeline.0000.stage.0000.task.0000 state: SUBMITTING                                                                                                                                                                                                  
Update: pipeline.0000.stage.0000.task.0001 state: SUBMITTING     
Update: pipeline.0000.stage.0000.task.0000 state: EXECUTED       
Update: pipeline.0000.stage.0000.task.0000 state: FAILED         
Update: pipeline.0000.stage.0000.task.0001 state: FAILED                                                                           
Update: pipeline.0000.stage.0000 state: DONE                     
Update: pipeline.0000 state: DONE                                                                                                  
close task manager                                                            ok                                                   
wait for 1 pilot(s)                                                                                                                
              0                                                               ok                                                   
closing session re.session.login3.lei.019087.0001                              \                                                   
close pilot manager                                                            \                                                   
wait for 1 pilot(s)                                                                                                                
              0                                                               ok                                                   
                                                                              ok
+ re.session.login3.lei.019087.0001 (json)                       
+ pilot.0000 (profiles)                                                                                                            
+ pilot.0000 (logfiles)                                                                                                            
session lifetime: 829.8s                                                      ok                                     
All components terminated                                                             

This is the error I got from task.0000.out:

cat task.0000.out
XALT Error: unable to find jsrun

sandbox log file: re.session.login3.lei.019087.0001.zip

In the create_task function, I am calling jsrun directly:

def create_task(task_dir):                                                      

    t1 = Task()                                                                 
    t1.pre_exec = [                                                             
        'cd {}'.format(task_dir),                                               
        'unset CUDA_VISIBLE_DEVICES',                                           
        #'export CUDA_VISIBLE_DEVICES=0',                                       
        "export OMP_NUM_THREADS=1",                                             
        'module load gcc/9.3.0',                                                
        'module load spectrum-mpi',                                             
        'module load hdf5/1.10.7',                                              
        'module load adios/1.13.1',                                             
        'module load sz/2.0.2.0',                                               
        'module load zlib',                                                     
        'module load zfp',                                                      
        'module load c-blosc',                                                  
    ]                                                                           
    t1.executable = './bin/xspecfem3D'                                          
    t1.executable = "cat /dev/null; jsrun --bind rs -n{} -p{} -r6 -g1 -c1 ".format(mpi_per_task, mpi_per_task) + t1.executable

    t1.cpu_reqs = {                                                             
        'cpu_processes': mpi_per_task,                                          
        'cpu_process_type': 'MPI',                                              
        'cpu_threads': 4,                                                       
        'cpu_thread_type': 'OpenMP'}                                            

    print("mpi_per_task: ", mpi_per_task)                                       

    t1.gpu_reqs = {                                                             
        'gpu_processes': 1,                                                     
        'gpu_process_type': None,                                               
        'gpu_threads': 1,                                                       
        'gpu_thread_type': 'CUDA'}                                              

    return t1

andre-merzky commented 2 years ago

t1.executable = "cat /dev/null; jsrun --bind rs -n{} -p{} -r6 -g1 -c1 ".format(mpi_per_task, mpi_per_task) + t1.executable

I am afraid that won't work - you should only need to specify xspecfem3D as the executable (as in the line before it) along with the core and gpu requirements, and RP will make sure that jsrun is called with the correct parameters. The way you use it now basically means we run

jsrun <our args> jsrun <your args> xspecfem3D <specfem args>

which is bound to fail (jsrun is not available on the compute nodes).

Also, btw, please use

t1.sandbox = task_dir

instead of the

cd <task_dir>

in your pre_exec.
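
Put together, the two suggestions above (plain executable, `sandbox` instead of `cd`) could look roughly like the sketch below. It is an illustration, not a tested fix: `task_settings` is a hypothetical helper that collects the attributes in a plain dict so they can be inspected without EnTK installed, `mpi_per_task` is passed in as an argument instead of being read from a global, and all values are copied from the `create_task` function posted earlier.

```python
def task_settings(task_dir, mpi_per_task):
    """Collect Task attributes without a hand-written jsrun call or 'cd'."""
    return {
        # Replaces the 'cd <task_dir>' line in pre_exec.
        'sandbox': task_dir,
        'pre_exec': [
            'unset CUDA_VISIBLE_DEVICES',
            'export OMP_NUM_THREADS=1',
            'module load gcc/9.3.0',
            'module load spectrum-mpi',
            'module load hdf5/1.10.7',
            'module load adios/1.13.1',
            'module load sz/2.0.2.0',
            'module load zlib',
            'module load zfp',
            'module load c-blosc',
        ],
        # Only the application binary; the launcher is left to RP.
        'executable': './bin/xspecfem3D',
        'cpu_reqs': {
            'cpu_processes': mpi_per_task,
            'cpu_process_type': 'MPI',
            'cpu_threads': 4,
            'cpu_thread_type': 'OpenMP'},
        'gpu_reqs': {
            'gpu_processes': 1,
            'gpu_process_type': None,
            'gpu_threads': 1,
            'gpu_thread_type': 'CUDA'},
    }


def create_task(task_dir, mpi_per_task):
    """Build an EnTK Task from the settings above (requires radical.entk)."""
    from radical.entk import Task  # imported here so the dict helper works alone

    t1 = Task()
    for name, value in task_settings(task_dir, mpi_per_task).items():
        setattr(t1, name, value)
    return t1
```

Keeping the settings in a dict makes it easy to check that no `jsrun` string slipped back into `executable` before the task is submitted.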

wjlei1990 commented 2 years ago

@andre-merzky, the current script is modified based on this ticket: https://github.com/radical-collaboration/hpc-workflows/issues/150#issuecomment-943629896

Back in October 2021, EnTK failed to launch specfem3d_globe on Summit in the conventional way, so a temporary fix was proposed to get specfem running again:

It uses jsrun arguments to specify resource sets as a temporal fix. We will eventually try pmix/prrte later though.

I tried to convert the script back by just setting:

t1.executable = './bin/xspecfem3D'

However, I still got errors when I ran the script today. Here is the output from EnTK:

EnTK session: re.session.login3.lei.019088.0001
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.login3.lei.019088.0001]                               \
database   : [mongodb://rct:****@apps.marble.ccs.ornl.gov:32020/rct_test]     ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit           10752 cores     384 gpus           ok
All components created
create task managerUpdate: pipeline.0000 state: SCHEDULING
Update: pipeline.0000.stage.0000 state: SCHEDULING
Update: pipeline.0000.stage.0000.task.0001 state: SCHEDULING
Update: pipeline.0000.stage.0000.task.0000 state: SCHEDULING
Update: pipeline.0000.stage.0000.task.0001 state: SCHEDULED
Update: pipeline.0000.stage.0000.task.0000 state: SCHEDULED
Update: pipeline.0000.stage.0000 state: SCHEDULED
MongoClient opened before fork. Create MongoClient only after forking. See PyMongo's documentation for details: http://api.mongodb.org/python/current/faq.html#is-pymongo-fork-safe
                                                           ok
submit: ########################################################################
Update: pipeline.0000.stage.0000.task.0001 state: SUBMITTING
Update: pipeline.0000.stage.0000.task.0000 state: SUBMITTING
Update: pipeline.0000.stage.0000.task.0001 state: FAILED
Update: pipeline.0000.stage.0000.task.0000 state: FAILED
Update: pipeline.0000.stage.0000 state: DONE
Update: pipeline.0000 state: DONE
close task manager                                                            ok
wait for 1 pilot(s)
              0                                                               ok
closing session re.session.login3.lei.019088.0001                              \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
+ re.session.login3.lei.019088.0001 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 2463.3s                                                     ok
All components terminated

sandbox: re.session.login3.lei.019088.0001.zip

The job script is here. entk.hrlee.py.txt

The job is located on Summit: /gpfs/alpine/world-shared/geo111/lei/entk

andre-merzky commented 2 years ago

Hi @wjlei1990 - could you please also attach the task sandbox /gpfs/alpine/geo111/world-shared/lei/entk/run_0000/task.0000.err? That is now outside of the pilot sandbox so I'm afraid I can't see what's happening to the task with above tarball... Thanks!

wjlei1990 commented 2 years ago

Here you go! I think this is what you mean by task sandbox or client sandbox, right? re.session.login3.lei.019088.0001.zip

andre-merzky commented 2 years ago

Alas no, not really. I think you set in your task description something like

task.sandbox = '/gpfs/alpine/geo111/world-shared/lei/entk/run_0001'

or similar. Those run directories now contain the information about the task execution. By default those run directories live inside the pilot sandbox - but since you set the task sandbox manually, they now do not and are thus not included in the tarball you attached...
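
Since the task output now lives in the run directories and not in the pilot sandbox, it has to be collected separately. A small sketch of how that could be done with the standard library follows; `collect_task_logs` is a hypothetical helper, and it assumes the `task.*.out` / `task.*.err` file naming seen in this thread.

```python
import glob
import os
import zipfile


def collect_task_logs(run_dirs, archive_path):
    """Zip every task.*.out and task.*.err file found in the run directories."""
    added = []
    with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        for run_dir in run_dirs:
            run_name = os.path.basename(os.path.normpath(run_dir))
            for pattern in ('task.*.out', 'task.*.err'):
                for path in sorted(glob.glob(os.path.join(run_dir, pattern))):
                    # Store as <run dir name>/<file name> so runs do not collide.
                    arcname = os.path.join(run_name, os.path.basename(path))
                    zf.write(path, arcname)
                    added.append(arcname)
    return added
```

For example, `collect_task_logs(['run_0000', 'run_0001'], 'task_logs.zip')` would gather the files from two run directories into one archive that can be attached to the ticket.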

wjlei1990 commented 2 years ago

Here you go. @andre-merzky run_0000.log.zip

Actually, I took a look at task.0000.err, and it is an empty file.

However, task.0000.out has some output information:

$ cat task.0000.out
XALT Error: unable to find jsrun

wjlei1990 commented 2 years ago

Hi @andre-merzky, my running directory is here: /gpfs/alpine/world-shared/geo111/lei/entk.small

What you need to copy in that folder is:

1) entk.hrlee.py
2) specfem3d_globe

Then you can try it and see if there is anything you need to fix or not.

mtitov commented 2 years ago

@wjlei1990 hi Wenjie, can you please update the radical stack? The stack you used is also workable, but the configs required some fixes, which have been in RE/RP since v1.14.

p.s. btw we can also try PRRTE as a launch method; its updates are about to be merged into the devel branch

wjlei1990 commented 2 years ago

@mtitov Once I have updated the radical stack, do you want me to submit the tasks again to test whether it works?

mtitov commented 2 years ago

@wjlei1990 hi Wenjie, yes, can you please try with jsrun ('resource': 'ornl.summit') and, if that fails, then with prrte(*) ('resource': 'ornl.summit_prte')

(*) currently only on the devel branch: pip install git+https://github.com/radical-cybertools/radical.pilot.git@devel
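
Switching between the two launch methods only changes the `resource` label in the resource description. The sketch below shows one way to keep that switch in a single place; `resource_description` is a hypothetical helper, the `cpus` and `gpus` values are taken from the pilot line in the logs above, and the walltime is a placeholder, since the actual value does not appear in this thread.

```python
def resource_description(use_prrte=False, walltime_minutes=60):
    """Return an EnTK-style resource description for Summit.

    walltime_minutes is a placeholder default, not a value from the thread.
    """
    return {
        'resource': 'ornl.summit_prte' if use_prrte else 'ornl.summit',
        'walltime': walltime_minutes,
        'cpus': 10752,  # as reported for pilot.0000 in the logs
        'gpus': 384,    # as reported for pilot.0000 in the logs
    }


# Typical use, assuming an existing AppManager instance named appman:
#   appman.resource_desc = resource_description()                 # jsrun
#   appman.resource_desc = resource_description(use_prrte=True)   # prrte
```

A real script would likely also set project and queue entries, which are omitted here because they are not given in the thread.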

wjlei1990 commented 2 years ago

@mtitov ('resource': 'ornl.summit') still failed. I am going to try prte and will update you.

Here is my radical-stack output:

radical-stack
1652236225.369 : radical.analytics    : 443609 : 140736004314112 : INFO     : radical.analytics    version: 1.14.0

  python               : /ccs/home/lei/.conda/envs/summit-entk/bin/python3
  pythonpath           : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
  version              : 3.7.6
  virtualenv           : summit-entk

  radical.analytics    : 1.14.0
  radical.entk         : 1.14.0
  radical.gtod         : 1.13.0
  radical.pilot        : 1.14.0
  radical.saga         : 1.14.0
  radical.utils        : 1.14.0

mtitov commented 2 years ago

@wjlei1990 hi Wenjie, are there any updates on using PRRTE? (just asking for a status update, no rush)

p.s. meanwhile, can you share (somewhere on Summit within world-shared) the pilot sandbox of that latest failed run?

wjlei1990 commented 2 years ago

@mtitov Yes.

Session is: re.session.login2.lei.019132.0000

The sandbox is here:

/gpfs/alpine/world-shared/geo111/lei/entk.small/sandbox

The client log is here:

/gpfs/alpine/world-shared/geo111/lei/entk.small

cat task.0001.err 
Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument
Failed to bind process to ERF smt array, err: Invalid argument

wjlei1990 commented 2 years ago

@mtitov I updated radical.pilot to the most recent devel branch and used summit-prte. The jobs still failed.

This is my radical-stack:

radical-stack
1653321658.251 : radical.analytics    : 2345471 : 140735608607584 : INFO     : radical.analytics    version: 1.14.0

  python               : /ccs/home/lei/.conda/envs/summit-entk/bin/python3
  pythonpath           : /sw/summit/xalt/1.2.1/site:/sw/summit/xalt/1.2.1/libexec
  version              : 3.7.6
  virtualenv           : summit-entk

  radical.analytics    : 1.14.0
  radical.entk         : 1.14.0
  radical.gtod         : 1.13.0
  radical.pilot        : 1.14.0
  radical.saga         : 1.14.0
  radical.utils        : 1.14.0

The session id is re.session.login2.lei.019135.0000.

The python job script is located here: /gpfs/alpine/world-shared/geo111/lei/entk.small/entk.hrlee.py

I also copied the sandbox and client log into the directory mentioned above.

Please let me know if you have any comments, and how to fix this.

mtitov commented 2 years ago

@wjlei1990 I am still working on the jsrun case, but for prrte please make the following updates:

  1. use the RP devel branch, since the latest updates have not been released yet (pip install git+https://github.com/radical-cybertools/radical.pilot.git@devel)
  2. add the following line to the task's pre_exec attribute (prrte isolates the env defined for launching it from the env used for running an executable, so the module command happens to be unknown): . /sw/summit/lmod/lmod/init/profile

wjlei1990 commented 2 years ago

@mtitov Do you mean this? Adding . /sw/summit/lmod/lmod/init/profile to the very beginning of pre_exec?

    t1.pre_exec = [                                                             
        ". /sw/summit/lmod/lmod/init/profile",                                  
        'unset CUDA_VISIBLE_DEVICES',                                           
        #'export CUDA_VISIBLE_DEVICES=0',                                       
        "export OMP_NUM_THREADS=1",                                             
        'module load gcc/9.3.0',                                                
        'module load spectrum-mpi',                                             
        'module load hdf5/1.10.7',                                              
        'module load adios/1.13.1',                                             
        'module load sz/2.0.2.0',                                               
        'module load zlib',                                                     
        'module load zfp',                                                      
        'module load c-blosc',                                                  
    ]                   

I tried it and it failed. If I am not adding the command correctly, please correct me so I can submit it again.