radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

job failed on Summit using ENTK #125

Closed wjlei1990 closed 3 years ago

wjlei1990 commented 4 years ago

I submitted two jobs on Summit and both of them failed... I checked the sandbox, and no unit.0000* directories were generated. The jobs seem to have just been killed somehow, without running anything.

In the terminal I got this output. Could you help me figure out why the job failed?

EnTK session: re.session.login3.lei.018420.0000
Creating AppManager                                                           ok
Setting up RabbitMQ system                                                    ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.login3.lei.018420.0000]                               \
database   : [mongodb://hpcw-pr:RMVjg2eQd2RW4nfv@129.114.17.185:27017/hpcw-pr]ok
create pilot manager                                                          ok
submit 1 pilot(s)
        [ornl.summit:107520]
                                                                              ok
closing session re.session.login3.lei.018420.0000                              \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 21069.3s                                                    ok
wait for 1 pilot(s)
              0                                                          timeout
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 179, in _submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 535, in wait
    time.sleep(0.1)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 414, in run
    self._rmgr._submit_resource_request()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 192, in _submit_resource_request
    raise KeyboardInterrupt
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "jrun_entk.hrlee.py", line 144, in main
    appman.run()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 438, in run
    self.terminate()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 483, in terminate
    write_session_description(self)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 143, in write_session_description
    tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'
andre-merzky commented 4 years ago

@wjlei1990 : would you mind attaching the pilot sandbox? Thanks!

wjlei1990 commented 4 years ago

@andre-merzky Here is one sandbox for failed job:

/gpfs/alpine/world-shared/geo111/lei/entk/sandbox/failed/re.session.login3.lei.018421.0000
wjlei1990 commented 4 years ago

I still have some issues after updating the radical stack.

Number of nodes: 640
Number of cpus and gpus: 107520, 3840
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 179, in _submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 535, in wait
    time.sleep(0.1)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 413, in run
    self._rmgr._submit_resource_request()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 192, in _submit_resource_request
    raise KeyboardInterrupt
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "jrun_entk.hrlee.py", line 144, in main
    appman.run()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 437, in run
    self.terminate()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 482, in terminate
    write_session_description(self)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 145, in write_session_description
    tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'

Here is the radical.stack:

radical-stack          

  python               : 3.7.6
  pythonpath           : /sw/summit/xalt/1.2.0/site:/sw/summit/xalt/1.2.0/libexec
  virtualenv           : summit-entk

  radical.analytics    : 0.90.7-v0.72.0-40-gf8034b0@devel
  radical.entk         : 1.4.0
  radical.pilot        : 1.1.1-devel_no_bulk_cb@hotfix-entk_hangup
  radical.saga         : 1.1.2
  radical.utils        : 1.1.1

The sandbox is here:

/gpfs/alpine/world-shared/geo111/lei/entk/sandbox/re.session.login2.lei.018423.0001

I will launch another job today to see if I get lucky...

mturilli commented 4 years ago

Hi @wjlei1990, let's debug this further before attempting another submission. See my comments on Slack about updating the whole stack so that we both have the same environment.

wjlei1990 commented 4 years ago

Hi All,

I think I have now updated all the radical packages:

radical-stack 

  python               : 3.7.6
  pythonpath           : /sw/summit/xalt/1.2.0/site:/sw/summit/xalt/1.2.0/libexec
  virtualenv           : summit-entk

  radical.analytics    : 0.90.7-v0.72.0-40-gf8034b0@devel
  radical.entk         : 1.4.0
  radical.pilot        : 1.4.1
  radical.saga         : 1.4.0
  radical.utils        : 1.4.0

But I still have issues running EnTK. The most recent job I submitted shows this error in the unit sandbox:


Due to MODULEPATH changes, the following have been reloaded:
  1) c-blosc/1.12.1               4) python/3.7.0     7) zfp/0.5.2
  2) hdf5/1.8.18                  5) sz/2.0.2.0       8) zlib/1.2.11
  3) py-setuptools/40.4.3-py3     6) zeromq/4.2.5

The following have been reloaded with a version change:
  1) gcc/8.1.1 => gcc/4.8.5

Activating Modules:
  1) adios/1.13.1-py2

Due to MODULEPATH changes, the following have been reloaded:
  1) darshan-runtime/3.1.7     2) hdf5/1.8.18

Currently Loaded Modules:
  1) hsi/5.0.2.p5              12) zeromq/4.2.5
  2) lsf-tools/2.0             13) spectrum-mpi/10.3.1.2-20200121
  3) DefApps                   14) darshan-runtime/3.1.7
  4) vim/8.1.0338              15) hdf5/1.8.18
  5) tmux/2.2                  16) cuda/10.1.243
  6) py-pip/10.0.1-py3         17) zlib/1.2.11
  7) py-virtualenv/16.0.0      18) sz/2.0.2.0
  8) prrte/1.0.0_devtiming     19) zfp/0.5.2
  9) gcc/4.8.5                 20) c-blosc/1.12.1
 10) python/3.7.0              21) adios/1.13.1-py2
 11) py-setuptools/40.4.3-py3

prun: Error: unknown option "--hnp"

I copied the sandbox here:

/gpfs/alpine/world-shared/geo111/lei/entk.test/sandbox/re.session.login3.lei.018432.0001

I prepared a small-scale job that you may copy and use for testing:

/gpfs/alpine/world-shared/geo111/lei/entk.test
wjlei1990 commented 3 years ago

Had a debug session with Matteo. Jobs are up and running, but the run time is longer than expected: we are seeing ~10 min of running time for each task (it should be around 2 min). We will try a few things to see if we can fix the issue.

andre-merzky commented 3 years ago

@mtitov : Mikhail, can you advise on the current scheme for setting the SMT level on Summit? Thanks!

mturilli commented 3 years ago

This is the task description:

    t1.cpu_reqs = {
        'processes': mpi_per_task,
        'process_type': 'MPI',
        'threads_per_process': 4,
        'thread_type': 'OpenMP'}
    t1.gpu_reqs = {
        'processes': 1,
        'process_type': None,
        'threads_per_process': 1,
        'thread_type': 'CUDA'}

in the .sh script of the units I have

export "CUDA_VISIBLE_DEVICES=0"

and the .rs lists the GPU correctly:

cpu_index_using: physical
rank: 0: { host: 1; cpu: {0,1,2,3}; gpu: {0}}
rank: 1: { host: 1; cpu: {4,5,6,7}; gpu: {1}}
rank: 2: { host: 1; cpu: {8,9,10,11}; gpu: {2}}
rank: 3: { host: 1; cpu: {12,13,14,15}; gpu: {3}}
rank: 4: { host: 1; cpu: {16,17,18,19}; gpu: {4}}
rank: 5: { host: 1; cpu: {20,21,22,23}; gpu: {5}}
mtitov commented 3 years ago

Information about SMT is defined in the resource config as follows:

        "system_architecture"         : {"smt": 4,
                                         "options": ["gpumps"]}

Currently it is only available in the devel branches of both Pilot and SAGA.

SMT can also be controlled with the environment variable RADICAL_SAGA_SMT, e.g. export RADICAL_SAGA_SMT=2 (valid values are 1, 2, 4).
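
As an illustration, a minimal sketch of setting this from the top of the driver script, before the AppManager is created (assuming the variable just needs to be present in the client environment; exporting it in the shell before launching the script works as well):

    import os

    # request SMT level 2 on Summit; valid values are 1, 2 and 4
    os.environ['RADICAL_SAGA_SMT'] = '2'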

wjlei1990 commented 3 years ago

Hi, thanks for the info. Does that mean I need to install the devel branches for both Pilot and SAGA?

mtitov commented 3 years ago

@wjlei1990 yes, since the new attribute system_architecture was only recently introduced.

andre-merzky commented 3 years ago

@wjlei1990 : the RCT releases will go out today; if you want to delay testing by a day or so, you can use the released versions.

andre-merzky commented 3 years ago

The new releases have been pushed to PyPI, and the system_architecture problem should be resolved after an update of the stack.

wjlei1990 commented 3 years ago

Hi, I updated the radical-stack and the issue seems to still be there. The task is still running very slowly (9 min instead of 2 min, but it runs and finishes successfully). Here is my stack:

radical-stack

  python               : 3.7.6
  pythonpath           : /sw/summit/xalt/1.2.0/site:/sw/summit/xalt/1.2.0/libexec
  virtualenv           : summit-entk

  radical.analytics    : 1.5.0
  radical.entk         : 1.5.1
  radical.gtod         : 1.5.0
  radical.pilot        : 1.5.4
  radical.saga         : 1.5.4
  radical.utils        : 1.5.4

Here is my task description:

    t1 = Task()
    t1.pre_exec = [
        'cd {}'.format(task_dir),
        'module load gcc/4.8.5',
        'module load spectrum-mpi',
        'module load hdf5/1.8.18',
        'module load cuda',
        'module load zlib',
        'module load sz',
        'module load zfp',
        'module load c-blosc',
        'export CUDA_VISIBLE_DEVICES=0',
        'export OMP_NUM_THREADS=1'
    ]
    t1.executable = ['./bin/xspecfem3D']

    t1.cpu_reqs = {
        'processes': 6,
        'process_type': 'MPI',
        'threads_per_process': 4,
        'thread_type': 'OpenMP'}

    t1.gpu_reqs = {
        'processes': 1,
        'process_type': None,
        'threads_per_process': 1,
        'thread_type': 'CUDA'}

Here is the resource description:

    res_dict = {                                                                
        'resource': 'ornl.summit',                                              
        'project': 'GEO111',                                                    
        'schema': 'local',                                                      
        'job_name': 'test-w',                                                                                                                                               
        'walltime': 60,                                                         
        'gpus': 12,                                                          
        'cpus': 336,                                                          
        'queue': 'batch'                                                        
    }                               
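
For context, a minimal sketch of how such a task and resource description are typically wired together in an EnTK script (the RabbitMQ hostname and port values below are placeholders, not taken from the actual run script):

    from radical.entk import Pipeline, Stage, Task, AppManager

    p = Pipeline()
    s = Stage()
    s.add_tasks(t1)                      # the task described above
    p.add_stages(s)

    appman = AppManager(hostname='localhost', port=5672)   # RabbitMQ host/port (placeholders)
    appman.resource_desc = res_dict
    appman.workflow = [p]
    appman.run()
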
andre-merzky commented 3 years ago

@mtitov : Mikhail, can you please reference or provide a piece of documentation for @wjlei1990 on how to configure the SMT level for Summit? Thanks!

mtitov commented 3 years ago

@andre-merzky: I'm checking the logs (I had contacted Wenjie to go through the latest run), and I see that CUDA_VISIBLE_DEVICES is not set correctly. I found that each task is assigned 6 GPUs, so RCT skips setting it (in the scheduler). Can you please confirm that if we set the number of CPU processes, the requested number of GPU processes is assigned per process (so in the example, task t1 would have 6 CPU processes with 1 GPU each, and thus the task would have 6 GPUs)?

example of unit.000000.sl:

{'cores_per_node': 167,
 'gpus_per_node': 6,
 'lfs_per_node': {'path': None, 'size': 0},
 'lm_info': {'cvd_id_mode': 'logical'},
 'mem_per_node': 0,
 'nodes': [{'core_map': [[0, 1, 2, 3]],
            'gpu_map': [[0]],
            'lfs': {'path': None, 'size': 0},
            'mem': 0,
            'name': 'e27n08',
            'uid': '1'},
           {'core_map': [[4, 5, 6, 7]],
            'gpu_map': [[1]],
            'lfs': {'path': None, 'size': 0},
            'mem': 0,
            'name': 'e27n08',
            'uid': '1'},
           {'core_map': [[8, 9, 10, 11]],
            'gpu_map': [[2]],
            'lfs': {'path': None, 'size': 0},
            'mem': 0,
            'name': 'e27n08',
            'uid': '1'},
           {'core_map': [[12, 13, 14, 15]],
            'gpu_map': [[3]],
            'lfs': {'path': None, 'size': 0},
            'mem': 0,
            'name': 'e27n08',
            'uid': '1'},
           {'core_map': [[16, 17, 18, 19]],
            'gpu_map': [[4]],
            'lfs': {'path': None, 'size': 0},
            'mem': 0,
            'name': 'e27n08',
            'uid': '1'},
           {'core_map': [[20, 21, 22, 23]],
            'gpu_map': [[5]],
            'lfs': {'path': None, 'size': 0},
            'mem': 0,
            'name': 'e27n08',
            'uid': '1'}]}

Thus we would need to split task t1 from the example above into 6 separate tasks, each with 1 CPU process and 1 GPU process, right?

p.s. SMT is set correctly there

andre-merzky commented 3 years ago

Ouch, this we need to fix. What you stumbled over is this. The underlying problem is that _handle_cuda attempts to add CUDA_VISIBLE_DEVICES to the task's environment, but if you look at the task layout above, you'll notice that we would need to set a different env variable per rank. I don't think we can do that easily in the current code. It points again to the deeper underlying issue that we do not support heterogeneous ranks, even if the heterogeneity is, as in this case, just a single env setting.

There are two ways (AFAICS) to solve this, none trivial:

Now, on to what to do in your case, as that needs resolving quickly... What does the application actually expect: a different env setting per rank, or CVD=0,1,2,3,4,5 for all ranks? Or something different altogether?

andre-merzky commented 3 years ago

But also, I should have read your comment to the end :-) Yes, splitting into 1-CPU/1-GPU tasks would resolve this. Whether that is feasible depends on the application though, as the tasks would not share an MPI communicator anymore...
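
For illustration, a minimal sketch of that split (hedged: it assumes the solver can run as six independent single-rank processes, and it reuses the module loads from t1 via a hypothetical common_pre_exec list):

    from radical.entk import Stage, Task

    s = Stage()
    for i in range(6):
        t = Task()
        t.pre_exec = list(common_pre_exec)      # same module loads as in t1 (assumed to be defined)
        t.executable = ['./bin/xspecfem3D']
        t.cpu_reqs = {'processes': 1,
                      'process_type': None,     # no MPI communicator shared across tasks
                      'threads_per_process': 4,
                      'thread_type': 'OpenMP'}
        t.gpu_reqs = {'processes': 1,
                      'process_type': None,
                      'threads_per_process': 1,
                      'thread_type': 'CUDA'}
        s.add_tasks(t)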

mtitov commented 3 years ago

depends on the application though, as the tasks would not share an MPI communicator anymore...

I guess this actually brings us to your previous comment. Yeah, I was first thinking about how to resolve the task definition in terms of RCT, and missed this point.

wjlei1990 commented 3 years ago

Hi all, thanks for your help. So I guess I will need to wait for some updates to EnTK before the tasks can run correctly. Am I right?

wjlei1990 commented 3 years ago

Hi all, as mentioned by @mtitov, I just updated the radical-stack.

radical-stack

  python               : 3.7.6
  pythonpath           : /sw/summit/xalt/1.2.0/site:/sw/summit/xalt/1.2.0/libexec
  virtualenv           : summit-entk

  radical.analytics    : 1.5.0
  radical.entk         : 1.5.1
  radical.gtod         : 1.5.0
  radical.pilot        : 1.5.5
  radical.saga         : 1.5.6
  radical.utils        : 1.5.4

After relaunching the job, it still has the same issue. My job script is located in /gpfs/alpine/world-shared/geo111/lei/entk.small/run_entk.hrlee.py on Summit.

Some description of the task: each task will use 6 CPUs and 6 GPUs, with 1 MPI rank allocated to 1 CPU and 1 GPU. @mtitov mentioned that changing the task definition or the executable script (adding some wrapper?) may resolve the issue. Could you give me some instructions on how to do that?

andre-merzky commented 3 years ago

@mtitov : can you have a look at the -B option as described here? It might be the right option to use to convince jsrun to set the correct CUDA_VISIBLE_DEVICES value...

andre-merzky commented 3 years ago

@mtitov : ping

mtitov commented 3 years ago

Since RP always sets the env variable CUDA_VISIBLE_DEVICES, either to a specific value or leaving it empty, an extra operation by the user in the task pre_exec, unset CUDA_VISIBLE_DEVICES, resolved the issue (confirmed by @wjlei1990).
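
Concretely, a minimal sketch of that workaround in the task description (module loads elided; only the relevant pre_exec lines are shown):

    t1.pre_exec = [
        'cd {}'.format(task_dir),
        # ... module loads as before ...
        'unset CUDA_VISIBLE_DEVICES',   # discard RP's exported value; let the launcher handle GPU visibility per rank
        'export OMP_NUM_THREADS=1'
    ]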