radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

ENTK hangs on Summit #109

Closed wjlei1990 closed 4 years ago

wjlei1990 commented 4 years ago

Hi, my current EnTK Python script hangs when running on Summit.

I copied my files to a world-shared directory here so you can replicate my tests:

/gpfs/alpine/geo111/world-shared/lei/entk

I also prepared a bash script for you to launch the job directly:

/gpfs/alpine/geo111/world-shared/lei/entk/specfem/job_solver.bash

The system modules I used:

module load gcc/4.8.5
module load spectrum-mpi
module load hdf5/1.8.18
module load cuda

module load zlib
module load sz
module load zfp
module load c-blosc
lee212 commented 4 years ago

At line 46 of run_job.py, schema should be local instead of jsrun.

andre-merzky commented 4 years ago

As per the Slack exchange: the pilot sees a SIGNAL 2 while running - no other ERROR logs, no suspicious *.err/*.out files.

wjlei1990 commented 4 years ago

@lee212 I actually used 'local' on Summit... sorry, jsrun was just one attempt. I will edit it back to avoid confusion.

lee212 commented 4 years ago

@wjlei1990 , thanks for the confirmation. No worries, I used your script on my account and have a similar issue.

wjlei1990 commented 4 years ago

Thanks for the help :)

lee212 commented 4 years ago

Is queue missing from the res_dict, e.g. "queue": "batch"?

wjlei1990 commented 4 years ago

OK Let me try now...

wjlei1990 commented 4 years ago

I tried adding "queue":"batch" and the job still hangs...

lsawade commented 4 years ago

I talked to @wjlei1990; the issue I'm facing on Tiger (#95) seems to be the same.

andre-merzky commented 4 years ago

@lee212 : you said you were able to reproduce this, right? Do you already have any idea what's up?

lee212 commented 4 years ago

This seems related to resource over-allocation, and the correct description would be:

    res_dict = {
        'resource': 'ornl.summit',
        'project': 'GEOxxx',
        'schema': 'local',
        'walltime': 10,
        'cpus': 168,
        'gpus': 6,
        'queue': 'batch'
    }

and the task resource would be:

    t1.cpu_reqs = {
        'processes': 6,
        'process_type': 'MPI',
        'threads_per_process': 4,
        'thread_type': 'OpenMP'}

    t1.gpu_reqs = {
            'processes': 1,
            'process_type': None,
            'threads_per_process': 1,
            'thread_type': 'CUDA'}

this will result in:

rank: 0: { host: 1; cpu: {0,1,2,3}; gpu: {0}}
rank: 1: { host: 1; cpu: {4,5,6,7}; gpu: {1}}
rank: 2: { host: 1; cpu: {8,9,10,11}; gpu: {2}}
rank: 3: { host: 1; cpu: {12,13,14,15}; gpu: {3}}
rank: 4: { host: 1; cpu: {16,17,18,19}; gpu: {4}}
rank: 5: { host: 1; cpu: {20,21,22,23}; gpu: {5}}
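
For reference, here is a minimal sketch of how such a resource description and task requirements fit into an EnTK script. This is not the actual run_entk.py; it assumes EnTK 1.x with a reachable RabbitMQ instance, and the hostname, port, and executable path are placeholders.

    # Minimal EnTK sketch; hostname/port and the executable are placeholders.
    from radical.entk import Pipeline, Stage, Task, AppManager

    res_dict = {
        'resource': 'ornl.summit',
        'project' : 'GEOxxx',
        'schema'  : 'local',
        'walltime': 10,
        'cpus'    : 168,
        'gpus'    : 6,
        'queue'   : 'batch'
    }

    t1 = Task()
    t1.executable = './bin/xspecfem3D'      # placeholder executable
    t1.cpu_reqs   = {'processes': 6, 'process_type': 'MPI',
                     'threads_per_process': 4, 'thread_type': 'OpenMP'}
    t1.gpu_reqs   = {'processes': 1, 'process_type': None,
                     'threads_per_process': 1, 'thread_type': 'CUDA'}

    s = Stage()
    s.add_tasks(t1)

    p = Pipeline()
    p.add_stages(s)

    amgr = AppManager(hostname='localhost', port=5672)  # RabbitMQ endpoint (placeholder)
    amgr.resource_desc = res_dict
    amgr.workflow      = set([p])
    amgr.run()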
wjlei1990 commented 4 years ago

Hi, could you explain why the value of cpus in res_dict is 168?


Updates on the current test status:

  1. The single-task runs (CPU and GPU) are working, and the running time is as expected.

  2. I am testing multiple tasks running at the same time, and EnTK does not seem happy with it; the job still hangs.

lee212 commented 4 years ago

Summit compute nodes have two 22-core Power9 CPUs, and each core supports 4 hardware threads, resulting in 168 = 2 * (22 - 1) * 4 allocatable hardware threads per node. One core on each socket is set aside (the -1) for overhead and is not available for allocation through jsrun.

I will look into the hangs on multiple tasks.
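
To make this arithmetic concrete, here is a small sketch of the node bookkeeping described above (the helper name and the print call are purely illustrative, not part of run_entk.py):

    # Summit node geometry as described above: 2 sockets x 22 cores,
    # 1 core per socket reserved, 4 hardware threads per core, 6 GPUs per node.
    CORES_PER_NODE   = 2 * (22 - 1)        # 42 allocatable cores
    THREADS_PER_NODE = CORES_PER_NODE * 4  # 168 hardware threads
    GPUS_PER_NODE    = 6

    def summit_request(nnodes):
        """Return the (cpus, gpus) values to put into res_dict for nnodes nodes."""
        return nnodes * THREADS_PER_NODE, nnodes * GPUS_PER_NODE

    # A 384-GPU run needs 384 / 6 = 64 nodes:
    print(summit_request(64))              # (10752, 384)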

mturilli commented 4 years ago

This is critical for Summit allocation renewal. Data need to be ready showing that we are running in production on Summit with EnTK.

mturilli commented 4 years ago

@wjlei1990 to address this we will need to reproduce your issue. Unfortunately, this means we will need some information from you:

  • A batch script (without EnTK) that correctly executes your executable. This will be our baseline.
  • The workflow you are trying to use with EnTK, with instructions on how to run it.

We will run your workflow first with a single task, comparing it to our baseline and confirming the behavior you reported. We will then run the same workflow with two concurrent tasks to confirm the issue you report.

@lee212 do you see anything else you will need to debug this?

wjlei1990 commented 4 years ago

Summit compute nodes have (2) 22-core Power9 CPUs where each core supports 4 hardware threads, resulting in 168 = 2 * (22 - 1) * 4. 1 core on each socket has been set aside (-1) for overhead and is not available for allocation through jsrun.

I will look into the hangs on multiple tasks.

I think this may need some corrections.

I just tried one simulation which used 384 GPUs and 384 CPU cores. On Summit the job should use 384/6 = 64 nodes. However, when I asked for 64 * 2 * (22-1) * 4 = 10752 CPUs and 384 GPUs, the scheduler showed that the job asked for 128 nodes, which is not correct.

My resource allocation:

    res_dict = {
        'resource': 'ornl.summit',
        'project': 'GEO111',
        'schema': 'local',
        'walltime': 30,
        'gpus': 384,
        'cpus': 10752,
        'queue': 'batch'
    } 

    t1.cpu_reqs = {
        'processes': 384,
        'process_type': 'MPI',
        'threads_per_process': 4,
        'thread_type': 'OpenMP'}

    t1.gpu_reqs = {
        'processes': 1,
        'process_type': None,
        'threads_per_process': 1,                                               
        'thread_type': 'CUDA'}

Could you please double-check it?

wjlei1990 commented 4 years ago

@wjlei1990 to address this we will need to reproduce your issue. Unfortunately, this means we will need some information from you:

  • batch script (without EnTK) that correctly executes your executable. This will be our baseline.
  • The workflow you are trying to use with EnTK with instructions on how to run it.

We will run your workflow first with a single task, comparing it to our baseline and confirming the behavior your reported. We will then run the same workflow with two concurrent tasks to confirm the issue you report.

@lee212 do you see anything else you will need to debug this?

andre-merzky commented 4 years ago

@wjlei1990 : thanks for the batch script! What is the runtime we are expected to see? I don't mind running this test, but getting jobs of that size will always take a bit and will burn some allocation. If you happen to have a smaller test case available, let us know please :-)

wjlei1990 commented 4 years ago

@wjlei1990 : thanks for the batch script! What is the runtime we are expected to see? I don't mind running this test, but getting jobs of that size will always take a bit and will burn some allocation. If you happen to have a smaller test case available, let us know please :-)

Hi Andre, the running time should be around 1 min 20 sec if submitted using the LSF batch script.

andre-merzky commented 4 years ago

When running the batch script, I see:

solver starts at: Mon Mar 30 18:12:47 EDT 2020
jsrun -n 384 -a 1 -c 1 -g 1 ./bin/xspecfem3D
Mon Mar 30 18:12:47 EDT 2020

 **************
 **************
 ADIOS significantly slows down small or medium-size runs, which is the case here, please consider turning it off
 **************
 **************

User defined signal 2
ERROR:  One or more process (first noticed rank 259) terminated with signal 12

The runtime was about 4 seconds. Do you have any suggestions? I did a recursive copy of your specfem3d_globe_990cd4 directory and ran from there (I needed write permissions to OUTPUT_FILES/). Also, I had to enable the module load commands in the batch script to avoid unresolved library links.

wjlei1990 commented 4 years ago

Could you share the location of your running directory? May I take a look?

andre-merzky commented 4 years ago

Sure! It lives here:

/gpfs/alpine/med110/scratch/merzky1/covid/radical.pilot/specfem3d_globe_990cd4

But you will need to be in the med110 group :-( If you are not (which I guess) I can move to a world readable dir - but that will have to wait 'til tomorrow...

wjlei1990 commented 4 years ago

Sure! It lives here:

/gpfs/alpine/med110/scratch/merzky1/covid/radical.pilot/specfem3d_globe_990cd4

But you will need to be in the med110 group :-( If you are not (which I guess) I can move to a world readable dir - but that will have to wait 'til tomorrow...

I don't have access.

From the error message itself, I can't tell what is going wrong. Just to do a quick check, could you submit the job again?

I am also trying to provide a cleaner, leaner SPECFEM build. I will do some tests and update you later.

wjlei1990 commented 4 years ago

Sure! It lives here:

/gpfs/alpine/med110/scratch/merzky1/covid/radical.pilot/specfem3d_globe_990cd4

But you will need to be in the med110 group :-( If you are not (which I guess) I can move to a world readable dir - but that will have to wait 'til tomorrow...

Hi Andre, I rebuilt SPECFEM; could you copy it again to test? The new build uses only system modules and libraries. The previous one depended on a library that sits in my own home directory.

SPECFEM3D sits in the same directory: $WORLDWORK/geo111/lei/entk/specfem3d_globe_990cd4

lee212 commented 4 years ago

@wjlei1990 , I tried different node counts, i.e. 1/2/4/8/16/32/64, which corresponds to up to 384 GPUs. I was not able to replicate the hanging issue, but my test runs showed that, for example, 384 GPUs for a single task produce a resource file like:

cpu_index_using: physical
rank: 0: { host: 1; cpu: {0,1,2,3}; gpu: {0}}
rank: 1: { host: 1; cpu: {4,5,6,7}; gpu: {1}}
rank: 2: { host: 1; cpu: {8,9,10,11}; gpu: {2}}
rank: 3: { host: 1; cpu: {12,13,14,15}; gpu: {3}}
rank: 4: { host: 1; cpu: {16,17,18,19}; gpu: {4}}
rank: 5: { host: 1; cpu: {20,21,22,23}; gpu: {5}}
rank: 6: { host: 2; cpu: {0,1,2,3}; gpu: {0}}
rank: 7: { host: 2; cpu: {4,5,6,7}; gpu: {1}}
rank: 8: { host: 2; cpu: {8,9,10,11}; gpu: {2}}
rank: 9: { host: 2; cpu: {12,13,14,15}; gpu: {3}}
rank: 10: { host: 2; cpu: {16,17,18,19}; gpu: {4}}
...

rank: 372: { host: 63; cpu: {0,1,2,3}; gpu: {0}}
rank: 373: { host: 63; cpu: {4,5,6,7}; gpu: {1}}
rank: 374: { host: 63; cpu: {8,9,10,11}; gpu: {2}}
rank: 375: { host: 63; cpu: {12,13,14,15}; gpu: {3}}
rank: 376: { host: 63; cpu: {16,17,18,19}; gpu: {4}}
rank: 377: { host: 63; cpu: {20,21,22,23}; gpu: {5}}
rank: 378: { host: 64; cpu: {0,1,2,3}; gpu: {0}}
rank: 379: { host: 64; cpu: {4,5,6,7}; gpu: {1}}
rank: 380: { host: 64; cpu: {8,9,10,11}; gpu: {2}}
rank: 381: { host: 64; cpu: {12,13,14,15}; gpu: {3}}
rank: 382: { host: 64; cpu: {16,17,18,19}; gpu: {4}}
rank: 383: { host: 64; cpu: {20,21,22,23}; gpu: {5}}

I observed similar placements for the other runs.

My script is almost identical to yours and it is located at $WORLDWORK/csc393/hrlee/hpc-workflow/run_entk.py. The changes I made are:

$ diff /gpfs/alpine/world-shared/geo111/lei/entk/run_entk.py $WORLDWORK/csc393/hrlee/hpc-workflow/run_entk.py
55c55
<         'process_type': 'MPI',
---
>         'process_type': None,
57c57
<         'thread_type': 'OpenMP'}
---
>         'thread_type': 'CUDA'}
105c105
<     ncpus = int(nnodes * (22 - 1) * 4)
---
>     ncpus = int(nnodes * 2 * (22 - 1) * 4)
111c111
<         'project': 'GEO111',
---
>         'project': 'CSC393',
116c116
<         'walltime': 10,
---
>         'walltime': 5,

One comment: your calculation of ncpus is missing the 2 *, so the numbers will be half of what you expected. I doubt this is the main cause, though.

Can you try with smaller node counts and see if it works?

BTW, I can't tell whether the new executable, specfem3d_globe_990cd4/bin/xspecfem3D, runs as expected. I just ran a sanity check to evaluate whether it completes scheduling with the requested resources.

andre-merzky commented 4 years ago

Thanks @lee212 : that task layout looks correct to me. I had the impression that MPI would be needed, though? That should result in the same layout either way.

wjlei1990 commented 4 years ago

@wjlei1990 , I tried different node counts, i.e. 1/2/4/8/16/32/64, which corresponds to up to 384 GPUs. [...] Can you try with smaller node counts and see if it works?

Hi @lee212 , I copied your script to my directory:

lei@login5 /gpfs/alpine/world-shared/geo111/lei/entk $ 
diff /gpfs/alpine/world-shared/geo111/lei/entk/run_entk.hrlee.py $WORLDWORK/csc393/hrlee/hpc-workflow/run_entk.py
111c111
<         'project': 'GEO111',
---
>         'project': 'CSC393',

However, the job is not successful. I doubt your job was successful either, since when I checked your job output directory there are no output files generated:

ls $WORLDWORK/csc393/hrlee/hpc-workflow/run_0000/OUTPUT_FILES | wc -l
2

In a successful run, the output should be like this directory:

ls /gpfs/alpine/world-shared/geo111/lei/entk/specfem3d_globe_990cd4/OUTPUT_FILES/ | wc -l
739

Did you remove the output files of your job?

One more interesting behaviour I found in EnTK: the Python script seems to finish while the job is still running in the job queue. My impression is that the job should end first, and then the EnTK Python script should finish and exit.

lee212 commented 4 years ago

Okay, I re-ran with the executable, and saw:

/bin/xspecfem3D: error while loading shared libraries: libblosc.so.1: cannot open shared object file: No such file or directory
wjlei1990 commented 4 years ago

Okay, I re-ran with the executable, and saw:

/bin/xspecfem3D: error while loading shared libraries: libblosc.so.1: cannot open shared object file: No such file or directory

I think you may have missed some modules:

module load gcc/4.8.5
module load spectrum-mpi
module load hdf5/1.8.18
module load cuda

module load zlib
module load sz
module load zfp
module load c-blosc

Maybe I should put them into my scripts.
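
If it helps, a hedged sketch of one way to put them into the EnTK script is to attach the module loads to the task's pre_exec list (t1 here stands for the EnTK Task object used in the workflow script):

    # Load the same modules used at build time before the task executable runs.
    t1.pre_exec = [
        'module load gcc/4.8.5',
        'module load spectrum-mpi',
        'module load hdf5/1.8.18',
        'module load cuda',
        'module load zlib',
        'module load sz',
        'module load zfp',
        'module load c-blosc',
    ]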

lee212 commented 4 years ago

I added these modules and ran a test with only 6 GPUs; the result output is here: /gpfs/alpine/world-shared/csc393/hrlee/hpc-workflow/run_with_module_load_6gpus. I see some warning/error messages, but are these okay to ignore? Can you confirm?

Does it have to run with 384 GPUs? I submitted a new job anyway, which will likely start on Monday at 2pm.

lee212 commented 4 years ago

Okay, the job with 384 GPUs is also complete; it seems to have failed with some errors even though these modules were added to pre_exec. The output is here: /gpfs/alpine/world-shared/csc393/hrlee/hpc-workflow/run_with_module_load_384gpus

lee212 commented 4 years ago

Just in case, my stacks are:

  radical.entk         : 1.0.2
  radical.pilot        : 1.2.1
  radical.saga         : 1.2.0
  radical.utils        : 1.2.2
wjlei1990 commented 4 years ago

Just in case, my stacks are:

  radical.entk         : 1.0.2
  radical.pilot        : 1.2.1
  radical.saga         : 1.2.0
  radical.utils        : 1.2.2

Hi @lee212, I think I have resolved most of the issues; most of them were just problems in my own script. Now I can successfully launch a few tasks using EnTK.

There is one remaining question. I found that when EnTK exits, the job still stays in the job queue and keeps burning hours, even though I think all the tasks have finished.

Here is what EnTK prints to the terminal:

...
submit: ########################################################################
Update: pipeline.0000.stage.0000.task.0000 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0000 state: DONE
Update: pipeline.0000.stage.0000.task.0001 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0001 state: DONE
Update: pipeline.0000.stage.0000.task.0002 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0002 state: DONE
Update: pipeline.0000.stage.0000.task.0003 state: EXECUTED
Update: pipeline.0000.stage.0000.task.0003 state: DONE
Update: pipeline.0000.stage.0000 state: DONE
Update: pipeline.0000 state: DONE
close unit manager                                                            ok
wait for 1 pilot(s)
              0                                                               ok
closing session re.session.login4.lei.018359.0009                              \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
+ re.session.login4.lei.018359.0009 (json)
+ pilot.0000 (profiles)
+ pilot.0000 (logfiles)
session lifetime: 457.8s                                                      ok
All components terminated

I suppose the job (in the LSF job queue) should finish when the All components terminated message appears. Am I correct?

andre-merzky commented 4 years ago

Yes, the job should finish. @wjlei1990 , can you please provide (or point me to) a client-side sandbox so I can check what is happening? That is most likely an RP-level error.

wjlei1990 commented 4 years ago

a client side sandbox

Hi Andre, I put one example here:

/gpfs/alpine/world-shared/geo111/lei/entk.small/sandbox/re.session.login4.lei.018359.0011/
wjlei1990 commented 4 years ago

Hi, I would also like to know how to do performance benchmarking. Things that interest me include how to measure the overhead, as well as the time spent in each task and stage.

Following @lee212's suggestion, I put the following flags in my .bashrc file:

  export RADICAL_PROFILE="TRUE"
  export RADICAL_ENTK_PROFILE="TRUE"
  export RADICAL_PILOT_PROFILE="TRUE"

I think EnTK will generate some profiling files. Could you give me some instructions on how to use them?

lee212 commented 4 years ago

radical.analytics might provide you some numbers/plots, but be aware that it is under heavy development and you may often see errors/issues. Please use it with caution. I think I have full instructions somewhere, but a quick guide to try it out is:

git clone https://github.com/radical-cybertools/radical.analytics.git
cd radical.analytics
pip install .
export RADICAL_PILOT_DBURL=mongodb://rct:rct_test@two.radical-project.org/rct_test

Once this is complete, you run analytics for a particular session. In practice I do, for example:

ln -s /gpfs/alpine/world-shared/geo111/lei/entk.small/re.session.login4.lei.018359.0011/ .
bin/radical-analytics-inspect re.session.login4.lei.018359.0011

If this completes successfully, you may find files generated like:

re.session.login4.lei.018359.0011.stats
re.session.login4.lei.018359.0011_conc.png
re.session.login4.lei.018359.0011_dur.png
re.session.login4.lei.018359.0011_rate.png
re.session.login4.lei.018359.0011_util.png

The *.stats file provides timing values in plain text, and the others are plots with different filters applied.

Matteo and Andre can provide a better explanation and more in-depth usage, and can correct me if something is missing.

lee212 commented 4 years ago

@wjlei1990 , I know this is about testing 384 GPUs for one task for now, but did you, or will you, run a test with multiple tasks as well? I am just curious whether the current version works seamlessly when we increase the number of concurrent tasks.

wjlei1990 commented 4 years ago

@wjlei1990 , I know this is about testing 384 gpus for one task, for now, but do you or did you run a test with multiple tasks as well? I am just curious if the current version works seamlessly when we increase the number of concurrent tasks.

I did a test with 5 concurrent tasks, each task with 384 nodes. The job ran successfully and the output files look good to me. I haven't done any performance checks yet.

wjlei1990 commented 4 years ago

Here is one example output from radical-analytics-inspect.

Maybe you can teach me how to interpret it at this week's meeting.


1. Small Scale Test

This one is from a small scale test.

re.session.login4.lei.018359.0011 [4]
    Agent Nodes         :          0.000     0.000%   !  ['agent']
    Pilot Startup       :      10608.329     8.095%      ['boot', 'setup_1']
    Warmup              :       2509.556     1.915%      ['warm']
    Prepare Execution   :          2.492     0.002%      ['exec_queue', 'exec_prep']
    Pilot Termination   :     120440.127    91.906%      ['term']
    Execution RP        :         13.004     0.010%      ['exec_rp', 'exec_sh', 'term_sh', 'term_rp']
    Execution Cmd       :      15596.728    11.902%      ['exec_cmd']
    Unschedule          :          8.740     0.007%      ['unschedule']
    Draining            :       2084.176     1.590%      ['drain']
    Idle                :      97030.667    74.043%      ['idle']
    total               :     131046.778   100.000%      

    total               :     131046.778   100.000%
    over                :     232697.090   177.568%
    work                :      15596.728    11.902%
    miss                :    -117247.040   -89.470%

[Attached plots: re.session.login4.lei.018359.0011 state, _conc, _dur, _rate, and _util]


2. Full Scale Test

This one is a full-scale test, with 5 concurrent tasks, each with 384 nodes. There are a total of 50 tasks in the stage.

re.session.login5.lei.018358.0000 [1]
    Agent Nodes         :          0.000     0.000%   !  ['agent']
    Pilot Startup       :     352848.755    18.936%      ['boot', 'setup_1']
    Warmup              :      78204.028     4.197%      ['warm']
    Prepare Execution   :         52.962     0.003%      ['exec_queue', 'exec_prep']
    Pilot Termination   :    1511431.231    81.113%      ['term']
    Execution RP        :        216.342     0.012%      ['exec_rp', 'exec_sh', 'term_sh', 'term_rp']
    Execution Cmd       :     132558.541     7.114%      ['exec_cmd']
    Unschedule          :         72.372     0.004%      ['unschedule']
    Draining            :      32980.902     1.770%      ['drain']
    Idle                :    1171608.708    62.876%      ['idle']
    total               :    1863362.381   100.000%      

    total               :    1863362.381   100.000%
    over                :    3147415.300   168.911%
    work                :     132558.541     7.114%
    miss                :   -1416611.460   -76.024%

Below are the figures. [Attached plots: re.session.login5.lei.018358.0000 state, _conc, _dur, _rate, and _util]

andre-merzky commented 4 years ago

Hi @wjlei1990 :

a client side sandbox

Hi Andre, I put one example here: /gpfs/alpine/world-shared/geo111/lei/entk.small/sandbox/re.session.login4.lei.018359.0011/

Thanks - but that is the pilot sandbox. I meant the session directory on the client side, i.e. the one created in the location where you run the EnTK script. Thanks!

As for the analysis: there is obviously something off with the utilization; I'll look into it. But in general the utilization won't look great, since you are not using the CPU cores and we then count those as idle resources.

wjlei1990 commented 4 years ago

Hi @wjlei1990 :

a client side sandbox

Hi Andre, I put one example here: /gpfs/alpine/world-shared/geo111/lei/entk.small/sandbox/re.session.login4.lei.018359.0011/

Thanks - but that is the pilot sandbox. I meant the session directory on the client side, i.e., which is created in the location where you run the EnTK script. Thanks!

As for the analysis: there is something off with the utilization obviously, I'll look into it. But in general, the utilization won't look great since you are not using the CPU cores, and we count those as idle resources then.

Hi Andre, the directory is here:

/gpfs/alpine/world-shared/geo111/lei/entk.small/re.session.login4.lei.018359.0011

This one is a small-scale job.

If you are looking for a full-scale job:

/gpfs/alpine/world-shared/geo111/lei/entk/re.session.login5.lei.018358.0000
andre-merzky commented 4 years ago

Thanks. From the logs, it looks like the pilot job gets canceled all right:

radical.log:1586271908.106 : pmgr_launching.0000  : 52789 : 140735340868016 : DEBUG    : update cancel req: pilot.0000 1586271908.1062713
radical.log:1586271908.107 : pmgr_launching.0000  : 52789 : 140735340868016 : DEBUG    : killing pilots: last cancel: 1586271908.1062713
radical.log:1586271916.831 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271926.845 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271936.858 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271946.871 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271956.884 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Done']

Cancellation takes a while, but that's LSF taking its time. Do you see the job alive for longer than a couple of minutes?

wjlei1990 commented 4 years ago

Thanks. From the logs, it looks like the pilot job gets canceled all right:

radical.log:1586271908.106 : pmgr_launching.0000  : 52789 : 140735340868016 : DEBUG    : update cancel req: pilot.0000 1586271908.1062713
radical.log:1586271908.107 : pmgr_launching.0000  : 52789 : 140735340868016 : DEBUG    : killing pilots: last cancel: 1586271908.1062713
radical.log:1586271916.831 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271926.845 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271936.858 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271946.871 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Running']
radical.log:1586271956.884 : pmgr_launching.0000  : 52789 : 140735776813488 : DEBUG    : bulk states: ['Done']

Cancellation takes a while, but that's LSF taking its time. Do you see the job alive for longer than a couple of minutes?

Got it. So as long as the job finishes within a few minutes of the EnTK script exiting, it should be fine. I think I observed a lag of a few minutes for my small-scale job.

I haven't monitored the large-scale job yet, since it is a bit difficult to predict when the job will be running. But I will keep an eye on it.