radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

Tracking ENTK Performance Test on Summit #142

Closed wjlei1990 closed 2 years ago

wjlei1990 commented 3 years ago

I have conducted a few test cases on Summit using radical.entk.

Each task uses 384 GPUs (64 nodes), which is the same size as our current production case.

The current tests contain slurm jobs that run 1, 5, 10, and 20 simultaneous tasks. Those jobs used 64, 320, 640, and 1280 nodes on Summit, respectively.
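
The node counts follow directly from the task size. A minimal sketch of the arithmetic, assuming 6 GPUs per node (inferred from 384 GPUs on 64 nodes); the function name `job_size` is made up for illustration:

```python
# Assumption: 6 GPUs per node, inferred from 384 GPUs / 64 nodes per task.
GPUS_PER_NODE = 6
NODES_PER_TASK = 64


def job_size(concurrent_tasks):
    """Return (nodes, gpus) needed to run the given number of tasks at once."""
    nodes = concurrent_tasks * NODES_PER_TASK
    return nodes, nodes * GPUS_PER_NODE


for n in (1, 5, 10, 20):
    nodes, gpus = job_size(n)
    print(f"{n:>2} concurrent tasks -> {nodes:>4} nodes, {gpus:>4} GPUs")
```

For the 5-task case this gives 320 nodes and 1920 GPUs, which agrees with the `1920 gpus` pilot request that appears in the log output later in this thread.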

We will further analyze the results using radical.analytics.

If I want to use radical.analytics to get some performance metrics, is the radical.analytics online documentation a good starting point?

wjlei1990 commented 3 years ago

I noticed that on Summit, if EnTK fails to launch the job, it raises the following exception:

EnTK session: re.session.login1.lei.018775.0006
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.login1.lei.018775.0006]                               \
database   : [mongodb://hpcw-pr:****@129.114.17.185:27017/hpcw-pr]            ok
create pilot manager                                                          ok
submit 1 pilot(s)
        pilot.0000   ornl.summit           53760 cores    1920 gpus           ok
closing session re.session.login1.lei.018775.0006                              \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 38.5s                                                       ok
wait for 1 pilot(s)
              0                                                          timeout
Execution failed, error: 'NoneType' object has no attribute '_uid'
Traceback (most recent call last):
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 433, in run
    self._rmgr.submit_resource_request()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 199, in submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/pilot/pilot.py", line 558, in wait
    time.sleep(0.1)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "run_entk.hrlee.py", line 184, in main
    appman.run()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 457, in run
    self.terminate()
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 502, in terminate
    write_session_description(self)
  File "/ccs/home/lei/.conda/envs/summit-entk/lib/python3.7/site-packages/radical/entk/utils/prof_utils.py", line 148, in write_session_description
    tree[amgr._uid]['children'].append(wfp._uid)
AttributeError: 'NoneType' object has no attribute '_uid'

It is not a big deal. I think it may be just a small bug (that raised an unhandled exception) in the entk software.
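
The traceback suggests that `write_session_description` dereferences a component that is still `None` when the run is interrupted before the pilot becomes active. The sketch below reproduces that pattern with stand-in classes and shows one possible guard. The classes, the `_wfp` attribute, and `describe_session` are hypothetical; this is not EnTK's actual code or its fix.

```python
class WorkflowProcessor:
    """Stand-in for a component that is created only after resources are acquired."""

    def __init__(self, uid):
        self._uid = uid


class AppManager:
    """Stand-in application manager; `_wfp` stays None until the run really starts."""

    def __init__(self, uid):
        self._uid = uid
        self._wfp = None


def describe_session(amgr):
    """Build a small description tree, skipping components that were never created."""
    tree = {amgr._uid: {'children': []}}
    wfp = amgr._wfp
    # Without this check, wfp._uid raises:
    # AttributeError: 'NoneType' object has no attribute '_uid'
    if wfp is not None:
        tree[amgr._uid]['children'].append(wfp._uid)
    return tree


amgr = AppManager('amgr.0000')
print(describe_session(amgr))   # interrupted early: no children recorded

amgr._wfp = WorkflowProcessor('wfp.0000')
print(describe_session(amgr))   # normal run: the processor is listed
```

With a guard of this kind, an interrupted run would still write a partial session description instead of masking the original `KeyboardInterrupt` with a second exception.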

wjlei1990 commented 3 years ago

Hi @mturilli, I am wondering if there are any instructions for the performance analysis?

mturilli commented 3 years ago

Hi @wjlei1990, happy to try to run the same analysis we used for the previous proposal on your new session. Let me know where I can find it and I will see what we can get out of it.

wjlei1990 commented 3 years ago

Hi @mturilli, thanks for the help. The client log is located here:

/gpfs/alpine/world-shared/geo111/lei/entk

The sandbox log is located here:

/gpfs/alpine/world-shared/geo111/lei/entk/sandbox

mturilli commented 3 years ago

@wjlei1990, please find the updated notebook for INCITE 2021 at: https://github.com/radical-experiments/hpcw-princeton/blob/master/analysis/incite2021.ipynb

I used the latest session at the specified path, as it seemed to be the only one that had completed successfully (re.session.login1.lei.018775.0008).

wjlei1990 commented 3 years ago

@mturilli That is weird, because what I saw on the terminal is that all the jobs are done. I checked the output files and they are good.

What do you suggest I do? Maybe I can submit all the jobs one more time.

mturilli commented 3 years ago

Thanks @wjlei1990. I double checked and found all the profiles I needed. I pushed an iteration of the notebook with the plots of all the sessions.

wjlei1990 commented 3 years ago

@mturilli Cool! Thanks very much. I think we can discuss it a bit more during our next meeting, along with the next steps for our paper!

wjlei1990 commented 3 years ago

Hi @mturilli, I did another performance measurement on Summit using real simulation cases.

The third case is not ideal. I think some slow nodes were involved in the simulation.

So I am going to relaunch this one and update the figure.

Figures: incite_2021_ovh_ttx, incite_2021_ru