radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number: 1639694

Performance metric for radical.analytics #117

Closed wjlei1990 closed 4 years ago

wjlei1990 commented 4 years ago

Hi, I have recently been doing some performance analysis of the EnTK tools on Summit. radical.analytics works great for our purpose, but we think it would be even more useful if it could generate more measurement metrics.

Currently, radical.analytics generates one plain-text file that summarizes the resource usage, plus 4 figures. It would be great if it could output more details in a text format (plain text, JSON, or similar), so that we as users could read and plot whatever we are interested in, in our own way.

Metrics I am curious about:

  1. more details about each task and stage, such as start time, end time, number of CPUs and GPUs, and the total CPU and GPU resources consumed;

  2. metrics that help users better understand the overhead, such as the resources consumed by EnTK itself rather than by the user's tasks.

One thing Andre mentioned is that the tools currently only analyze CPU resources but not GPU resources, if I remember correctly.

I think the tool will be very useful, since it can help us understand the resource usage each time we launch large-scale jobs, and help us monitor potential issues and failures.
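
A minimal, self-contained sketch of the kind of accounting these metrics imply; the `TaskRecord` layout, the function names, and the numbers below are hypothetical and are not part of radical.analytics' output:

```python
# Hypothetical sketch of per-task resource accounting and an "overhead" proxy;
# the record layout and the example numbers are made up for illustration.
from dataclasses import dataclass

@dataclass
class TaskRecord:
    uid: str
    start: float      # task start time (seconds since session start)
    end: float        # task end time   (seconds since session start)
    n_cpus: int       # CPU cores used by the task
    n_gpus: int       # GPUs used by the task

def task_resources(t: TaskRecord):
    """Return (cpu_core_seconds, gpu_seconds) consumed by one task."""
    runtime = t.end - t.start
    return t.n_cpus * runtime, t.n_gpus * runtime

def unused_resources(tasks, alloc_time, alloc_cpus, alloc_gpus):
    """Resource-time held by the allocation but not used by any task:
    a rough proxy for EnTK/RP overhead plus idle resources."""
    used_cpu = sum(task_resources(t)[0] for t in tasks)
    used_gpu = sum(task_resources(t)[1] for t in tasks)
    return (alloc_cpus * alloc_time - used_cpu,
            alloc_gpus * alloc_time - used_gpu)

# toy example: two tasks inside a 600 s allocation of 84 cores / 12 GPUs
tasks = [TaskRecord('task.0000',  30, 280, 42, 6),
         TaskRecord('task.0001', 300, 550, 42, 6)]
print(unused_resources(tasks, 600, 84, 12))
```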

andre-merzky commented 4 years ago

> One thing Andre mentioned is that the tools currently only analyze CPU resources but not GPU resources, if I remember correctly.

That's a misunderstanding: the tools do analyze GPU utilization, too. My point was that the utilization numbers take both GPU and CPU into account, and since the CPU cores are idle, the utilization numbers will look rather bad for GPU-heavy use cases.

Thanks for the list of metrics; we'll provide a script to dig them out.
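
For illustration, a rough back-of-the-envelope sketch of how idle cores depress a combined utilization number; the per-node core and GPU counts are assumed Summit values, and this simple ratio is only a stand-in for whatever formula radical.analytics actually uses:

```python
# Back-of-the-envelope illustration (numbers are assumed, not measured):
# a GPU-heavy task keeps all 6 GPUs of a Summit node busy, but drives them
# with only 6 of the ~42 usable cores.
cores_per_node, gpus_per_node = 42, 6    # assumed per-node resources
busy_cores, busy_gpus         = 6, 6     # what the task actually uses

gpu_util      = busy_gpus  / gpus_per_node                                # 100%
cpu_util      = busy_cores / cores_per_node                               # ~14%
combined_util = (busy_cores + busy_gpus) / (cores_per_node + gpus_per_node)

print(f'GPU: {gpu_util:.0%}  CPU: {cpu_util:.0%}  combined: {combined_util:.0%}')
# the combined number reports ~25% even though every GPU is fully used
```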

wjlei1990 commented 4 years ago

> > One thing Andre mentioned is that the tools currently only analyze CPU resources but not GPU resources, if I remember correctly.
>
> That's a misunderstanding: the tools do analyze GPU utilization, too. My point was that the utilization numbers take both GPU and CPU into account, and since the CPU cores are idle, the utilization numbers will look rather bad for GPU-heavy use cases.
>
> Thanks for the list of metrics; we'll provide a script to dig them out.

Cool. So maybe it should provide analytics data at various granularities.

mturilli commented 4 years ago

@wjlei1990 we discussed your metrics and we think we are ready to give it a go. Would you have a meaningful session to share, with the profiles of both EnTK and RADICAL-Pilot? We would then work on that session to produce some early plots for some of the metrics you have listed. We are planning to do that in a notebook that we would then share, so you would be able to iterate on and expand it as needed.

wjlei1990 commented 4 years ago

Hi Matteo, can you access those files on Summit?

sandbox here `/gpfs/alpine/world-shared/geo111/lei/entk/sandbox/re.session.login5.lei.018358.0000` and here `/gpfs/alpine/world-shared/geo111/lei/entk/re.session.login5.lei.018358.0000`

wjlei1990 commented 4 years ago

A few more metrics that came to my mind:

  1. the number of nodes EnTK asked for to run the job script
  2. the total running time of the LSF job script
  3. the LSF job ID

I am not sure if it is easy for you to extract those values.
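
For what it's worth, some of these can also be captured from inside the LSF batch job itself. A hedged sketch: `LSB_JOBID` and `LSB_MCPU_HOSTS` are standard LSF environment variables, but how the batch/launch node shows up in the host list varies by site, so the node count below is approximate:

```python
# Hedged sketch: record the LSF job id, an approximate node count, and the
# wall time of the job script from inside the batch job itself.
import os
import time

job_id = os.environ.get('LSB_JOBID', 'unknown')

# LSB_MCPU_HOSTS looks like "host1 ncpus1 host2 ncpus2 ...";
# every other token is a host name.
hosts   = os.environ.get('LSB_MCPU_HOSTS', '').split()[::2]
n_nodes = len(set(hosts))

t0 = time.time()
# ... run the EnTK application here ...
print(f'LSF job {job_id}: {n_nodes} node(s), wall time {time.time() - t0:.0f} s')
```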

mturilli commented 4 years ago

From Slack:

sandbox here `/gpfs/alpine/world-shared/geo111/lei/entk/sandbox/re.session.login5.lei.018358.0000`
and here `/gpfs/alpine/world-shared/geo111/lei/entk/re.session.login5.lei.018358.0000`

mturilli commented 4 years ago

@wjlei1990, see https://github.com/radical-experiments/hpcw-princeton/blob/master/analysis/summit_pr_wf.ipynb for an initial analysis with some of the metrics you requested and some others that might be useful.

wjlei1990 commented 4 years ago

Hi Matteo,

I prepared two more running examples, so you may use them as a baseline for EnTK performance evaluation.

  1. normal scale: each task uses 384 GPUs and 64 nodes.

    • client: /gpfs/alpine/world-shared/geo111/lei/entk/re.session.login4.lei.018394.0006
    • sandbox: /gpfs/alpine/world-shared/geo111/lei/entk/sandbox/re.session.login4.lei.018394.0006.tar
  2. small scale: each task uses 6 GPUs and 1 node.

    • client: /gpfs/alpine/world-shared/geo111/lei/entk.source_inversion/re.session.login4.lei.018394.0005
    • sandbox: /gpfs/alpine/world-shared/geo111/lei/entk.source_inversion/sandbox/re.session.login4.lei.018394.0005

mturilli commented 4 years ago

Hi @wjlei1990, @lsawade, please see an initial characterization of RTC performance for the two runs you shared at https://github.com/radical-experiments/hpcw-princeton/blob/master/analysis/incite2020.ipynb

Good news: our resource utilization efficiency is ~97% for the source inversion and ~94% for the structural inversion. Note that these figures do not account for idling CPUs, because those CPUs would still be available if your workload needed to use them.

Do you want/need to go bigger with the structural inversion? I saw you are still using JSRUN. We can probably go (much) bigger by using PRRTE if you need that kind of scale.

wjlei1990 commented 4 years ago

Great! The results look very promising. I will dig into them more tomorrow.

We definitely would like to go bigger, if EnTK and the machine are stable at even larger scales. I can do some large-scale tests in the near future. What scale (in terms of the number of nodes on Summit) are you confident we can push to?

mturilli commented 4 years ago

I think we can double that. With JSRUN, we can run a maximum of 900 tasks. You have run 100 so far, so we could try 200? One thing I noticed: you ran the full workflow in ~21 minutes but asked for 60 minutes. You might be able to run 200 tasks on the same 640 nodes in around 45 minutes.

Ultimately, I think it depends on how big you would like your production runs to be, and on the scale you will target in the proposal.
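
For reference, the back-of-the-envelope arithmetic behind the 45-minute estimate, using only the numbers already quoted in this thread:

```python
# Rough estimate: 100 tasks of 64 nodes each ran in ~21 minutes on 640 nodes,
# i.e. 10 tasks at a time; doubling the task count at the same concurrency
# roughly doubles the makespan, plus some slack for bootstrap/teardown.
tasks_now, makespan_now = 100, 21          # minutes, from the run above
tasks_planned           = 200
nodes, nodes_per_task   = 640, 64

concurrent = nodes // nodes_per_task       # 10 tasks at a time
estimate   = makespan_now * tasks_planned / tasks_now

print(f'{concurrent} concurrent tasks, estimated makespan ~{estimate:.0f} min plus overhead')
```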

wjlei1990 commented 4 years ago

OK. Here are the running cases for strong scaling. I am keeping the total number of tasks at 200 for all cases.

Case I: simul_run_task = 5, total_task = 200

I asked for 64 * 5 = 320 nodes in this case.

Case II: simul_run_task = 10, total_task = 200

I asked for 64 * 10 = 640 nodes in this case.

Case III: simul_run_task = 20, total_task = 200

I asked for 64 * 20 = 1280 nodes in this case. I checked the user portal, which says this job took 0.32 hours, so about 19 minutes. That is a bit longer than I expected... It would be interesting to see your performance analysis. radical.analytics can also pull out the total running time of the job, right?
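
A tentative sketch of pulling the total running time out of a session with radical.analytics; the `Session` constructor arguments and the `ttc` attribute are assumptions that should be checked against the RA documentation for the installed version:

```python
# Tentative sketch only: the ra.Session signature and the ttc attribute may
# differ between radical.analytics releases; verify against the RA docs.
import radical.analytics as ra

# one of the session directories listed above
sid = 're.session.login5.lei.018411.0002'

session = ra.Session(sid, 'radical.pilot')   # assumed constructor signature
print(f'{sid}: total time to completion ~{session.ttc:.0f} s')
```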

mturilli commented 4 years ago

Thank you @wjlei1990. I added the analysis of the new sessions to https://github.com/radical-experiments/hpcw-princeton/blob/master/analysis/incite2020.ipynb. Note that something went wrong with the re.session.login5.lei.018411.0003 session: the units seem not to have run anything. As a result, no unit-related events were produced, and therefore we are not able to quantify how much time it took to run all the units. In turn, this makes it impossible to compare runtime and overhead with the other sessions (re.session.login4.lei.018394.0006 and re.session.login5.lei.018411.0002). Should we try to run at 1280 nodes with the same workflow we used for 640 and 320 nodes?