radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number: 1639694

Performance Benchmark on Summit (slow Summit compute node encountered or not?) #151

Open wjlei1990 opened 2 years ago

wjlei1990 commented 2 years ago

Hi Radical team, I recently ran some benchmark tests to measure EnTK performance on Summit.


Update 1: sandbox files have been copied to /gpfs/alpine/world-shared/geo111/lei/sandbox_log/summit.sandbox_log. Let me know if you run into any permission issues.

The first test is a weak-scaling test: the number of tasks increases linearly with the number of nodes. [Screenshot: Screenshot from 2022-03-04 11-28-31]

Here is a figure of the task timeline: [Figure: incite_2022_weak_ru]

Here is another figure, generated using the Python notebook: [Figure: incite_2022_weak_ovh_ttx]

I noticed that when the node count exceeds 256, the exec_time of some tasks is much longer than expected. At first glance, I thought it might be an issue with Summit.

I have attached the client-side log and job script to this thread: log.weak_scaling.zip. Please help us check.

Here is the python notebook plotting script: plotting_script.zip

Here is the entk python script used to submit jobs on Summit: entk.hrlee.py.txt

wjlei1990 commented 2 years ago

Here is another performance test I ran on Summit, a 'strong-scaling' test: I keep the total number of tasks fixed but vary the number of compute nodes. [Screenshot: Screenshot from 2022-03-04 12-14-03]

[Figure: incite_2022_strong_ru]

Here is another figure: [Figure: incite_2022_strong_ovh_ttx]

Here are the client-side logs: log.strong_scaling.zip

andre-merzky commented 2 years ago

Hi @wjlei1990 - the plots below take a closer look at the contributions of the individual task runtimes. The x-axis is the task ID, the y-axis is the duration of each contribution, and the differently colored series show how much time each task spent in each activity. Scheduling is not shown, but the scheduling data looked as expected and did not contribute to the pattern.

[Figure: re.session.login5.lei.019044.0000_dur]

[Figure: re.session.login5.lei.019044.0000_dur (log scale)]

Both plots show the same data; only the y-axis scale (linear vs. log) differs. The data are for re.session.login5.lei.019044.0000. The dominant contribution, and also the one that seems responsible for the observed pattern, is the duration between the task_exec_start and task_exec_stop events. The only thing that happens between those events is the jsrun command which executes the actual task. All other contributions are overheads introduced by RCT, but they are all in the expected range (note, though, that the pre-exec is fairly large at about 10 seconds; since it is constant across tasks, it does not contribute to the pattern you see).
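To make the analysis above concrete, here is a minimal sketch of how one can compute the task_exec_start-to-task_exec_stop duration per task from a stream of timestamped events. The `(timestamp, event, uid)` tuple format is a simplifying assumption for illustration, not the exact RADICAL profile schema.

```python
# Hypothetical sketch: per-task exec duration from event records.
# Event names match the ones discussed above (task_exec_start /
# task_exec_stop); the record layout is an assumption.

def exec_durations(events):
    """events: iterable of (timestamp, event_name, task_uid) tuples."""
    starts, durations = {}, {}
    for ts, name, uid in sorted(events):      # order by timestamp
        if name == 'task_exec_start':
            starts[uid] = ts
        elif name == 'task_exec_stop' and uid in starts:
            durations[uid] = ts - starts[uid]
    return durations

events = [
    (10.00, 'task_exec_start', 'task.0000'),
    (12.50, 'task_exec_stop',  'task.0000'),
    (10.25, 'task_exec_start', 'task.0001'),
    (31.25, 'task_exec_stop',  'task.0001'),  # a suspiciously slow task
]
print(exec_durations(events))  # -> {'task.0000': 2.5, 'task.0001': 21.0}
```

Plotting these durations against the task ID reproduces the kind of per-task view shown in the figures above.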

It is not possible to tell (from the RCT perspective) whether the delays are caused by the application simply running longer, by a node that is slower for some reason, by a filesystem that stalls, or by any combination of those. If the behavior remains problematic, we should open a support ticket with OLCF.

wjlei1990 commented 2 years ago

Hi @andre-merzky, thanks for the investigation and the detailed explanation. The slowdown is roughly what I suspected: the executable runtime should be almost identical across tasks, so the pattern shouldn't come from tasks having genuinely different runtimes. It could instead be caused by a slow Summit compute node or a stalled filesystem.
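Since all tasks run the same executable, their exec times should cluster tightly, so a quick way to test the slow-node hypothesis is to flag tasks far above the median and group them by the node they ran on. The task-to-node mapping and the 3x-median threshold below are illustrative assumptions, not part of the actual analysis above.

```python
# Hypothetical sketch: group outlier tasks by node to spot slow nodes.
from collections import defaultdict
from statistics import median

def slow_nodes(task_times, task_node, factor=3.0):
    """Return {node: [task_uids]} for tasks exceeding factor * median time."""
    cutoff = factor * median(task_times.values())
    nodes = defaultdict(list)
    for uid, t in task_times.items():
        if t > cutoff:
            nodes[task_node[uid]].append(uid)
    return dict(nodes)

# Toy data: node names are made up, not from the actual runs.
task_times = {'t0': 2.0, 't1': 2.1, 't2': 1.9, 't3': 9.5}
task_node  = {'t0': 'h41n04', 't1': 'h41n04', 't2': 'h22n11', 't3': 'h22n11'}
print(slow_nodes(task_times, task_node))  # -> {'h22n11': ['t3']}
```

If the outliers concentrate on a few nodes, that supports the slow-node explanation; if they are spread evenly, a stalled filesystem becomes more likely.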

In terms of reporting this to Summit, what information should I provide? Is it possible to extract the LSF job ID from the EnTK log, so I can ask them to take a closer look at the system log?

andre-merzky commented 2 years ago

Hi @wjlei1990 ,

Is it possible to extract the LSF job ID from the EnTK log, so I can ask them to take a closer look at the system log?

Yes. It's a bit hacky, but here you go:

$ ls -l
total 24
drwxrwxr-x 3 merzky merzky 4096 Mar  2 04:38 re.session.login1.lei.019045.0001
drwxrwxr-x 3 merzky merzky 4096 Mar  2 04:38 re.session.login4.lei.019039.0000
drwxrwxr-x 3 merzky merzky 4096 Mar  2 04:38 re.session.login4.lei.019040.0000
drwxrwxr-x 3 merzky merzky 4096 Mar  2 04:38 re.session.login4.lei.019040.0001

$ grep -r LSB_DJOB_HOSTFILE * | tr '/' '.' | cut -f 1-6,28 -d . 
re.session.login1.lei.019045.0001.1865442
re.session.login4.lei.019039.0000.1851409
re.session.login4.lei.019040.0000.1853133
re.session.login4.lei.019040.0001.1854493

The output is now <session_id>.<lsf_jobid>.

What information should I provide to them?

The job ID is indeed the important one. I am not sure whether they keep detailed logs; if not, they will likely ask you to run again and to let them know the job number so they can turn on monitoring.

Other useful information would be (IMHO): the size of the job (though they can get that from the job ID), the number and kind of tasks you are running, the expected runtime, and what you observed. I think it's fine to point them to this issue, as it really contains all the relevant information, and I'm happy to chime in if they want to comment or ask for more details here.
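As a less fragile alternative to the fixed field index in the `cut` pipeline above (which breaks if the sandbox path depth changes), the numeric job ID can be pulled out of the LSB_DJOB_HOSTFILE path with a regex. This assumes LSF names its hostfile `<timestamp>.<jobid>.hostfile`, which matches the output shown above; the example path itself is made up for illustration.

```python
# Hypothetical sketch: extract the LSF job ID from an LSB_DJOB_HOSTFILE
# path, assuming the `<timestamp>.<jobid>.hostfile` naming convention.
import re

def lsf_job_id(hostfile_path):
    m = re.search(r'\.(\d+)\.hostfile$', hostfile_path)
    return m.group(1) if m else None

# Illustrative path; the job ID 1865442 is one of those listed above.
path = '/ccs/home/lei/.lsbatch/1646216300.1865442.hostfile'
print(lsf_job_id(path))  # -> 1865442
```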

andre-merzky commented 2 years ago

Hey @wjlei1990 - did you hear back from Summit support on this topic, by any chance?