radical-collaboration / hpc-workflows

NSF 16-514 EarthCube Project - Award Number: 1639694

ENTK Performance for Seismic inversion (Profiling) #56

Closed wjlei1990 closed 6 years ago

wjlei1990 commented 6 years ago

Is it possible to provide users with tools that visualize the profile files? For example, blocks and timelines showing when each job started and ended, and how much resource each job used. That would make it easier for users to check the status of their jobs.
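For illustration, something like the timeline below would already help. This is only a sketch with placeholder unit names and start/end times, assuming they have been extracted from the profile files; it is not an existing RADICAL tool:

```python
import matplotlib.pyplot as plt

# Placeholder per-unit (start, end) times in seconds relative to session start;
# in practice these would be extracted from the RADICAL-Pilot profile files.
units = {
    'unit.000000': (0.0, 500.0),
    'unit.000001': (10.0, 390.0),
    'unit.000002': (12.0, 380.0),
}

fig, ax = plt.subplots(figsize=(8, 3))
names = sorted(units)
for row, name in enumerate(names):
    start, end = units[name]
    ax.barh(row, end - start, left=start, height=0.6)   # one bar per unit
ax.set_yticks(range(len(names)))
ax.set_yticklabels(names)
ax.set_xlabel('time since session start [s]')
ax.set_title('unit execution timeline')
plt.tight_layout()
plt.savefig('unit_timeline.png')
```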

mturilli commented 6 years ago

Agreed to hold a dedicated meeting to introduce analytics and to see whether we should define a dedicated workflow.

vivek-bala commented 6 years ago

@wjlei1990 In case you are interested: https://github.com/radical-cybertools/radical.analytics

wjlei1990 commented 6 years ago

I have run a few example scripts from radical.analytics. The output log is here.
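For reference, the "pure exec times" that the example scripts report further down in this thread are just the difference between each unit's exec_start and exec_stop events. A minimal, library-agnostic sketch of that computation, assuming the event timestamps have already been parsed out of the profile files (the records below are placeholders):

```python
from collections import defaultdict

# Placeholder event records: (unit_id, event_name, timestamp_in_seconds).
records = [
    ('unit.000000', 'exec_start',  12.3),
    ('unit.000000', 'exec_stop',  513.2),
    ('unit.000001', 'exec_start',  15.0),
    ('unit.000001', 'exec_stop',  394.7),
]

# Group timestamps by unit, then take exec_stop - exec_start per unit.
events = defaultdict(dict)
for uid, name, ts in records:
    events[uid][name] = ts

durations = {uid: ev['exec_stop'] - ev['exec_start'] for uid, ev in events.items()}
for uid in sorted(durations):
    print('%s: %8.2f' % (uid, durations[uid]))
print('average    : %8.2f' % (sum(durations.values()) / len(durations)))
```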

wjlei1990 commented 6 years ago

Following our last discussion, I have run SPECFEM3D GLOBE forward simulations without ENTK to test the run time. The times are listed below:

| Type | Number of nodes | Time |
| --- | --- | --- |
| ENTK with 4 forward runs | 4*384 | 754 - 768 sec |
| PBS job with 1 forward run | 384 | 284 sec |
| PBS job with 1 forward run and soft link to mesh files | 384 | 327 sec |

The run time for ENTK is taken from here.

The single run with soft links to mesh files mimics the run procedure of ENTK: the run directory is in the sandbox and the mesh files are in the stage-out directory (which is on another filesystem). This takes slightly longer than the single run where all files are in the same directory.
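A rough sketch of how that layout can be set up outside of EnTK; the paths and the DATABASES_MPI directory name are illustrative, not the actual Titan directories:

```python
import os

# Illustrative paths only: 'run_dir' stands for the sandbox-style run directory,
# 'stage_dir' for the stage-out directory on the other filesystem.
run_dir   = '/path/to/scratch/run0001'     # hypothetical
stage_dir = '/path/to/stageout/meshfiles'  # hypothetical

db_dir = os.path.join(run_dir, 'DATABASES_MPI')
if not os.path.isdir(db_dir):
    os.makedirs(db_dir)

# Soft-link the mesh files into the run directory instead of copying them.
for fname in os.listdir(stage_dir):
    dst = os.path.join(db_dir, fname)
    if not os.path.exists(dst):
        os.symlink(os.path.join(stage_dir, fname), dst)
```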

The raw profile files (4*1) are attached. [rp.session.titan-ext4.lei.017645.0001.41.zip](https://github.com/radical-collaboration/hpc-workflows/files/2095936/rp.session.titan-ext4.lei.017645.0001.4.1.zip)

mpbl commented 6 years ago

Wenjie to run EnTK with multiple pilot sizes / run numbers. The idea is to factor out what is due to startup overhead and what might be a scalability problem. Number of forward runs: 1 and maybe 2, 4, 8, and other numbers.
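Once runs at a few sizes are available, one simple way to factor the two apart is to fit total time = fixed startup overhead + per-run cost. A hedged numpy sketch; the numbers below are placeholders, not measurements:

```python
import numpy as np

# Placeholder data: number of forward runs vs. total EnTK wall time [s].
n_runs  = np.array([1, 2, 4, 8])
t_total = np.array([320.0, 330.0, 345.0, 380.0])   # illustrative only

# Linear model t = a + b * n: 'a' approximates a fixed startup overhead,
# 'b' a per-run (scalability-related) cost.
b, a = np.polyfit(n_runs, t_total, 1)
print('fixed overhead ~ %.1f s, per-run cost ~ %.1f s' % (a, b))
```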

vivek-bala commented 6 years ago

Hey Wenjie, please share the raw profiles as well. Feel free to upload them as a zip file to this ticket.

wjlei1990 commented 6 years ago

Another test run is finished. In this job, only 1 forward simulation runs at a time, with 4 serial runs, so 1 * 4 = 4 forward runs in total.

Running 06_events.py shows:

------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------

unit.000003:  395.31
unit.000002:  366.87
unit.000001:  379.70
unit.000000:  500.91
unit.000004:  371.31
average    :  402.82

The first unit with rank 0 is the mesher.

The raw profile files (1*4) are attached. [rp.session.titan-ext1.lei.017694.0006.14.zip](https://github.com/radical-collaboration/hpc-workflows/files/2095935/rp.session.titan-ext1.lei.017694.0006.1.4.zip)

mpbl commented 6 years ago

@wjlei1990 Quick question: how did you time the runs?

wjlei1990 commented 6 years ago

@mpbl

I am using the following commands in the job submission script to measure the time:

echo *Start time: `date`*
aprun -n 384 ./bin/xspecfem3D
echo *End time: `date`*

wjlei1990 commented 6 years ago

Another test run is finished. 2 forward simulations run at the same time, with 2 serial runs, so 2 * 2 = 4 forward simulations in total.

Running 06_events.py shows:

------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------

unit.000003:  380.92
unit.000002:  363.54
unit.000001:  366.96
unit.000000:  509.10
unit.000004:  356.34
average    :  395.37

This is comparable to the 1*4 run.

The raw profile files (2*2) are attached. [rp.session.titan-ext1.lei.017694.0008.22.zip](https://github.com/radical-collaboration/hpc-workflows/files/2099559/rp.session.titan-ext1.lei.017694.0008.2.2.zip)

wjlei1990 commented 6 years ago

I reran the 4*1 test. The new test shows a different result.

------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------

unit.000003: 1068.38
unit.000002:  446.59
unit.000001:  488.00
unit.000000: 1191.80
unit.000004:  383.74
average    :  715.70

The raw profile files (4*1) are attached. [rp.session.titan-ext4.lei.017695.0001.41.zip](https://github.com/radical-collaboration/hpc-workflows/files/2099589/rp.session.titan-ext4.lei.017695.0001.4.1.zip)

Currently, I am running a 4*4 test on Titan.

wjlei1990 commented 6 years ago

The 4*4 test run just finished.

------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------

unit.000009:  366.98
unit.000008:  462.65
unit.000003:  400.08
unit.000002:  388.48
unit.000001:  368.57
unit.000000:  502.99
unit.000014: 1332.23
unit.000015:  359.94
unit.000005:  366.50
unit.000006:  389.23
unit.000013:  377.17
unit.000010:  365.93
unit.000016:  403.79
unit.000011:  357.54
unit.000004:  373.83
unit.000007:  374.15
unit.000012:  404.69
average    :  446.75

Most of the units look normal except unit.000014.

The raw profile files (4*4) are attached. [rp.session.titan-ext4.lei.017695.0008.44.zip](https://github.com/radical-collaboration/hpc-workflows/files/2100646/rp.session.titan-ext4.lei.017695.0008.4.4.zip)


We just checked unit.000014, and it lasted so long because of a GPU node failure.

mpbl commented 6 years ago

@wjlei1990 Would you mind checking the mean time per time step to see if this discrepancy for unit 14 is due to a bad node?
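A rough proxy, if the solver's own per-step timing is not at hand, is to divide each unit's exec time by the number of time steps. NSTEP below is a placeholder (the real value comes from the SPECFEM Par_file / solver output), and part of the exec time is launch and I/O rather than time stepping, so the solver log would give the definitive per-step numbers:

```python
# Exec times [s] taken from the 4*4 output above; NSTEP is hypothetical.
NSTEP = 10000

exec_times = {'unit.000013': 377.17, 'unit.000014': 1332.23, 'unit.000015': 359.94}

for uid, t in sorted(exec_times.items()):
    print('%s: ~%.2f ms per time step' % (uid, 1e3 * t / NSTEP))
# A uniformly higher per-step time for unit.000014 would point to a slow node;
# the solver's internal timing output can confirm this.
```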

wjlei1990 commented 6 years ago

Another 4*4 run, with resubmit=True:

------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------

unit.000009:  366.41
unit.000008:  361.24
unit.000003:  381.02
unit.000002:  359.46
unit.000001:  370.68
unit.000000:  501.79
unit.000014:  375.48
unit.000015:  369.53
unit.000005:  381.34
unit.000006:  360.39
unit.000013:  377.61
unit.000010:  358.57
unit.000016:  364.44
unit.000011:  362.21
unit.000004:  362.66
unit.000007:  355.17
unit.000012:  354.79
average    :  374.28

rp.session.titan-ext4.lei.017696.0001.4*4.zip

mpbl commented 6 years ago

We should add a test with half the physical duration to understand whether the overhead is 10% of the run or a fixed 30 s.
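For concreteness: if EnTK adds a fixed overhead, ENTK_time minus PBS_time should stay roughly the same when the simulation length is halved; if it adds ~10% of the run, the difference should roughly halve too. A tiny sketch of that comparison, with placeholder numbers rather than measurements:

```python
# Placeholder wall times [s]; substitute the measured ENTK and PBS values.
full = {'entk': 480.0, 'pbs': 450.0}   # full-length simulation (hypothetical)
half = {'entk': 255.0, 'pbs': 225.0}   # half-length simulation (hypothetical)

for label, t in (('full', full), ('half', half)):
    diff = t['entk'] - t['pbs']
    print('%s run: overhead %.0f s (%.0f%% of the PBS time)'
          % (label, diff, 100.0 * diff / t['pbs']))
# A roughly constant absolute difference points to a fixed overhead;
# a roughly constant percentage points to a proportional one.
```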

wjlei1990 commented 6 years ago

4*4 test with 60 min seismograms (50% of the time length of the previous runs):

------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------

unit.000009:  237.33
unit.000008:  235.46
unit.000003:  250.06
unit.000002:  241.07
unit.000001:  254.68
unit.000000:  533.17
unit.000014:  267.83
unit.000015:  236.63
unit.000005:  237.25
unit.000006:  234.15
unit.000013:  255.10
unit.000010:  240.27
unit.000016:  237.81
unit.000011:  238.46
unit.000004:  256.94
unit.000007:  235.97
unit.000012:  236.39
average    :  260.50

rp.session.titan-ext4.lei.017714.0002.4*4.60min.zip

| Type | Number of nodes | Time |
| --- | --- | --- |
| ENTK with 4 forward runs | 4*384 | 234 - 267 sec |
| PBS job with 1 forward run | 384 | 178 sec |
| PBS job with 1 forward run and soft link to mesh files | 384 | 165 - 205 sec |

Data points for the 3rd case: 197 sec, 165 sec, 195 sec, 205 sec

mpbl commented 6 years ago

Overhead seems constant, as expected by the RADICAL team. We should launch a full-sized production job as a final validation (17 s shortest period, 384 GPUs).

wjlei1990 commented 6 years ago

Full production run test results:

17 s resolution using 384 GPU nodes, 120 min simulation, SAVE_FORWARD on

| Type | Number of nodes | Time |
| --- | --- | --- |
| ENTK with 4*4 forward runs | 4*384 | 463 - 512 sec |
| PBS job with 1 forward run | 384 | 436 sec |
| PBS job with 1 forward run and soft link to mesh files | 384 | 441 - 510 sec |

Detailed log for each unit in ENTK:

------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------

unit.000009:  503.33
unit.000008:  482.82
unit.000003:  470.84
unit.000002:  510.18
unit.000001:  475.43
unit.000000:  513.49
unit.000014:  463.03
unit.000015:  477.81
unit.000005:  508.67
unit.000006:  512.59
unit.000013:  490.78
unit.000010:  475.87
unit.000016:  471.70
unit.000011:  485.46
unit.000004:  511.55
unit.000007:  493.09
unit.000012:  478.49
average    :  489.71

Runtimes for the PBS job with 1 forward run and soft links to mesh files: 510 s, 448 s, 441 s, 450 s

A few notes:

1) The mesher takes significantly longer in ENTK (unit.000000) than in PBS. I am not sure what causes this. However, since we are only going to run the mesher once, it is not an urgent issue; it may be a one-off incident. I will keep an eye on it in future runs.

2) With the save_forward = True flag, the output files were simply not staged into the right directory; they stay where the job was launched. Also, ENTK quit with errors even though all simulations finished with no issues.

Here is some output from ENTK:

...
Syncing task radical.entk.task.0009 with state FAILED
Synced task radical.entk.task.0009 with state FAILED
2018-08-01 01:05:20,112: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : DEBUG   : radical.entk.task.0009 with state FAILED synced with AppManager
2018-08-01 01:05:20,113: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Transition of radical.entk.task.0009 to new state FAILED successful
...
...
...
2018-08-01 01:05:20,125: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating enqueue-thread
2018-08-01 01:05:20,126: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : INFO    : Enqueue thread terminated
2018-08-01 01:05:20,129: radical.entk.wfprocessor.0000: wfprocessor                     : enqueue-thread : ERROR   : Error in enqueue-thread: 
Traceback (most recent call last):
  File "/ccs/home/lei/miniconda2/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 423, in _enqueue
    self._mq_connection.close()
  File "/ccs/home/lei/miniconda2/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 688, in close
    self._flush_output(self._closed_result.is_ready)
  File "/ccs/home/lei/miniconda2/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 440, in _flush_output
    raise exceptions.ConnectionClosed()
ConnectionClosed

Exception in thread enqueue-thread:
Traceback (most recent call last):
  File "/ccs/home/lei/miniconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/ccs/home/lei/miniconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/ccs/home/lei/miniconda2/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 441, in _enqueue
    raise Error(text=ex)
Error: Error: 

2018-08-01 01:05:20,137: radical.entk.wfprocessor.0000: wfprocessor                     : MainThread     : INFO    : Terminating dequeue-thread
2018-08-01 01:05:20,138: radical.entk.wfprocessor.0000: wfprocessor                     : dequeue-thread : INFO    : Terminated dequeue thread
wjlei1990 commented 6 years ago

The profile test has finished.

In conclusion, across all of our tests we observed an overhead of 20-40 sec (30 sec on average) per unit, regardless of the simulation size and duration. In other words, the overhead of using ENTK is constant.

We are happy with the result, and the issue will be closed.