Agree upon a dedicated meeting to introduce analytics and see whether we should define a dedicated workflow
@wjlei1990 In case you are interested: https://github.com/radical-cybertools/radical.analytics
I have run a few example scripts from radical.analytics. The output log is here.
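For anyone who wants to reproduce this locally, here is a minimal sketch of loading one of the attached sessions with radical.analytics. The Session constructor arguments have changed between radical.analytics releases, so treat this as an assumption and check the example scripts for the exact signature; the session directory name is just an example from this thread.

```python
import radical.analytics as ra

# Path to an unpacked session directory from one of the attached zip files
# (example name only).
src = 'rp.session.titan-ext1.lei.017694.0006'

# Assumed constructor style; other radical.analytics releases may take the
# arguments differently.
session = ra.Session(src=src, stype='radical.pilot')

# Keep only the compute units and report how many were profiled.
units = session.filter(etype='unit', inplace=False)
print('%d units found in %s' % (len(units.get()), src))
```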
According to our last discussion, I have run a SPECFEM3D GLOBE forward simulation without EnTK to measure the run time. The timings are listed below:
Type | Number of nodes | Time |
---|---|---|
ENTK with 4 forward runs | 4*384 | 754 - 768 sec |
PBS job with 1 forward run | 384 | 284 sec |
PBS job with 1 forward run and soft link to mesh files | 384 | 327 sec |
The run time for ENTK is taken from here.
The PBS run with soft links to mesh files mimics the run procedure of EnTK: the run directory is placed in the sandbox and the mesh files live in the stage-out directory (which is on another filesystem). This run takes slightly longer than the plain single run, in which all files sit in the same directory.
The raw profile files (4*1) are attached. [rp.session.titan-ext4.lei.017645.0001.4.1.zip](https://github.com/radical-collaboration/hpc-workflows/files/2095936/rp.session.titan-ext4.lei.017645.0001.4.1.zip)
Wenjie to run EnTK with multiple pilot sizes / run counts. The idea is to separate startup overhead from possible scalability problems. Number of forward runs: 1, and maybe 2, 4, 8, and other counts.
Hey Wenjie, do share the raw profiles as well. Feel free to upload them as a zip file to this ticket.
Another test run has finished. In this job, only one forward simulation runs at a time, with 4 serial runs, so in total 1 * 4 = 4 forward runs.
Running `06_events.py` shows:
```
------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------
unit.000003: 395.31
unit.000002: 366.87
unit.000001: 379.70
unit.000000: 500.91
unit.000004: 371.31
average : 402.82
```
The first unit (unit.000000) is the mesher.
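Since unit.000000 is the mesher and runs much longer than the solver units, the solver-only average may be the more useful number; a quick sketch that recomputes both averages from the values printed above:

```python
# Exec times (s) as printed by 06_events.py above; unit.000000 is the mesher.
exec_times = {
    'unit.000000': 500.91,
    'unit.000001': 379.70,
    'unit.000002': 366.87,
    'unit.000003': 395.31,
    'unit.000004': 371.31,
}

solver_times = [t for uid, t in exec_times.items() if uid != 'unit.000000']

print('average (all units)   : %.2f' % (sum(exec_times.values()) / len(exec_times)))
print('average (solver only) : %.2f' % (sum(solver_times) / len(solver_times)))
```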
The raw profile files (1*4) are attached. [rp.session.titan-ext1.lei.017694.0006.1.4.zip](https://github.com/radical-collaboration/hpc-workflows/files/2095935/rp.session.titan-ext1.lei.017694.0006.1.4.zip)
@wjlei1990 Quick question: how did you time the runs?
@mpbl I am timing it with `date` commands in the job submission script:

```bash
echo "Start time: `date`"
aprun -n 384 ./bin/xspecfem3D
echo "End time: `date`"
```
Another test run has finished. In this job, 2 forward simulations run at the same time, with 2 serial runs, so in total 2 * 2 = 4 forward simulations.
Running `06_events.py` shows:
```
------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------
unit.000003: 380.92
unit.000002: 363.54
unit.000001: 366.96
unit.000000: 509.10
unit.000004: 356.34
average : 395.37
```
This is comparable to the 1*4 runs.
The raw profile files (2*2) are attached. [rp.session.titan-ext1.lei.017694.0008.2.2.zip](https://github.com/radical-collaboration/hpc-workflows/files/2099559/rp.session.titan-ext1.lei.017694.0008.2.2.zip)
I reran the 4*1 test. The new test shows a different result.
```
------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------
unit.000003: 1068.38
unit.000002: 446.59
unit.000001: 488.00
unit.000000: 1191.80
unit.000004: 383.74
average : 715.70
```
The raw profile files (4*1) are attached. [rp.session.titan-ext4.lei.017695.0001.4.1.zip](https://github.com/radical-collaboration/hpc-workflows/files/2099589/rp.session.titan-ext4.lei.017695.0001.4.1.zip)
Currently, I am running a 4*4 test on Titan.
The 4*4 test run just went through.
```
------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------
unit.000009: 366.98
unit.000008: 462.65
unit.000003: 400.08
unit.000002: 388.48
unit.000001: 368.57
unit.000000: 502.99
unit.000014: 1332.23
unit.000015: 359.94
unit.000005: 366.50
unit.000006: 389.23
unit.000013: 377.17
unit.000010: 365.93
unit.000016: 403.79
unit.000011: 357.54
unit.000004: 373.83
unit.000007: 374.15
unit.000012: 404.69
average : 446.75
```
Most of the units look normal, except unit.000014.
The raw profile files (4*4) are attached. [rp.session.titan-ext4.lei.017695.0008.4.4.zip](https://github.com/radical-collaboration/hpc-workflows/files/2100646/rp.session.titan-ext4.lei.017695.0008.4.4.zip)
We just checked unit.000014; the reason it lasted so long is a GPU node failure.
@wjlei1990 Would you mind checking the mean time per time step to see if this discrepancy for unit 14 is due to a bad node?
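One way to flag such a unit straight from the profile totals is to compare each unit against the median of its batch (the per-time-step check would then just divide these totals by the number of time steps). A rough sketch using the 4*4 numbers listed above:

```python
# Solver exec times (s) from the 4*4 run above (mesher unit.000000 excluded).
exec_times = {
    'unit.000001': 368.57, 'unit.000002': 388.48, 'unit.000003': 400.08,
    'unit.000004': 373.83, 'unit.000005': 366.50, 'unit.000006': 389.23,
    'unit.000007': 374.15, 'unit.000008': 462.65, 'unit.000009': 366.98,
    'unit.000010': 365.93, 'unit.000011': 357.54, 'unit.000012': 404.69,
    'unit.000013': 377.17, 'unit.000014': 1332.23, 'unit.000015': 359.94,
    'unit.000016': 403.79,
}

values = sorted(exec_times.values())
median = values[len(values) // 2]

for uid in sorted(exec_times):
    if exec_times[uid] > 1.5 * median:  # arbitrary threshold for this sketch
        print('%s looks anomalous: %.2f s (median %.2f s)'
              % (uid, exec_times[uid], median))
```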
Another 4*4 run with `resubmit=True`:
```
------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------
unit.000009: 366.41
unit.000008: 361.24
unit.000003: 381.02
unit.000002: 359.46
unit.000001: 370.68
unit.000000: 501.79
unit.000014: 375.48
unit.000015: 369.53
unit.000005: 381.34
unit.000006: 360.39
unit.000013: 377.61
unit.000010: 358.57
unit.000016: 364.44
unit.000011: 362.21
unit.000004: 362.66
unit.000007: 355.17
unit.000012: 354.79
average : 374.28
```
We should add a test with half the physical duration to understand whether the overhead is ~10% of the run time or a fixed ~30 s.
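To spell out the logic of that check: if the overhead is a fixed constant, the EnTK-minus-PBS gap should stay the same when the record length is halved; if it is proportional, the gap should roughly halve too. A back-of-the-envelope sketch (the numbers below are placeholders for the measured averages, not results):

```python
def overhead(entk_time, pbs_time):
    """Per-run overhead: EnTK wall time minus the plain PBS reference (s)."""
    return entk_time - pbs_time

# Placeholder values only; substitute the measured averages from the tables.
gap_full = overhead(entk_time=375.0, pbs_time=345.0)   # full-length run
gap_half = overhead(entk_time=250.0, pbs_time=220.0)   # half-length run

if abs(gap_full - gap_half) < 0.2 * gap_full:
    print('overhead looks constant: %.0f s vs %.0f s' % (gap_full, gap_half))
else:
    print('overhead scales with run length: %.0f s vs %.0f s' % (gap_full, gap_half))
```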
4*4 test with 60 min seismograms (50% of the record length of the previous runs):
```
------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------
unit.000009: 237.33
unit.000008: 235.46
unit.000003: 250.06
unit.000002: 241.07
unit.000001: 254.68
unit.000000: 533.17
unit.000014: 267.83
unit.000015: 236.63
unit.000005: 237.25
unit.000006: 234.15
unit.000013: 255.10
unit.000010: 240.27
unit.000016: 237.81
unit.000011: 238.46
unit.000004: 256.94
unit.000007: 235.97
unit.000012: 236.39
average : 260.50
```
rp.session.titan-ext4.lei.017714.0002.4*4.60min.zip
Type | Number of nodes | Time |
---|---|---|
ENTK with 4 forward runs | 4*384 | 234 - 267 sec |
PBS job with 1 forward run | 384 | 178 sec |
PBS job with 1 forward run and soft link to mesh files | 384 | 165 - 205 sec |
Data points for the third case: 197 sec, 165 sec, 195 sec, 205 sec
Overhead seems constant, as expected by the Radical team. We should launch a full-sized production job as a final validation. (17s shortest period, 384 GPUs).
17s resolution using 384 GPU nodes, 120min simulation, SAVE_FORWARD on
Type | Number of nodes | Time |
---|---|---|
ENTK with 4*4 forward runs | 4*384 | 463 - 512 sec |
PBS job with 1 forward run | 384 | 436 sec |
PBS job with 1 forward run and soft link to mesh files | 384 | 441 - 510 sec |
Detailed log for each unit in ENTK:
```
------------------------------------------------------------------------------
pure exec times (exec_start ... exec_stop)
------------------------------------------------------------------------------
unit.000009: 503.33
unit.000008: 482.82
unit.000003: 470.84
unit.000002: 510.18
unit.000001: 475.43
unit.000000: 513.49
unit.000014: 463.03
unit.000015: 477.81
unit.000005: 508.67
unit.000006: 512.59
unit.000013: 490.78
unit.000010: 475.87
unit.000016: 471.70
unit.000011: 485.46
unit.000004: 511.55
unit.000007: 493.09
unit.000012: 478.49
average : 489.71
```
Runtimes for the PBS job with 1 forward run and soft links to mesh files: 510 s, 448 s, 441 s, 450 s
1) The mesher (unit.000000) takes significantly longer in EnTK than in PBS. I am not sure what causes this. However, since we only run the mesher once, it is not an urgent issue; maybe it is just a one-off. I will keep an eye on it in future runs.
2) With the `save_forward = True` flag, the output files were simply not staged into the right directory; they stayed where the job was launched. Also, EnTK quit with errors even though all the simulations finished without issues. Here is some output from EnTK:
```
...
Syncing task radical.entk.task.0009 with state FAILED
Synced task radical.entk.task.0009 with state FAILED
2018-08-01 01:05:20,112: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : DEBUG : radical.entk.task.0009 with state FAILED synced with AppManager
2018-08-01 01:05:20,113: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Transition of radical.entk.task.0009 to new state FAILED successful
...
...
...
2018-08-01 01:05:20,125: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Terminating enqueue-thread
2018-08-01 01:05:20,126: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : INFO : Enqueue thread terminated
2018-08-01 01:05:20,129: radical.entk.wfprocessor.0000: wfprocessor : enqueue-thread : ERROR : Error in enqueue-thread:
Traceback (most recent call last):
  File "/ccs/home/lei/miniconda2/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 423, in _enqueue
    self._mq_connection.close()
  File "/ccs/home/lei/miniconda2/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 688, in close
    self._flush_output(self._closed_result.is_ready)
  File "/ccs/home/lei/miniconda2/lib/python2.7/site-packages/pika/adapters/blocking_connection.py", line 440, in _flush_output
    raise exceptions.ConnectionClosed()
ConnectionClosed
Exception in thread enqueue-thread:
Traceback (most recent call last):
  File "/ccs/home/lei/miniconda2/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/ccs/home/lei/miniconda2/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/ccs/home/lei/miniconda2/lib/python2.7/site-packages/radical/entk/appman/wfprocessor.py", line 441, in _enqueue
    raise Error(text=ex)
Error: Error:
2018-08-01 01:05:20,137: radical.entk.wfprocessor.0000: wfprocessor : MainThread : INFO : Terminating dequeue-thread
2018-08-01 01:05:20,138: radical.entk.wfprocessor.0000: wfprocessor : dequeue-thread : INFO : Terminated dequeue thread
```
The profile test has finished.
In conclusion, across all of our tests we observed an overhead of 20-40 sec (30 sec on average) per unit, regardless of the simulation size and duration. In other words, the overhead of using EnTK is constant.
We are happy with the result, and the issue will be closed.
Is it possible to provide users with tools that visualize the profile files? For example, blocks and timelines showing when each job started and ended, and how many resources each job used. That would make it easier for users to check the status of their jobs.
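radical.analytics (linked at the top of this thread) is the intended tool for this kind of analysis. As a rough illustration of the timeline view being asked for, here is a minimal matplotlib sketch that draws one bar per unit from already-extracted start/stop times; the timestamps below are made up, and in practice they would come from the profiles.

```python
import matplotlib.pyplot as plt

# Hypothetical (start, stop) times in seconds since pilot start, one per unit.
unit_spans = {
    'unit.000000': (10.0, 512.0),   # mesher
    'unit.000001': (15.0, 385.0),
    'unit.000002': (15.0, 378.0),
    'unit.000003': (15.0, 395.0),
}

fig, ax = plt.subplots()
uids = sorted(unit_spans)
for i, uid in enumerate(uids):
    start, stop = unit_spans[uid]
    ax.barh(i, stop - start, left=start)

ax.set_yticks(range(len(uids)))
ax.set_yticklabels(uids)
ax.set_xlabel('time since pilot start [s]')
fig.savefig('unit_timeline.png')
```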