Open cathieO opened 5 years ago
Status:
To do:
Initial performance testing on a rectilinear domain (sinusoidal) on rose.
Domain size: 300x300x250
Tested tile sizes: 2, 4, 8, 16, 32, 64, 128, 256, 512
Tested threads (when applicable): 1, 2, 4, 8
Results are reported as an average of 20 trials.
Baseline is the same domain using the default GrGeomInLoopBoxes macro.
GrGeomInLoopBoxes time statistics:
Time metric | mean (seconds) | ± 1 std. dev. (seconds) |
---|---|---|
real | 187.5037 | 0.4308 |
user | 170.6810 | 0.4656 |
sys | 16.8000 | 0.1567 |
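The speedup and reduction columns in the tables that follow appear to be derived from this baseline. The issue does not state the formulas, so the following is a minimal sketch under two assumptions: speedup is the baseline mean real time divided by the case's mean real time, and reduction in time is the baseline mean minus the case mean. The function names are illustrative only.

```python
BASELINE_REAL_MEAN = 187.5037  # seconds, mean real time from the baseline table


def case_mean_from_reduction(reduction_s, baseline=BASELINE_REAL_MEAN):
    """Recover a case's mean real time from its 'reduction in time' entry.

    Assumes reduction = baseline mean - case mean (not stated in the issue).
    """
    return baseline - reduction_s


def speedup(case_mean_s, baseline=BASELINE_REAL_MEAN):
    """Speedup relative to baseline; values below 1.0 mean the case is slower."""
    return baseline / case_mean_s


# Tile size 2 without parallelism reports a reduction in time of -19.2625 s.
tiled_2 = case_mean_from_reduction(-19.2625)
print(round(tiled_2, 4), round(speedup(tiled_2), 4))
```

Under these assumptions the tile-size-2 entry reproduces its reported average speedup of 0.9068, which suggests the reduction columns are in seconds and that negative values mean the case ran slower than the baseline.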
No parallelism (exploring overhead purely from tiled looping)
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiled_2 | 0.9033 | 0.9104 | 0.9068 | -19.2625 | 0.0923 |
GrGeomInLoopBoxesTotalDomainTiled_4 | 0.9452 | 0.9544 | 0.9498 | -9.9124 | -0.0691 |
GrGeomInLoopBoxesTotalDomainTiled_8 | 0.9561 | 0.9665 | 0.9613 | -7.5503 | -0.1722 |
GrGeomInLoopBoxesTotalDomainTiled_16 | 0.9626 | 0.9725 | 0.9675 | -6.2886 | -0.1137 |
GrGeomInLoopBoxesTotalDomainTiled_32 | 0.9595 | 0.9689 | 0.9642 | -6.9658 | -0.0640 |
GrGeomInLoopBoxesTotalDomainTiled_64 | 0.9608 | 0.9679 | 0.9644 | -6.9310 | 0.1579 |
GrGeomInLoopBoxesTotalDomainTiled_128 | 0.9729 | 0.9832 | 0.9781 | -4.2064 | -0.1421 |
GrGeomInLoopBoxesTotalDomainTiled_256 | 0.9815 | 0.9938 | 0.9876 | -2.3461 | -0.3104 |
GrGeomInLoopBoxesTotalDomainTiled_512 | 0.9948 | 1.0035 | 0.9992 | -0.1583 | 0.0424 |
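To make the source of the tiling overhead concrete, here is a rough sketch of a tiled traversal. It is not the GrGeomInLoopBoxesTotalDomainTiled macro itself, which the issue does not show. It assumes cubic tiles whose edge length is the tile size, clipped at the domain boundary, and all helper names are hypothetical.

```python
from itertools import product


def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def tile_ranges(extent, tile):
    """Yield (start, stop) ranges of width `tile` covering [0, extent)."""
    for start in range(0, extent, tile):
        yield start, min(start + tile, extent)  # the last tile may be partial


def tiled_cells(nx, ny, nz, tile):
    """Visit every (i, j, k) cell of the domain, one tile at a time."""
    tiles = product(tile_ranges(nx, tile), tile_ranges(ny, tile), tile_ranges(nz, tile))
    for (i0, i1), (j0, j1), (k0, k1) in tiles:
        for i in range(i0, i1):
            for j in range(j0, j1):
                for k in range(k0, k1):
                    yield i, j, k


def tile_count(nx, ny, nz, tile):
    """Number of tiles needed to cover an nx x ny x nz domain."""
    return ceil_div(nx, tile) * ceil_div(ny, tile) * ceil_div(nz, tile)


# Small check: tiling changes the visiting order, not the set of cells.
cells = list(tiled_cells(6, 5, 4, tile=4))
print(len(cells), len(set(cells)))

# Tile counts for the 300x300x250 test domain at a few tile sizes.
for size in (2, 16, 512):
    print(size, tile_count(300, 300, 250, size))
```

Under the cubic-tile assumption, size 2 splits the test domain into 2,812,500 tiles while size 512 covers it with a single tile. That would be consistent with the overhead shrinking as the tile size grows and the size-512 case running at roughly baseline speed.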
Parallelism in tiles, tile size 2
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_2_threads-1 | 0.7353 | 0.7448 | 0.7400 | -65.8830 | -0.6141 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_2_threads-2 | 0.6642 | 0.6745 | 0.6693 | -92.6416 | -1.0762 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_2_threads-4 | 0.6606 | 0.6695 | 0.6650 | -94.4418 | -0.8256 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_2_threads-8 | 0.5723 | 0.5782 | 0.5752 | -138.4593 | -0.4704 |
Parallelism in tiles, tile size 4
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_4_threads-1 | 0.9019 | 0.9121 | 0.9070 | -19.2238 | -0.2523 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_4_threads-2 | 1.0313 | 1.0422 | 1.0367 | 6.6463 | -0.0999 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_4_threads-4 | 1.1205 | 1.1320 | 1.1263 | 21.0198 | -0.0395 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_4_threads-8 | 1.1393 | 1.1503 | 1.1448 | 23.7142 | 0.0222 |
Parallelism in tiles, tile size 8
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_8_threads-1 | 0.9342 | 0.9434 | 0.9388 | -12.2230 | -0.0865 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_8_threads-2 | 1.1093 | 1.1200 | 1.1146 | 19.2844 | 0.0031 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_8_threads-4 | 1.2242 | 1.2383 | 1.2312 | 35.2125 | -0.0934 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_8_threads-8 | 1.2850 | 1.3037 | 1.2943 | 42.6340 | -0.2864 |
Parallelism in tiles, tile size 16
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_16_threads-1 | 0.9402 | 0.9502 | 0.9452 | -10.8767 | -0.1591 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_16_threads-2 | 1.1244 | 1.1353 | 1.1298 | 21.5446 | 0.0122 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_16_threads-4 | 1.2479 | 1.2613 | 1.2546 | 38.0464 | -0.0239 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_16_threads-8 | 1.3202 | 1.3331 | 1.3266 | 46.1635 | 0.0709 |
Parallelism in tiles, tile size 32
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_32_threads-1 | 0.9389 | 0.9478 | 0.9433 | -11.2632 | -0.0598 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_32_threads-2 | 1.1208 | 1.1348 | 1.1278 | 21.2432 | -0.2215 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_32_threads-4 | 1.2466 | 1.2611 | 1.2538 | 37.9572 | -0.0878 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_32_threads-8 | 1.3173 | 1.3326 | 1.3249 | 45.9847 | -0.0631 |
Parallelism in tiles, tile size 64
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_64_threads-1 | 0.9360 | 0.9449 | 0.9405 | -11.8725 | -0.0480 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_64_threads-2 | 1.1215 | 1.1312 | 1.1263 | 21.0317 | 0.0914 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_64_threads-4 | 1.2419 | 1.2559 | 1.2489 | 37.3652 | -0.0661 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_64_threads-8 | 1.3096 | 1.3256 | 1.3176 | 45.1935 | -0.1052 |
Parallelism in tiles, tile size 128
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_128_threads-1 | 0.9532 | 0.9609 | 0.9571 | -8.4121 | 0.0976 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_128_threads-2 | 1.1337 | 1.1444 | 1.1391 | 22.8939 | 0.0353 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_128_threads-4 | 1.2549 | 1.2660 | 1.2604 | 38.7386 | 0.1161 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_128_threads-8 | 1.3208 | 1.3334 | 1.3271 | 46.2155 | 0.0865 |
Parallelism in tiles, tile size 256
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_256_threads-1 | 0.9623 | 0.9715 | 0.9669 | -6.4231 | -0.0446 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_256_threads-2 | 1.1406 | 1.1527 | 1.1466 | 23.9787 | -0.0518 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_256_threads-4 | 1.2588 | 1.2716 | 1.2652 | 39.3022 | 0.0234 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_256_threads-8 | 1.3239 | 1.3396 | 1.3317 | 46.7055 | -0.0715 |
Parallelism in tiles, tile size 512
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_512_threads-1 | 0.9680 | 0.9797 | 0.9739 | -5.0336 | -0.2865 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_512_threads-2 | 1.1486 | 1.1583 | 1.1534 | 24.9424 | 0.1206 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_512_threads-4 | 1.2646 | 1.2767 | 1.2706 | 39.9362 | 0.0655 |
GrGeomInLoopBoxesTotalDomainTiledParallelInTiles_512_threads-8 | 1.3272 | 1.3440 | 1.3356 | 47.1131 | -0.1275 |
Parallelism over tiles, tile size 2
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_2_threads-1 | 0.8932 | 0.9018 | 0.8975 | -21.4199 | -0.0843 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_2_threads-2 | 1.0896 | 1.1015 | 1.0955 | 16.3506 | -0.0991 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_2_threads-4 | 1.2222 | 1.2390 | 1.2306 | 35.1308 | -0.2563 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_2_threads-8 | 1.3030 | 1.3212 | 1.3121 | 44.5981 | -0.2333 |
Parallelism over tiles, tile size 4
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_4_threads-1 | 0.9335 | 0.9442 | 0.9388 | -12.2132 | -0.2523 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_4_threads-2 | 1.1198 | 1.1284 | 1.1241 | 20.7010 | 0.1729 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_4_threads-4 | 1.2427 | 1.2572 | 1.2499 | 37.4871 | -0.0947 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_4_threads-8 | 1.3103 | 1.3290 | 1.3196 | 45.4125 | -0.2507 |
Parallelism over tiles, tile size 8
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_8_threads-1 | 0.9465 | 0.9560 | 0.9513 | -9.6060 | -0.1043 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_8_threads-2 | 1.1214 | 1.1370 | 1.1292 | 21.4489 | -0.3321 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_8_threads-4 | 1.2423 | 1.2599 | 1.2510 | 37.6256 | -0.2816 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_8_threads-8 | 1.3156 | 1.3305 | 1.3230 | 45.7819 | -0.0445 |
Parallelism over tiles, tile size 16
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_16_threads-1 | 0.9566 | 0.9654 | 0.9610 | -7.6164 | -0.0112 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_16_threads-2 | 1.1292 | 1.1425 | 1.1359 | 22.4283 | -0.1568 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_16_threads-4 | 1.2451 | 1.2617 | 1.2534 | 37.9051 | -0.2177 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_16_threads-8 | 1.3157 | 1.3252 | 1.3204 | 45.5006 | 0.2445 |
Parallelism over tiles, tile size 32
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_32_threads-1 | 0.9519 | 0.9615 | 0.9567 | -8.4842 | -0.1007 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_32_threads-2 | 1.1185 | 1.1320 | 1.1252 | 20.8675 | -0.1902 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_32_threads-4 | 1.2379 | 1.2541 | 1.2460 | 37.0168 | -0.1988 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_32_threads-8 | 1.3089 | 1.3254 | 1.3171 | 45.1446 | -0.1333 |
Parallelism over tiles, tile size 64
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_64_threads-1 | 0.9487 | 0.9625 | 0.9556 | -8.7184 | -0.5336 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_64_threads-2 | 1.1304 | 1.1382 | 1.1343 | 22.1994 | 0.2461 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_64_threads-4 | 1.2445 | 1.2607 | 1.2526 | 37.8064 | -0.1892 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_64_threads-8 | 1.3083 | 1.3175 | 1.3129 | 44.6905 | 0.2574 |
Parallelism over tiles, tile size 128
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_128_threads-1 | 0.9675 | 0.9778 | 0.9727 | -5.2719 | -0.1424 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_128_threads-2 | 1.1400 | 1.1509 | 1.1454 | 23.8074 | 0.0343 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_128_threads-4 | 1.1798 | 1.2049 | 1.1923 | 30.2350 | -0.8657 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_128_threads-8 | 1.2723 | 1.2850 | 1.2787 | 40.8636 | 0.0410 |
Parallelism over tiles, tile size 256
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_256_threads-1 | 0.9791 | 0.9873 | 0.9832 | -3.2074 | 0.0721 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_256_threads-2 | 1.0206 | 1.0302 | 1.0254 | 4.6405 | 0.0006 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_256_threads-4 | 1.0833 | 1.0937 | 1.0885 | 15.2403 | 0.0071 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_256_threads-8 | 1.0792 | 1.0953 | 1.0872 | 15.0403 | -0.4576 |
Parallelism over tiles, tile size 512
case | min speedup | max speedup | average speedup | reduction in time (s) | reduction in stddev (s) |
---|---|---|---|---|---|
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_512_threads-1 | 0.9864 | 0.9980 | 0.9922 | -1.4791 | -0.2385 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_512_threads-2 | 0.9848 | 0.9955 | 0.9901 | -1.8662 | -0.1588 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_512_threads-4 | 0.9850 | 0.9949 | 0.9899 | -1.9137 | -0.0813 |
GrGeomInLoopBoxesTotalDomainTiledParallelOverTiles_512_threads-8 | 0.9863 | 0.9944 | 0.9903 | -1.8304 | 0.0839 |
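The two strategies differ in what is handed to the threads. The sketch below illustrates only that work decomposition. It is not the ParallelInTiles or ParallelOverTiles macro code, which the issue does not show. It assumes cubic tiles clipped at the domain boundary, uses a small stand-in domain and a placeholder workload, and all names are hypothetical. Python threads will not reproduce the measured speedups.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def tile_ranges(extent, tile):
    """Yield (start, stop) ranges of width `tile` covering [0, extent)."""
    for start in range(0, extent, tile):
        yield start, min(start + tile, extent)


def tiles(nx, ny, nz, tile):
    """All tiles as ((i0, i1), (j0, j1), (k0, k1)) bounds."""
    return list(product(tile_ranges(nx, tile), tile_ranges(ny, tile), tile_ranges(nz, tile)))


def slab_sum(i, j_range, k_range):
    """Placeholder work for one i-slab of a tile: sum i + j + k over its cells."""
    return sum(i + j + k for j in range(*j_range) for k in range(*k_range))


def tile_sum(bounds):
    """Process one whole tile serially."""
    i_range, j_range, k_range = bounds
    return sum(slab_sum(i, j_range, k_range) for i in range(*i_range))


def parallel_over_tiles(all_tiles, threads):
    """Each task is a whole tile, so concurrency is capped by the tile count."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(tile_sum, all_tiles))


def parallel_in_tiles(all_tiles, threads):
    """Tiles are visited in order; the slabs inside each tile are shared out."""
    total = 0
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for i_range, j_range, k_range in all_tiles:
            futures = [pool.submit(slab_sum, i, j_range, k_range) for i in range(*i_range)]
            total += sum(f.result() for f in futures)
    return total


domain = (12, 10, 8)  # small stand-in domain so the sketch runs quickly
for size in (2, 4, 16):
    ts = tiles(*domain, size)
    print(size, len(ts), parallel_over_tiles(ts, 4), parallel_in_tiles(ts, 4))
```

If the real tiles are cubic, the 300x300x250 domain has only four tiles at size 256 and one tile at size 512, so parallelism over tiles would have little or nothing to distribute. That would fit the flat over-tiles speedups at those sizes. Conversely, a size-2 tile holds only eight cells, which may explain why parallelism in tiles is slower than the baseline there.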
Main Observations
The idea is to choose a single tile size that maximizes speedup for both cases. (Could we instead choose a box size that is better for one case, if the gains in that case far outweigh the losses in the other? E.g., if at size 16 both in-box and over-box parallelism achieve 1.32x speedup, but at size 32 in-box achieves 2.0x speedup while over-box achieves only 1.1x.)
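One possible way to make "maximizes both" concrete is to pick the tile size with the best worst-case speedup across the two schemes. The sketch below applies that criterion to the 8-thread average speedups copied from the tables above. The max-min criterion is an assumption for illustration, not something the issue prescribes, and other criteria (for example a weighted sum) would answer the trade-off question differently.

```python
# Average speedup at 8 threads, copied from the tables above.
IN_TILES = {2: 0.5752, 4: 1.1448, 8: 1.2943, 16: 1.3266, 32: 1.3249,
            64: 1.3176, 128: 1.3271, 256: 1.3317, 512: 1.3356}
OVER_TILES = {2: 1.3121, 4: 1.3196, 8: 1.3230, 16: 1.3204, 32: 1.3171,
              64: 1.3129, 128: 1.2787, 256: 1.0872, 512: 0.9903}


def worst_case(size):
    """Speedup obtained at this tile size by whichever scheme does worse."""
    return min(IN_TILES[size], OVER_TILES[size])


# Side-by-side view of both schemes per tile size.
for size in sorted(IN_TILES):
    print(f"{size:>4}  in={IN_TILES[size]:.4f}  over={OVER_TILES[size]:.4f}  "
          f"worst={worst_case(size):.4f}")

# Max-min selection: the tile size whose weaker scheme is strongest.
best = max(IN_TILES, key=worst_case)
print("tile size with the best worst-case speedup:", best)
```

With these numbers the max-min criterion selects tile size 16, where both schemes reach roughly 1.32x.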
Future work:
Immediately: plot the speedup for 8 threads across tile sizes for the two cases.
Soon: box size analysis using...
Test on Ocelote?
Should parallelism be done over boxes or within?
This depends on the configuration of the domain.