Sibernetic not benefiting from multiple processors

vellamike commented 10 years ago

When I run on a 4 processor machine (16 cores each) all cores are utilized at approximately 60%. However the frame rate (~3.5FPS with no graphics (-no_g flag)) is indicative of only one processor being effectively utilized.

CORRECTION - The actual frame rate with -no_g is 4.4FPS; the 3.5FPS was with the -l_to flag (every 10 steps the entire simulation is halted to write to disk!).

@skhayrulin reports that on his 8-core i7 processor the simulation executes at 2.5-3FPS. Given that there are 8x the number of cores the approximately 1.5x speedup when compared to the i7 seems unreasonable?

Neurophile commented 10 years ago

What is the model of CPU?

vellamike commented 10 years ago

AMD Opteron 6272

vellamike commented 10 years ago

I have updated the main issue with some important info, repeated here since I know many of you read this on email:

CORRECTION - The actual frame rate with -no_g is 4.4FPS; the 3.5FPS was with the -l_to flag (every 10 steps the entire simulation is halted to write to disk!).

@skhayrulin reports that on his 8-core i7 processor the simulation executes at 2.5-3FPS. Given that there are 8x the number of cores the approximately 1.5x speedup when compared to the i7 seems unreasonable?

Neurophile commented 10 years ago

There is a definitive, quantitative measure for this: Amdahl's law (http://en.wikipedia.org/wiki/Amdahl's_law). The speedup for n cores vs 1 core is given by

S = 1/(B + (1/n)*(1-B))

where B is the fraction of the algorithm that is strictly serial.

We first need to normalize any two platforms by comparing one of the strictly serial portions of the code. In the current source, the log line _runSort xx.xxx ms corresponds to C++ code run on the CPU, which I am assuming is single-threaded, so that is a good comparison point. On my dual-core laptop with -no_g and the terminal window minimized (important because the terminal logging takes significant time to display), I get ~1.9 fps and the sort takes ~15 ms.
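A minimal sketch of the formula above, for anyone who wants to plug in numbers. Python is used purely for illustration, and the function name `amdahl_speedup` is made up here; it is not part of the Sibernetic source. The B values in the loop are arbitrary examples, not measurements.

```python
def amdahl_speedup(n_cores, serial_fraction):
    """Speedup on n_cores relative to a single core, per Amdahl's law.

    serial_fraction is B in the formula above (0.0 = fully parallel,
    1.0 = fully serial).
    """
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


if __name__ == "__main__":
    # Illustrative B values only; none of these are measured.
    for b in (0.0, 0.01, 0.05):
        print(f"B = {b:.2f}: S(64) = {amdahl_speedup(64, b):.1f}")
```

With B = 0 the function returns n (perfect scaling), and with B = 1 it returns 1 regardless of core count, which are the two sanity checks worth running first.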

vellamike commented 10 years ago

What is cpu utilization like if you do htop?

Neurophile commented 10 years ago

Also, it is possible that the i7 is taking advantage of the AVX registers which are 256 bits wide, which would reduce the effective core advantage of the Opteron to 4x.

Neurophile commented 10 years ago

CPU on the activity monitor (just a gui frontend for top AFAIK) is 93-94%

a-palyanov commented 10 years ago

By the way, core-i7 has 4 cores, not 8 (it has 8 threads).

vellamike commented 10 years ago

@Neurophile it looks like the Opteron 6272 has AVX registers too

vellamike commented 10 years ago

UPDATE:

Well I'm up to 10.3FPS with the latest version of the code and '-no_g' (and terminal output disabled by detaching tmux screen). However I'm a bit skeptical because all the liquid particles seem to have vanished:

[Image: noliquid]

Any ideas @skhayrulin ? I've reported this (#159).

I've noticed that CPU utilization is around 50%. Some of this low utilization is probably the cores waiting for the serial code to execute? What do you think, @Neurophile?

Here is a typical output from the AMD:

[[ Step 1266 ]]
_runClearBuffers:           3.191 ms
_runHashParticles:          0.379 ms
_runSort:                  15.045 ms
_runSortPostPass:           1.246 ms
_runIndexx:                 0.696 ms
_runIndexPostPass:          1.489 ms
_runFindNeighbors:         10.368 ms
_runPCISPH:                18.537 ms    3 iteration(s)
_readBuffer:               38.866 ms
------------------------------------
_Total_step_time:          89.817 ms
------------------------------------

@Neurophile Looking at Amdahl's law, if we assume that 15 ms is serial and the remaining 75 ms is parallel, then the serial fraction is 15/(75*64+15) = 0.003, so S ~ n, i.e. approximately linear scaling (this is assuming the above snippet isn't nonsense because all the liquid particles have gone away).
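The arithmetic in this estimate can be sketched as follows. It bakes in the same two assumptions as the comment: that _runSort is the only serial part, and that everything else was spread perfectly over all 64 cores. The function name is hypothetical and the inputs are the rounded 15 ms / 90 ms figures from the step log above.

```python
def serial_fraction_from_parallel_run(serial_ms, total_ms, n_cores):
    """Estimate B from one step timed on n_cores.

    Assumes (optimistically) that everything except serial_ms was spread
    perfectly over n_cores, so on one core it would take n_cores times longer.
    Returns (B, estimated single-core step time in ms).
    """
    parallel_ms = total_ms - serial_ms
    single_core_ms = serial_ms + parallel_ms * n_cores
    return serial_ms / single_core_ms, single_core_ms


if __name__ == "__main__":
    # Rounded figures from the step above: ~15 ms sort, ~90 ms total, 64 cores.
    b, single_core_ms = serial_fraction_from_parallel_run(15.0, 90.0, 64)
    print(f"B ~ {b:.4f}")
    print(f"implied speedup ~ {single_core_ms / 90.0:.1f}x")
```

Because the perfect-scaling assumption is built in, this gives a lower bound on B rather than a measurement of it.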

vellamike commented 10 years ago

Some more investigation:

This is from running on a Quad core Intel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz (on the master branch, not the broken one which has no liquid)

[[ Step 8 ]]
_runClearBuffers:      17.681 ms
_runHashParticles:      1.663 ms
_runSort:              26.228 ms
_runSortPostPass:       4.589 ms
_runIndexx:            14.676 ms
_runIndexPostPass:      0.909 ms
_runFindNeighbors:    208.145 ms
_runPCISPH:           313.908 ms    3 iteration(s)
_readBuffer:         1193.700 ms
------------------------------------
_Total_step_time:    1781.498 ms
------------------------------------

And compare it to the 64 core AMD:

[[ Step 8 ]]
_runClearBuffers:       5.024 ms
_runHashParticles:      0.558 ms
_runSort:              29.801 ms
_runSortPostPass:       1.945 ms
_runIndexx:             0.710 ms
_runIndexPostPass:      1.477 ms
_runFindNeighbors:     27.507 ms
_runPCISPH:            42.831 ms    3 iteration(s)
_readBuffer:           75.820 ms
------------------------------------
_Total_step_time:     185.672 ms
------------------------------------

Two things to note:

The quad-core Intel is very slow compared to the i7. It is an older machine, but I'm still surprised by how poor the performance is.
The 64-core AMD machine is approximately 10x faster than the 4-core Intel, which is roughly what you would expect.

Could it be that the Linux Intel OpenCL1.1 drivers are to blame when compared to what seems to be much better performance on Mac and Windows?

Neurophile commented 10 years ago

Interesting results. Compare _runSort and you can see that the single-threaded performance on the two boxes is within about 15% (Core 2 Quad = 26.2 ms, Opteron = 29.8 ms). The various math kernels get a boost of roughly 2x to 20x on the Opteron; _runIndexx gets the biggest boost at ~20x (0.71 ms vs 14.68 ms).
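For anyone who wants to repeat this kernel-by-kernel comparison on their own logs, here is a small sketch. The parsing regex and the function names are assumptions made for illustration (they are not part of Sibernetic); the two logs are the Step 8 outputs pasted earlier in this thread.

```python
import re

# Matches timing lines such as "_runSort:          26.228 ms".
TIMING_LINE = re.compile(r"^_(\w+):\s+([\d.]+)\s+ms")


def parse_timings(log):
    """Return {kernel_name: milliseconds} for each timing line in a step log."""
    timings = {}
    for line in log.splitlines():
        match = TIMING_LINE.match(line.strip())
        if match:
            timings[match.group(1)] = float(match.group(2))
    return timings


def speedups(slow_log, fast_log):
    """Per-kernel ratio slow/fast; values above 1 mean fast_log is quicker."""
    slow, fast = parse_timings(slow_log), parse_timings(fast_log)
    return {name: slow[name] / fast[name] for name in slow if name in fast}


QUAD_CORE_LOG = """
_runClearBuffers:      17.681 ms
_runHashParticles:      1.663 ms
_runSort:          26.228 ms
_runSortPostPass:       4.589 ms
_runIndexx:            14.676 ms
_runIndexPostPass:      0.909 ms
_runFindNeighbors:    208.145 ms
_runPCISPH:           313.908 ms    3 iteration(s)
_readBuffer:         1193.700 ms
_Total_step_time:    1781.498 ms
"""

OPTERON_LOG = """
_runClearBuffers:       5.024 ms
_runHashParticles:      0.558 ms
_runSort:          29.801 ms
_runSortPostPass:       1.945 ms
_runIndexx:             0.710 ms
_runIndexPostPass:      1.477 ms
_runFindNeighbors:     27.507 ms
_runPCISPH:            42.831 ms    3 iteration(s)
_readBuffer:           75.820 ms
_Total_step_time:     185.672 ms
"""

if __name__ == "__main__":
    for name, ratio in speedups(QUAD_CORE_LOG, OPTERON_LOG).items():
        print(f"{name:<18} {ratio:6.2f}x")
```

A ratio below 1 (as for _runSort) means the step is slower on the 64-core box, which is what you would expect from a serial step on lower-clocked cores.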

OpenCL is originally an Apple technology, so it makes sense that it is highly optimized on a Mac. Also, single-thread performance on Haswell is much better; my sort completes in ~14 ms. Are you able to run the latest Intel SDK on your Opteron setup? http://software.intel.com/en-us/vcsource/tools/opencl-sdk-xe

For rough comparison only, some of these numbers are out of whack due to performance tuning experiments I am in the middle of.

Core-i5 4288u 2 physical cores, 4 threads:

_runClearBuffers:       5.730 ms
_runHashParticles:      0.974 ms
_runSort:              14.383 ms
_runSortPostPass:       2.464 ms
_runIndexx:            24.615 ms **old version, new one runs in ~6.5 ms
_runIndexPostPass:      0.357 ms
_runFindNeighbors:    199.189 ms

vellamike commented 10 years ago

I tried installing the latest Intel SDK and it was a painful and unsuccessful process, this is as far as I got:

CL_PLATFORM_VERSION [0]:    OpenCL 1.2 LINUX
ERROR: No OpenCL devices found

I also tried installing the AMD SDK but this failed, with a more colourful error:

 CL_PLATFORM_VERSION [0]:   OpenCL 1.2 AMD-APP (1214.3)
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_NAME [0]:    AMD Opteron(TM) Processor 6272                 
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_MAX_WORK_GROUP_SIZE [0]:     1024
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_MAX_COMPUTE_UNITS [0]:   64
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_GLOBAL_MEM_SIZE [0]:     2032017408
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_GLOBAL_MEM_CACHE_SIZE [0]:   16384
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_LOCAL_MEM_SIZE [0]:  32768
Compilation failed: 
"/tmp/OCLiZv8ZX.cl", line 8: catastrophic error: cannot open source file
          "src//owOpenCLConstant.h"
  #include "src//owOpenCLConstant.h"
                                    ^

1 catastrophic error detected in the compilation of "/tmp/OCLiZv8ZX.cl".
Compilation terminated.

Frontend phase failed compilation.

ERROR: failed to build program

vellamike commented 10 years ago

I wonder if my AMD errors are related to this. I have asked a question on Stackoverflow.

Neurophile commented 10 years ago

Apparently, including header files is not part of the official OpenCL standard. Cut and paste the contents of owOpenCLConstant.h to the top of sphFluid.cl and delete the #include statement. There may be a better long-term solution, but that should get things working with the AMD SDK. I need to ponder the Intel SDK error a little more.

skhayrulin commented 10 years ago

Sorry @vellamike, I forgot about it: I turned off the outer water. Fixing this.

skhayrulin commented 10 years ago

I'm not sure, but I think the latest Intel OpenCL SDK works only with Intel Xeon and Intel Xeon Phi processors. The SDK contains header files, an OpenCL compiler, and drivers for working with the device. I suspect that the drivers @vellamike installed don't work with his processor, and they won't work with the AMD SDK either.

vellamike commented 10 years ago

I think I got the answer on Stack Overflow: http://stackoverflow.com/questions/20639909/opencl1-2-amd-app-catastrophic-error-on-compile-under-linux?noredirect=1#comment30920470_20639909

vellamike commented 10 years ago

UPDATE I now have it working with AMD OpenCL 1.2 drivers and the performance is up to 5.4FPS.

I would like to think about the theoretical maximum speedup a bit more.

Let's look at a single timestep:

_runClearBuffers:           2.491 ms
_runHashParticles:          0.617 ms
_runSort:                  30.158 ms
_runSortPostPass:           5.069 ms
_runIndexx:                 3.540 ms
_runIndexPostPass:          1.108 ms
_runFindNeighbors:        124.054 ms
_runPCISPH:                73.943 ms    3 iteration(s)
_readBuffer:                5.949 ms
------------------------------------
_Total_step_time:         246.928 ms
------------------------------------

runsort time = 30ms
total time = 245ms
serial fraction = 30.0/((245-30)*64+30) = 0.002

Plotting the result:

[Image: speedup]

It's basically linear as expected because the serial component is so small, I get x55 on 64 cores.

So if it runs at 2fps on a dual core device with twice the execution speed (your runsort is approximately 2x faster) we get a theoretical maximum of ~ 27FPS on the Opteron.
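The extrapolation can be sketched like this. All names are hypothetical, and the inputs are the figures quoted in this comment: 2 fps on a dual-core machine, B = 0.002, and a per-core speed ratio of 0.5 because the laptop's _runSort is roughly 2x faster than the Opteron's. With those inputs the sketch gives about 28 FPS, in line with the ~27 FPS estimate; the small difference comes from rounding the 64-core speedup to x55.

```python
def amdahl_speedup(n_cores, serial_fraction):
    """Amdahl's law: speedup on n_cores relative to one core."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


def extrapolate_fps(ref_fps, ref_cores, target_cores, serial_fraction,
                    per_core_speed_ratio):
    """Scale a measured frame rate from a reference machine to a target one.

    per_core_speed_ratio = target per-core speed / reference per-core speed
    (0.5 here, since the reference laptop sorts roughly 2x faster).
    """
    scaling = (amdahl_speedup(target_cores, serial_fraction)
               / amdahl_speedup(ref_cores, serial_fraction))
    return ref_fps * per_core_speed_ratio * scaling


if __name__ == "__main__":
    fps = extrapolate_fps(ref_fps=2.0, ref_cores=2, target_cores=64,
                          serial_fraction=0.002, per_core_speed_ratio=0.5)
    print(f"theoretical maximum on the Opteron: ~{fps:.0f} FPS")
```

The result is only as good as B: the estimate is an upper bound if any step other than _runSort has serial content.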

@Neurophile - is there any obvious problem with my reasoning here?

msasinski commented 10 years ago

@vellamike it may be helpful to gather as much data as possible so below are my results. Later today I will try to run it on XEON E3-1230 for comparison. I will try to run it with both AMD and INTEL implementations.

What I find interesting is that with your 64 compute units you're getting only about 6x speedup.

run with -no_g flag

CL_PLATFORM_VERSION [0]: OpenCL 1.2 AMD-APP (1113.2)
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_NAME [0]: Intel(R) Core(TM) i5 CPU M 450 @ 2.40GHz
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_MAX_WORK_GROUP_SIZE [0]: 1024
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_MAX_COMPUTE_UNITS [0]: 4
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_GLOBAL_MEM_SIZE [0]: -357900288
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_GLOBAL_MEM_CACHE_SIZE [0]: 32768
CL_CONTEXT_PLATFORM [0]: CL_DEVICE_LOCAL_MEM_SIZE [0]: 32768

_runClearBuffers:      11.703 ms
_runHashParticles:      5.332 ms
_runSort:              22.240 ms
_runSortPostPass:       6.843 ms
_runIndexx:            41.799 ms
_runIndexPostPass:      0.800 ms
_runFindNeighbors:    802.229 ms
_runPCISPH:           720.334 ms    3 iteration(s)
_readBuffer:           44.190 ms
------------------------------------
_Total_step_time:    1655.471 ms
------------------------------------

vellamike commented 10 years ago

@msasinski is this the very latest version of the code from master?

msasinski commented 10 years ago

Yes, but it's from master. Should it be switched to electrophysiology?

vellamike commented 10 years ago

Nope, master is the latest.

msasinski commented 10 years ago

A few observations:

After 700 steps, total time is already at 2200 ms on average, but none of the virt, res or shr memory figures is changing.
All cores are running at 97%.

Neurophile commented 10 years ago

@vellamike I think your numbers for Amdahl's are not quite right.

Judging by the numbers from the sample machines we have, and just by looking at the code, the _runPCISPH section has a very high percentage of parallel execution. Start with @msasinski's numbers:

_runSort:           22.240 ms
_runPCISPH:        720.334 ms

Compare to @vellamike's numbers:

_runSort:           30.158 ms
_runPCISPH:         73.943 ms

First find a normalizing factor using the serial baseline: 30/22 = 1.36. Apply that to the parallel code to be examined: 720*1.36 = 979. There are 4 execution units on the slower machine and 64 on the faster, so a factor of 16. Compare that to the speedup we saw: 979/74 = 13. Not bad!

If we apply the same math to the total step time: 1.36*1655 = 2250, and 2250/247 = 9.1x speedup for 64 vs 4 execution units. From this we can back-calculate the B parameter:

(1/(0.015625 + 0.984375*B)) / (1/(0.25 + 0.75*B)) = 9.1

Do all the algebra and B ~ 1.3%.

If we do the math with my results (sort = 15 ms, total = 500ms, 4 execution units) the serial portion is closer to 6%
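The "do all the algebra" step has a closed form, sketched below. Setting the ratio R = S(n_large) / S(n_small) and solving Amdahl's formula for B gives B = (R/n_large - 1/n_small) / ((1 - 1/n_small) - R*(1 - 1/n_large)). The function names are hypothetical, and the normalization by the _runSort ratio carries the same assumption as above: that _runSort is a fair proxy for per-core speed.

```python
def normalized_ratio(slow_total_ms, slow_sort_ms, fast_total_ms, fast_sort_ms):
    """Observed speedup after scaling the slow machine by the _runSort ratio."""
    per_core_factor = fast_sort_ms / slow_sort_ms
    return slow_total_ms * per_core_factor / fast_total_ms


def serial_fraction_from_ratio(ratio, n_small, n_large):
    """Solve S(n_large) / S(n_small) = ratio for B under Amdahl's law."""
    numerator = ratio / n_large - 1.0 / n_small
    denominator = (1.0 - 1.0 / n_small) - ratio * (1.0 - 1.0 / n_large)
    return numerator / denominator


if __name__ == "__main__":
    # 4-unit i5 (1655.471 ms total, 22.240 ms sort) vs
    # 64-unit Opteron (246.928 ms total, 30.158 ms sort).
    r = normalized_ratio(1655.471, 22.240, 246.928, 30.158)
    print(f"ratio ~ {r:.1f}x, B ~ {serial_fraction_from_ratio(r, 4, 64):.1%}")

    # Same calculation with the rounded 15 ms sort / 500 ms total figures.
    r = normalized_ratio(500.0, 15.0, 246.928, 30.158)
    print(f"ratio ~ {r:.1f}x, B ~ {serial_fraction_from_ratio(r, 4, 64):.1%}")
```

A ratio of exactly 16 (the core-count ratio) yields B = 0, which is a quick way to check the algebra.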

There are some confounding factors here, so be cautious about any conclusions you draw. Portions of our code are memory intensive, this will have a separate performance factor than the CPU parts and throw off any scaling.

Neurophile commented 10 years ago

TL;DR version The results to date indicate good use of all cores available.

vellamike commented 10 years ago

Doesn't this confirm my theoretical calculations?

Neurophile commented 10 years ago

Not exactly. Your graph is based on 0.2% which assumed that runSort was the only contributor of serial code. I don't think that assumption is justified. Replot using 1.3% and again at 6%. The asymptote for speedup at .2% is 500x, 1.3% => 77x, 6% => 17x
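The asymptotes quoted here follow from letting n go to infinity in Amdahl's formula, which leaves S = 1/B. A short sketch, with hypothetical function names, that prints the limit next to the speedup at a few core counts for the three B values under discussion:

```python
def amdahl_speedup(n_cores, serial_fraction):
    """Amdahl's law: speedup on n_cores relative to one core."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)


def asymptotic_speedup(serial_fraction):
    """Limit of Amdahl's speedup as the core count goes to infinity: 1/B."""
    return 1.0 / serial_fraction


if __name__ == "__main__":
    for b in (0.002, 0.013, 0.06):
        row = ", ".join(f"S({n}) = {amdahl_speedup(n, b):.1f}"
                        for n in (4, 16, 64, 256))
        print(f"B = {b:.3f}: {row}, limit = {asymptotic_speedup(b):.0f}x")
```

Comparing the S(64) column with the limit shows how much of the available speedup 64 cores already capture at each value of B.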

Amdahl's law reveals the high sensitivity to the serial proportion and the rapidly diminishing returns of increasing core count for algorithms with even a few percent of serial dependence. To get a more accurate number for B, we would have to somehow force the OpenCL code to run single-threaded and compare that to n threads on the same machine.

vellamike commented 10 years ago

I see what you mean now. I'm a bit perplexed that @msasinski appears to have much worse performance than @Neurophile with the same i5 processor?

msasinski commented 10 years ago

@vellamike In reality these two i5s are totally different processors, made 3 years apart, differing in lithography (22 vs 32nm), architecture, memory bandwidth, etc.

vellamike commented 10 years ago

Nonetheless @msasinski, compare your i5:

_runClearBuffers:      11.703 ms
_runHashParticles:      5.332 ms
_runSort:              22.240 ms
_runSortPostPass:       6.843 ms
_runIndexx:            41.799 ms
_runIndexPostPass:      0.800 ms
_runFindNeighbors:    802.229 ms
_runPCISPH:           720.334 ms    3 iteration(s)
_readBuffer:           44.190 ms
------------------------------------
_Total_step_time:    1655.471 ms
------------------------------------

With that from @Neurophile:

_runClearBuffers:       5.730 ms
_runHashParticles:      0.974 ms
_runSort:              14.383 ms
_runSortPostPass:       2.464 ms
_runIndexx:            24.615 ms **old version, new one runs in ~6.5 ms
_runIndexPostPass:      0.357 ms
_runFindNeighbors:    199.189 ms

And the difference is quite remarkable. OTOH @Neurophile was running a tweaked version of the code, so perhaps this is not a fair comparison.

vellamike commented 10 years ago

@Neurophile I have done as you asked and plotted for various serial components of the computation, the results are quite striking:

[Image: speedup2]

Neurophile commented 10 years ago

As usual, a graph really brings it home. It is astonishing how a tiny bit of serial dependence can negatively impact the benefit of multiple cores.

vellamike commented 10 years ago

I would suggest that my initial claim that "the frame rate (~3.5FPS with no graphics (-no_g flag)) is indicative of only one processor being effectively utilized." is incorrect. We have discovered in our investigations (as detailed above) that a combination of factors (type of processor, nonlinear scaling) is probably the cause of slower-than-expected operation.

If you guys agree then I will mark this issue as closed.

openworm / OpenWorm

Sibernetic not benefiting from multiple processors #156