CPU still beats GPU by 4x in bitonic sort. What could be the reason?

GoogleCodeExporter commented 9 years ago


What steps will reproduce the problem?
I went through the forum thread at http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=141035, then coded a bitonic sort myself and tried to get the best performance I could.
The code is attached; it does somewhat better than the version posted there.

What is the expected output? What do you see instead?
I was expecting the GPU to do better than the CPU, but in the end the CPU still beats the GPU by 4x.

What version of the product are you using? On what operating system?
I am using Windows 7 x64, Java JDK 7 and the latest Aparapi.
Hardware used:
Intel(R) Core(TM) i3 CPU M 370 @ 2.40GHz ( or Intel Core i3 370M)
ATI Mobility Radeon HD 5400 Series(1 GB Memory) GPU 
4GB DDR3 RAM

Please provide any additional information below.
The same code runs on the CPU as well as the GPU, with array size = 4194304 and each element less than 1000000.

The CPU produces results in 2.2 seconds, while the GPU takes 11 seconds.

Vivek Kumar Chaubey

Original issue reported on code.google.com by vivek.ku...@gmail.com on 17 Dec 2011 at 2:31

Attachments:

BitonicSort.java

GoogleCodeExporter commented 9 years ago

Vivek

Thanks for the code. I think it will spawn a couple more bug reports! ;)

First, your code exposed a bytecode to OpenCL conversion error for me (how did you run it?). That was sad, but I will add the pattern to the JUnit test suite and see if I can work out what is going on. javac (Oracle) optimizes back branches if nested conditionals do not contain elses.

So 
if (cond1){
   if (cond2){
     ...
   }
}else{
   if (cond3){
     ...
   }
}

I had seen this previously and *thought* I had fixed it; clearly not.

To work around this I just added dummy else branches to your run() method.
Something like

if (cond1){
   if (cond2){
      ...
   }else{
      temp=temp;
   }
}else{
   if (cond3){
      ...
   }else{
      temp=temp;
   }
}

Oh, and I had to initialize temp to 0.

Now the code will run using OpenCL on CPU and GPU ;)

Sadly, when the array length was > 2^20 (1048576 ints), the OpenCL version of the code was 'failing' your sanity test. I am not sure why; I need to dig into this. Anyway, for the time being I set the array length to 2^20.

These are the #'s I got after making these changes:
Arraylength = 1048576
SEQ:  5952 ms   // Aparapi emulating sequential code
JTP:  1869 ms   // Aparapi thread pool (I have 6 cores but Aparapi only uses 4 - power of 2)
CPU:  1571 ms   // Aparapi -> OpenCL using CPU mode of AMD driver (OpenCL CPU using all 6 cores)
GPU:  1234 ms   // Aparapi -> OpenCL using GPU (5770)

So GPU won for me. My GPU is a 5770.

The bitonic sort algorithm was actually the test-case that persuaded me to add 
explicit buffer management.  The nature of the bitonic sort algorithm basically 
ends up with a tight loop executing a kernel. 

for (....){
    kernel.execute(n);
}

If you look at the aparapi patterns wiki page you will see that this is the 
pattern that suggests the use of explicit buffer management. 
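
To make the "tight loop executing a kernel" concrete, here is a minimal plain-Java sketch of a bitonic sort host loop. It is not the attached BitonicSort.java and does not use Aparapi; the class and method names are made up for illustration, and the array length is assumed to be a power of two. Each call to `pass` stands in for one `kernel.execute(n)`.

```java
import java.util.Arrays;

// Illustrative sketch only: names are hypothetical, not taken from the attachment.
final class BitonicHostLoop {

    /** One "kernel launch": every index i does at most one compare-exchange with i ^ j. */
    static void pass(int[] a, int k, int j) {
        for (int i = 0; i < a.length; i++) {      // on a GPU, each i would be one work item
            int partner = i ^ j;
            if (partner > i) {                    // only the lower index of each pair swaps
                boolean ascending = (i & k) == 0; // direction of the sub-sequence i belongs to
                boolean outOfOrder = ascending ? a[i] > a[partner] : a[i] < a[partner];
                if (outOfOrder) {
                    int t = a[i];
                    a[i] = a[partner];
                    a[partner] = t;
                }
            }
        }
    }

    /** Sorts a in place (length assumed to be a power of two); returns the number of passes. */
    static int sort(int[] a) {
        int passes = 0;
        for (int k = 2; k > 0 && k <= a.length; k <<= 1) {  // k > 0 guards against int overflow
            for (int j = k >> 1; j > 0; j >>= 1) {
                pass(a, k, j);                    // this is where kernel.execute(n) would go
                passes++;
            }
        }
        return passes;
    }

    public static void main(String[] args) {
        int[] data = {5, 1, 4, 2, 8, 7, 3, 6};
        int passes = sort(data);
        System.out.println(Arrays.toString(data) + " in " + passes + " passes");
    }
}
```

For an array of 2^m elements the loop makes m(m+1)/2 passes, so a 2^20-element array means 20 × 21 / 2 = 210 kernel launches. Any per-launch overhead is paid that many times.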

So the changes that I made to your code (other than a work around for the 
bytecode->opencl bug!) were 
1) sort.setExplicit(true)
2) sort.put(theArray) before entering the loop
3) sort.get(theArray) after exiting the loop
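
The effect of those three changes can be illustrated with a small counting model. This is a sketch under stated assumptions: `CountingKernel` is a hypothetical stand-in, not Aparapi's Kernel class, and it only borrows the method names used above (`setExplicit`, `put`, `get`, `execute`). It assumes that without explicit mode the array is copied to the device and back around every `execute`, which is the cost explicit buffer management is meant to avoid.

```java
// Illustrative model only: CountingKernel is hypothetical and does no real work.
final class ExplicitBufferSketch {

    /** Counts simulated host<->device copies instead of running a kernel. */
    static final class CountingKernel {
        private boolean explicit;
        int copies;

        void setExplicit(boolean flag) { explicit = flag; }

        void put(int[] data) { copies++; }   // host -> device, requested by the caller

        void get(int[] data) { copies++; }   // device -> host, requested by the caller

        void execute(int range) {
            if (!explicit) {
                copies += 2;                 // assumption: implicit copy in and out per launch
            }
        }
    }

    /** Runs the bitonic sort launch pattern for n elements and returns the copy count. */
    static int run(boolean explicit, int n) {
        int[] theArray = new int[n];
        CountingKernel sort = new CountingKernel();
        sort.setExplicit(explicit);                  // 1) setExplicit(true)
        if (explicit) sort.put(theArray);            // 2) put before entering the loop
        for (int k = 2; k > 0 && k <= n; k <<= 1) {
            for (int j = k >> 1; j > 0; j >>= 1) {
                sort.execute(n);
            }
        }
        if (explicit) sort.get(theArray);            // 3) get after exiting the loop
        return sort.copies;
    }

    public static void main(String[] args) {
        System.out.println("implicit copies: " + run(false, 1 << 20));
        System.out.println("explicit copies: " + run(true, 1 << 20));
    }
}
```

Under that assumption, a 2^20-element sort (210 launches) makes 420 copies in implicit mode and 2 in explicit mode. The real saving depends on the driver and device.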

The #'s for me now are:

SEQ:  5929 ms   // Aparapi emulating sequential code
JTP:  1855 ms   // Aparapi thread pool (I have 6 cores but Aparapi only uses 4 - power of 2)
CPU:  1327 ms   // Aparapi -> OpenCL using CPU mode of AMD driver (OpenCL CPU using all 6 cores)
GPU:   610 ms   // Aparapi -> OpenCL using GPU (5770)

So SEQ and JTP did not change (makes sense, no OpenCL involved).

CPU went down a little. Buffer transfer costs are NO-OPs when using OpenCL CPU, so I would not expect explicit buffer management to have a big advantage here.

GPU was much better: about 2x faster than the OpenCL CPU mode for me.

Would you retest using the attached code (with my changes)?

Original comment by frost.g...@gmail.com on 17 Dec 2011 at 5:03

Attachments:

BitonicSort.java

GoogleCodeExporter commented 9 years ago

Mr. Gary

Actual output of the initial program was this:
+------------------------------------
Initializing data...
Execution mode=GPU
retargetting 56 -> 154 to 95
retargetted 56 -> 95

 Time taken by kernel :11640 milliseconds
TEST PASSED

-----------------------------------+
As you can see, there are two extra retargeting lines, which hint at the bug. I was focused mainly on getting better results, so I ignored them. All of this ran without any compilation error.

Secondly, I am still getting better results on the CPU than on the GPU, though they are almost equal. One reason, I think, is that my CPU has a better performance specification than my GPU (5470). So never mind; a GPU with a better specification, like yours, will eventually win.

Explicit buffer management really improves things a lot. This is also the first time I have seen dummy else branches; I will keep them in mind from now on.
Anyway, the results for me now are:

Array_size:     256 
SEQ:     3 ms   // Aparapi emulating sequential code
JTP:    41 ms   // Aparapi thread pool (all 4 cores)) 
CPU:   875 ms   // Aparapi-> (OpenCL using CPU all 4 cores)
GPU:   916 ms   // Aparapi ->OpenCL using GPU (5470) 

Array_size:  524288 
SEQ:   985 ms   // Aparapi emulating sequential code
JTP:   585 ms   // Aparapi thread pool (all 4 cores)) 
CPU:  1277 ms   // Aparapi-> (OpenCL using CPU all 4 cores)
GPU:  1388 ms   // Aparapi ->OpenCL using GPU (5470)

Array_size: 1048576 
SEQ:  2184 ms   // Aparapi emulating sequential code
JTP:  1091 ms   // Aparapi thread pool (all 4 cores)) 
CPU:  1724 ms   // Aparapi-> (OpenCL using CPU all 4 cores)
GPU:  1956 ms   // Aparapi ->OpenCL using GPU (5470)

Array_size:33554432 
JTP: 44123 ms   // Aparapi thread pool (all 4 cores)) 
CPU: 38419 ms   // Aparapi-> (OpenCL using CPU all 4 cores)
GPU: 49322 ms   // Aparapi ->OpenCL using GPU (5470)

So finally I am getting almost equivalent results for GPU and CPU for arrays with length > 256.

Thank you
Vivek

Original comment by vivek.ku...@gmail.com on 17 Dec 2011 at 8:08

GoogleCodeExporter commented 9 years ago

Vivek

So it looks like your CPU has two cores.  I think you mentioned that this is an 
Intel CPU?

I am still surprised that the GPU does not do better here, although I have no experience with the 5470. It might not be as performant as I expect.

For the smaller array sizes the cost of the bytecode -> OpenCL conversion is skewing the data. My guess is that this conversion takes ~200 ms. This is another example showing that OpenCL is not efficient for small data/compute tests.

Even for me the JTP mode beats the GPU until we get to 2^16 integers.

This example code has turned out to be a good test workload. I am seeing some failures on the GPU (where the assertion that array[i-1]<=array[i] is failing), but only occasionally. I am now converting the example to pure OpenCL to see if this is a Java/OpenCL artifact. It is weird.
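
For reference, the ordering assertion mentioned above can be written as a small standalone check. This is a sketch, not the sanity test from the attached file; the names are hypothetical. The second method is an extra assumption on my part: an array can be in order and still hold the wrong values, so comparing against a CPU-sorted copy of the input catches lost or duplicated elements as well.

```java
import java.util.Arrays;

// Illustrative sketch only: not the sanity test from the attachment.
final class SortedCheck {

    /** Returns the first i with array[i-1] > array[i], or -1 if the array is in order. */
    static int firstUnsortedIndex(int[] array) {
        for (int i = 1; i < array.length; i++) {
            if (array[i - 1] > array[i]) {
                return i;
            }
        }
        return -1;
    }

    /** True if result holds exactly the values of original, in ascending order. */
    static boolean matchesReference(int[] original, int[] result) {
        int[] reference = original.clone();
        Arrays.sort(reference);               // trusted CPU sort as the reference
        return Arrays.equals(reference, result);
    }

    public static void main(String[] args) {
        int[] input = {3, 1, 2};
        int[] output = {1, 2, 3};
        System.out.println(firstUnsortedIndex(output));      // -1 means in order
        System.out.println(matchesReference(input, output));
    }
}
```

Printing the failing index rather than only pass/fail may help show whether the occasional GPU failures cluster at particular positions.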

Can you try without the dummy else clause to see if the bytecode to OpenCL conversion is indeed OK? The two lines of debugging output that you see (I need to get rid of those) are actually from the bug fix I added to address this nested conditional bug. So maybe it is working for you.

BTW what version of APP_SDK are you using?

Original comment by frost.g...@gmail.com on 17 Dec 2011 at 10:10

GoogleCodeExporter commented 9 years ago

Yes, there are 2 cores but 4 threads (according to CPU-Z; snapshot attached). I also see 4 separate CPU usage graphs in Windows Task Manager.

I also tried running the bitonic sort from the JOCL Java binding (JogAmp's; the source code is provided in their sample set). It took 14 seconds for array size 2<<19, while Aparapi on the GPU takes 2 seconds with explicit buffer management.

The JOCL output is as follows:
+-------------------------------------------------------------------------------
------------------------------------------------
Initializing OpenCL...
Initializing OpenCL bitonic sorter...
    creating bitonic sort program
    checking minimum supported workgroup size
Creating OpenCL memory objects...
4.194304
Initializing data...

Test array length 1048576 (1 arrays in the batch)...
14619ms
1, 2, 3, 4, 8, 10, 11, 11, 14, 14, 15, 15, 15, 16, 16, 16, 17, 19, 20, 22,
...; 1048556 more

TEST PASSED
--------------------------------------------------------------------------------
----------------------------------------------+

No, the first dummy else clause is required; otherwise those two lines are printed. The two lines depend on the first dummy else clause, and even then they appear only in CPU and GPU modes. I deleted only the second dummy else clause and got no debug lines in any of the 4 execution modes. In short, the first dummy else clause is the one that matters.

OpenCL 1.1 AMD-APP-SDK-v2.5 (732.1)
I am attaching the output of clinfo.

Original comment by vivek.ku...@gmail.com on 18 Dec 2011 at 5:28

GoogleCodeExporter commented 9 years ago

Email attachments are skipped here, so I am re-uploading them.

Original comment by vivek.ku...@gmail.com on 18 Dec 2011 at 5:31

Attachments:

GoogleCodeExporter commented 9 years ago

Original comment by frost.g...@gmail.com on 14 Feb 2012 at 5:31

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

I think we can close this. I will re-open if anyone screams.

Original comment by frost.g...@gmail.com on 21 Feb 2012 at 3:29

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

Hi Vivek, I tested your implementation on an i7 2600 and a GTX 480. With an array of 2^27 the GPU runs 17x faster than the CPU. With an i7 3610 and an ATI Radeon 7670M, on the other hand, the CPU runs 2x faster than the GPU. It's a hardware problem :) PS: interesting implementation.

Original comment by luigi.da...@gmail.com on 17 Jan 2014 at 9:22

rzel / aparapi

CPU still beats GPU by 4x in bitonic sort. What could be the reason? #28