Vivek
Thanks for the code, I think it will spawn a couple more bug reports! ;)
First, your code exposed a bytecode to OpenCL conversion error for me (how did
you run it?). That was sad, but I will add the pattern to the JUnit test suite
and see if I can figure out what is going on. javac (Oracle) optimizes back branches
if nested conditionals do not contain elses.
So
if (cond1){
    if (cond2){
        ...
    }
}else{
    if (cond3){
        ...
    }
}
I had seen this previously and *thought* I had fixed it, clearly not.
To work around this I just added dummy else branches to your run() method.
Something like
if (cond1){
    if (cond2){
        ...
    }else{
        temp=temp;
    }
}else{
    if (cond3){
        ...
    }else{
        temp=temp;
    }
}
Oh and I had to initialize temp to 0;
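For reference, here is a minimal sketch of what that workaround looks like inside an Aparapi kernel. This is not the actual run() method from the attached code; the class name, field names, and the cond1/cond2/cond3 comparisons are placeholder assumptions, only the dummy-else structure and the temp=0 initialization come from the workaround above.

import com.amd.aparapi.Kernel;

// Hedged sketch of the dummy-else workaround; the swap logic is illustrative only.
public class DummyElseKernel extends Kernel {
    final int[] theArray;

    public DummyElseKernel(int[] theArray) {
        this.theArray = theArray;
    }

    @Override
    public void run() {
        // run with execute(theArray.length - 1) so the theArray[i + 1] access stays in bounds
        int i = getGlobalId();
        int temp = 0;                              // had to initialize temp to 0
        if ((i % 2) == 0) {                        // cond1: placeholder "ascending" direction test
            if (theArray[i] > theArray[i + 1]) {   // cond2: out of order for ascending
                temp = theArray[i];
                theArray[i] = theArray[i + 1];
                theArray[i + 1] = temp;
            } else {
                temp = temp;                       // dummy else: keeps javac from collapsing the branch
            }
        } else {
            if (theArray[i] < theArray[i + 1]) {   // cond3: out of order for descending
                temp = theArray[i];
                theArray[i] = theArray[i + 1];
                theArray[i + 1] = temp;
            } else {
                temp = temp;                       // dummy else
            }
        }
    }
}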
Now the code will run using OpenCL on CPU and GPU ;)
Sadly, when the array length was > 2^20 (1048576 ints) the OpenCL version of the
code was 'failing' your sanity test. Not sure why this would be; I need to dig
into this. Anyway, for the time being I set the array length to 2^20.
These are the #'s I got after making these changes:
Arraylength = 1048576
SEQ: 5952 ms // Aparapi emulating sequential code
JTP: 1869 ms // Aparapi thread pool (I have 6 cores but Aparapi only uses 4 - power of 2)
CPU: 1571 ms // Aparapi -> OpenCL using CPU mode of AMD driver (OpenCL CPU using all 6 cores)
GPU: 1234 ms // Aparapi -> OpenCL using GPU (5770)
So the GPU won for me. My GPU is a 5770.
The bitonic sort algorithm was actually the test-case that persuaded me to add
explicit buffer management. The nature of the bitonic sort algorithm basically
ends up with a tight loop executing a kernel.
for (....){
    kernel.execute(n);
}
If you look at the aparapi patterns wiki page you will see that this is the
pattern that suggests the use of explicit buffer management.
So the changes that I made to your code (other than a workaround for the
bytecode->OpenCL bug!) were the following (sketched in code after the list):
1) sort.setExplicit(true)
2) sort.put(theArray) before entering the loop
3) sort.get(theArray) after exiting the loop
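In code, those three changes look roughly like this. This is a sketch only; `sort`, `theArray`, and `passes` stand in for the names used in the attached code.

sort.setExplicit(true);             // 1) we will manage buffer transfers ourselves
sort.put(theArray);                 // 2) push theArray to the device once, before the loop
for (int pass = 0; pass < passes; pass++) {
    sort.execute(theArray.length);  // each pass works on the device-resident copy, no per-pass transfers
}
sort.get(theArray);                 // 3) pull the sorted data back once, after the loop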
The #'s for me now are:
SEQ: 5929 ms // Aparapi emulating sequential code
JTP: 1855 ms // Aparapi thread pool (I have 6 cores but Aparapi only uses 4 - power of 2)
CPU: 1327 ms // Aparapi -> OpenCL using CPU mode of AMD driver (OpenCL CPU using all 6 cores)
GPU: 610 ms // Aparapi -> OpenCL using GPU (5770)
So SEQ + JTP did not change (makes sense, no OpenCL involved).
CPU went down a little (buffer transfer costs are NO-OPs when using OpenCL CPU), so
I would not expect to have had a big advantage here.
GPU was much better: 2x CPU for me.
Would you retest using the attached code (with my changes)?
Original comment by frost.g...@gmail.com
on 17 Dec 2011 at 5:03
Attachments:
Mr. Gary
Actual output of the initial program was this:
+------------------------------------
Initializing data...
Execution mode=GPU
retargetting 56 -> 154 to 95
retargetted 56 -> 95
Time taken by kernel :11640 milliseconds
TEST PASSED
-----------------------------------+
As you can see, there are two extra retargeting lines; they tell something about the
bug. I focused mainly on getting better results, so I ignored them. This all ran
without any compilation error.
Secondly, I'm still getting CPU results better than GPU, though they are almost
equal. One reason, I think, is that my CPU hardware has a better performance
specification than my GPU (5470). So never mind, the GPU will eventually win with a
better specification, as yours does.
Explicit buffers really improve things a lot. And I'm seeing dummy else
branches for the first time; I will keep them in mind from now on.
Anyway, the results for me now are:
Array_size: 256
SEQ: 3 ms // Aparapi emulating sequential code
JTP: 41 ms // Aparapi thread pool (all 4 cores)
CPU: 875 ms // Aparapi -> OpenCL using CPU (all 4 cores)
GPU: 916 ms // Aparapi -> OpenCL using GPU (5470)

Array_size: 524288
SEQ: 985 ms // Aparapi emulating sequential code
JTP: 585 ms // Aparapi thread pool (all 4 cores)
CPU: 1277 ms // Aparapi -> OpenCL using CPU (all 4 cores)
GPU: 1388 ms // Aparapi -> OpenCL using GPU (5470)

Array_size: 1048576
SEQ: 2184 ms // Aparapi emulating sequential code
JTP: 1091 ms // Aparapi thread pool (all 4 cores)
CPU: 1724 ms // Aparapi -> OpenCL using CPU (all 4 cores)
GPU: 1956 ms // Aparapi -> OpenCL using GPU (5470)

Array_size: 33554432
JTP: 44123 ms // Aparapi thread pool (all 4 cores)
CPU: 38419 ms // Aparapi -> OpenCL using CPU (all 4 cores)
GPU: 49322 ms // Aparapi -> OpenCL using GPU (5470)
So finally I'm getting almost equivalent results for GPU and CPU for arrays with
length > 256.
Thank you
Vivek
Original comment by vivek.ku...@gmail.com
on 17 Dec 2011 at 8:08
Vivek
So it looks like your CPU has two cores. I think you mentioned that this is an
Intel CPU?
I am still surprised that the GPU does not do better here, although I have no
experience with the 5470; it might not be as performant as I expect.
For the smaller array sizes the cost of the bytecode -> OpenCL conversion is skewing
the data. My guess is that this conversion is ~200ms. This is another example where,
for small data/compute tests, OpenCL is not efficient.
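One way to confirm that (my own suggestion here, not something in the attached code) is to let the first execute() pay the one-time conversion/compile cost and only time later passes; for a sort kernel the input has to be restored and re-put before the timed run. `backup` below is a hypothetical copy of the unsorted input.

sort.execute(theArray.length);                              // first call pays the bytecode->OpenCL conversion (~200ms guess)
System.arraycopy(backup, 0, theArray, 0, theArray.length);  // restore the unsorted input from the hypothetical backup copy
sort.put(theArray);                                         // re-push the restored data (explicit mode)
long start = System.currentTimeMillis();
// ... the real timed loop of sort.execute(...) calls goes here ...
long elapsed = System.currentTimeMillis() - start;
System.out.println("time excluding conversion: " + elapsed + " ms");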
Even for me the JTP mode beats the GPU until we get to 2^16 integers.
This example code has turned out to be a good test workload. I am seeing some
failures on the GPU (where the assertion that array[i-1] <= array[i] is
failing), but only occasionally. I am converting the example now to pure OpenCL
to see if this is a Java/OpenCL artifact; it is weird.
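For context, the sanity test being tripped is just a non-decreasing check over the result, roughly like this (a sketch; the array name is assumed):

// Sketch of the sanity check: the sorted output must satisfy array[i-1] <= array[i] everywhere.
boolean passed = true;
for (int i = 1; i < theArray.length; i++) {
    if (theArray[i - 1] > theArray[i]) {
        passed = false;
        break;
    }
}
System.out.println(passed ? "TEST PASSED" : "TEST FAILED");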
Can you try without the dummy else clause to see if the bytecode to OpenCL
conversion is indeed OK? The two lines of debugging (I need to get rid of those) that
you see are actually from the bug fix I added to address this nested conditional bug.
So maybe it is working for you.
BTW what version of APP_SDK are you using?
Original comment by frost.g...@gmail.com
on 17 Dec 2011 at 10:10
Yes, there are 2 cores but with 4 threads (according to CPU-Z) (snap attached). I also
see 4 different CPU usage graphs in the Windows task manager.
Well, I tried running bitonic sort with the Java binding JOCL (Jogamp's) (source
code provided in their sample set); it took 14 seconds for array size
2<<19, while Aparapi using the GPU takes 2 seconds with explicit buffers.
The JOCL output goes as follows:
+------------------------------------------------------------------------------
Initializing OpenCL...
Initializing OpenCL bitonic sorter...
creating bitonic sort program
checking minimum supported workgroup size
Creating OpenCL memory objects...
4.194304
Initializing data...
Test array length 1048576 (1 arrays in the batch)...
14619ms
1, 2, 3, 4, 8, 10, 11, 11, 14, 14, 15, 15, 15, 16, 16, 16, 17, 19, 20, 22,
...; 1048556 more
TEST PASSED
------------------------------------------------------------------------------+
No, the first dummy else clause is required, otherwise those two lines are
printed. Actually the two lines are due to the first dummy else clause, and that
only with CPU and GPU. I deleted only the second dummy else clause and got
no bug lines in all 4 execution modes. In short, the first dummy else clause is
the lead.
OpenCL 1.1 AMD-APP-SDK-v2.5 (732.1)
I'm attaching the output of clinfo.
Original comment by vivek.ku...@gmail.com
on 18 Dec 2011 at 5:28
Email attachments were skipped here, so I'm re-uploading them.
Original comment by vivek.ku...@gmail.com
on 18 Dec 2011 at 5:31
Attachments:
Original comment by frost.g...@gmail.com
on 14 Feb 2012 at 5:31
I think we can close this. I will re-open if anyone screams.
Original comment by frost.g...@gmail.com
on 21 Feb 2012 at 3:29
Hi Vivek, I tested your implementation on an i7 2600 and a GTX 480... with an array
of 2^27 this GPU goes 17x faster than the CPU... Instead, with an i7 3610 and an ATI
Radeon 7670M the CPU goes 2x faster than the GPU... it's a hardware problem :) PS:
interesting implementation.
Original comment by luigi.da...@gmail.com
on 17 Jan 2014 at 9:22
Original issue reported on code.google.com by
vivek.ku...@gmail.com
on 17 Dec 2011 at 2:31
Attachments: