It produces at least 5-10 billion random numbers per second (writing to memory), and 150+ billion per second when the results are used in place. If I could use the full bandwidth of video memory, it would be 150+ billion per second even with writing to memory.
Original comment by huseyin....@gmail.com
on 14 Jan 2014 at 12:28
If there is always a copy from the GPU to unmanaged main memory, and the "setExplicit" instruction makes a copy from the unmanaged to the managed one, then this is OK. Maybe adding a "not even an unmanaged copy is done automatically" version of "setExplicit" would be nice.
Original comment by huseyin....@gmail.com
on 14 Jan 2014 at 12:49
Thanks for posting this benchmark.
A few observations.
1) There is a lot of data transfer and not much compute in this kernel, so it is hard to extract the full potential performance.
2) Access to chars (as you correctly noted, booleans map to chars in Aparapi) can be slow due to unaligned access. You might consider using the int type to store results. Of course this will force you to transfer more data to the GPU, but int accesses are faster.
3) Ideally the following should be faster (assuming the bytecode does not create a conditional for you):
out[iii] = (resU*resU + resU2*resU2) <= ((ranR-1)*(ranR-1));
This removes the wave divergence resulting from the conditional (see the kernel sketch at the end of this comment).
4) It might be better to find another stride pattern. At present, group members are all writing to the same cache line. Instead of using getGlobalId() directly for each work item, you might find it better to map to another stride pattern to avoid bank/cache write conflicts.
At this point an OpenCL developer would use local memory and barrier hacks to minimize cache-line contention.
We can try some of this with Aparapi, but truthfully I am not a fan of trying to do this from Java, as it creates unnecessary copies if OpenCL is unavailable.
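Purely as an illustration of point 3 (this is not code from the issue: the class name, the tiny LCG and the ranR value are placeholders standing in for the reporter's real generator), a self-contained Aparapi kernel with the branchless write might look like this:
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;
public class BranchlessCircleSketch {
    public static void main(String[] args) {
        final int n = 1024 * 1024;
        final boolean[] out = new boolean[n];  // true = point inside circle
        final int ranR = 32768;                // placeholder radius
        Kernel kernel = new Kernel() {
            @Override public void run() {
                int iii = getGlobalId();
                // A trivial per-item LCG stands in for the real RNG.
                int s = iii * 1103515245 + 12345;
                int resU = (s >>> 17) % ranR;
                s = s * 1103515245 + 12345;
                int resU2 = (s >>> 17) % ranR;
                // Branchless write: every work item executes the same
                // instructions, so there is no wave divergence.
                out[iii] = (resU * resU + resU2 * resU2) <= ((ranR - 1) * (ranR - 1));
            }
        };
        kernel.execute(Range.create(n));
        kernel.dispose();
    }
}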
Gary
Original comment by frost.g...@gmail.com
on 14 Jan 2014 at 1:24
Thanks, Mr. Gary,
I changed the necessary parts as you suggested.
Even with half the total number of threads, the integer version took 0.25 seconds, and when I used the non-branching version it only decreased to 0.24 seconds (but there is still a gain from non-branching).
Then I changed the non-branching version into a pure computation version:
result[iii] = abs((resU*resU+resU2*resU2)-((ranR-1)*(ranR-1))) / ((resU*resU+resU2*resU2)-((ranR-1)*(ranR-1)));
This gives -1 or 0 if the point is inside the circle and 1 if it is outside; then I check those on the host side.
It is still 0.24 seconds.
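For illustration, assuming this is the usual Monte Carlo circle test, the host-side check might look like this (a sketch; 'result' and the -1/0/1 encoding come from the line above, and the pi estimate is an assumption about the benchmark's purpose):
// Host-side tally of in-circle points.
long inside = 0;
for (int i = 0; i < result.length; i++) {
    if (result[i] <= 0) {
        inside++;   // -1 or 0 means the point landed inside the circle
    }
}
double piEstimate = 4.0 * (double) inside / result.length;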
The jump from 0.13 seconds to 0.25 seconds shows the memory access time doubling: each int is four times as wide as the 1-byte chars used before, and I halved the array size because the Java heap is not big enough for now (quadrupling the total bytes is bad for my home computer; maybe I need to play with the JVM arguments), so the total number of bytes moved doubled.
Basically, this integer version of the generator algorithm is no different from an array-sum example as far as memory access goes: every thread uses its own cell, which is adjacent to the neighbouring threads' cells.
How can I solve the cache-line overlap issue? I tried using iii*4 instead of iii, but it was many times slower. Should I put everything in local memory and then write the local results back to global memory?
Tugrul.
Original comment by huseyin....@gmail.com
on 14 Jan 2014 at 2:29
Of course, 150 GB/s only matters when the bandwidth is shared with other parts of the program such as OpenGL, DirectX or Mantle. In the real world it is OK. Let me draw what I understand and what I need in a flowchart picture in the attachment. I don't have computer science or any programming training, so I'm sorry if I mix things up.
Tugrul
Original comment by huseyin....@gmail.com
on 14 Jan 2014 at 4:25
Attachments:
To avoid cache collisions you need to make the writes of each group go to a different cache line.
So for each value of id {0..max} you need a function which yields a new int in 0..max which is unique and more than a cache line away from all the others.
You should be able to use getGroupSize(), getGlobalSize() and getGroupId() to help.
Something like this seems to work.
int gid = getGlobalId();   // sequential 0,1,2, etc.
int groupId = getGroupId();
int mappedGid = (gid + groupId * getGroupSize()) % getGlobalSize();
// use array[mappedGid] to store to
// Each gid maps to a unique mappedGid (in range 0..getGlobalSize())
// which is > groupSize away from the others in its group,
// assuming the number of groups > groupSize.
I think ;)
Here is the test code I used to arrive at this mapping:
int size = 256;        // must be a multiple of cacheline
int cacheline = 64;
int groups = size / cacheline;
int[] data = new int[size];
for (int v = 0; v < groups * cacheline; v++) {
    int groupId = v % cacheline;       // position within a cacheline-sized block
    int idx = v + groupId * cacheline;
    data[idx % size] = v;              // consecutive values of v write > a cache line apart
}
for (int v = 0; v < size; v++) {
    System.out.println(v + " " + data[v]);
}
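As a hedged illustration (not code from this issue; 'out' and the squaring step are placeholders for the reporter's real per-item work, and it assumes the same com.amd.aparapi imports as the sketch earlier in the thread), the remapping would sit inside an Aparapi kernel roughly like this:
final int[] out = new int[64 * 1024];
Kernel kernel = new Kernel() {
    @Override public void run() {
        int gid = getGlobalId();
        // Remap so that members of a group write more than a
        // cache line apart instead of to adjacent cells.
        int mappedGid = (gid + getGroupId() * getGroupSize()) % getGlobalSize();
        out[mappedGid] = gid * gid;    // placeholder per-item work
    }
};
kernel.execute(Range.create(out.length, 64));  // e.g. 64-wide groups
kernel.dispose();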
Original comment by frost.g...@gmail.com
on 14 Jan 2014 at 8:59
4-5 GB/s is very good for PCI-e speed anyway.
When I used CodeXL with MSVC C++ (another CL wrapper I'm trying, single-threaded), it reported 1.4 GB/s for buffer transfers. When I disable the profiler, the timings get better (around 2 GB/s), but nowhere near what Aparapi (4.8 GB/s) can do. So does Aparapi use multithreaded copies?
Original comment by huseyin....@gmail.com
on 8 Feb 2014 at 2:18
Original issue reported on code.google.com by
huseyin....@gmail.com
on 14 Jan 2014 at 12:14