tigerneil / aparapi

Automatically exported from code.google.com/p/aparapi

Video memory micro benchmarking and GPU random number generator #136

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
This isn't really an issue, but I couldn't find anywhere else to ask.

System specs:

HD 7870 (~153 GB/s) on an FX-8150 (both fully stable at stock settings)

[b]What steps will reproduce the problem?[/b]
1. Doing dummy memory operations in a kernel (a dummy A[gId] += B[gId]); a minimal sketch is shown after this list.
2. Generating a huge number of random values using all GPU cores, with the kernel in explicit get/put mode (over 670M random numbers plus multiple iterations in host code).
3. Haven't tried anything else yet.
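
A minimal Aparapi sketch of the kind of dummy memory kernel described in step 1 (the class name, array names, and size are only illustrative, not the actual benchmark code):

[code]
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class DummyCopy {
    public static void main(String[] args) {
        final int n = 1024 * 1024;        // illustrative element count
        final float[] A = new float[n];
        final float[] B = new float[n];

        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int gId = getGlobalId();
                A[gId] += B[gId];         // one read of B, one read-modify-write of A per work item
            }
        };

        kernel.execute(Range.create(n));  // Range.create(n) lets Aparapi pick the local size
        kernel.dispose();
    }
}
[/code]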

[b]What is the expected output? What do you see instead?[/b]

Expected: ~120 GB/s, including GPU latency and kernel overhead.

Got: ~4.8 GB/s.

Expected: 670 MB+ of video memory usage in the MSI Afterburner hardware monitor.

Got: No increase in the hardware monitor (it stays around 70 MB); Ctrl+Alt+Del shows main system memory usage growing as expected.

[b]What version of the product are you using? On what operating system?[/b]
Downloaded version R#997 from https://code.google.com/p/aparapi/downloads/detail?name=Aparapi_2013_01_23_windows_x86_64.zip&can=2&q=

Windows 7 64-bit Home Premium.

Using Eclipse Kepler Release (build id 20130614-0229) for my 64-bit Java apps.

[b]Please provide any additional information below.[/b]

The kernel seemed to be using a direct host pointer through PCIe even with the explicit option set to true. Is there an option to make the kernel use only GPU memory for temporary sub-process arrays? Something like using only CL_MEM_READ_WRITE and nothing else such as CL_MEM_USE_HOST_PTR, CL_MEM_ALLOC_HOST_PTR, CL_MEM_COPY_HOST_PTR, ... .

Here is my "finding Pi" application, which uses an Aparapi kernel to generate random numbers with a simple LCG algorithm on the GPU, records whether each random coordinate lands inside or outside a circle, then counts those true/false results on the host to reach 3.1415 (and it does, thanks to Aparapi):

[code]

package computePI;

import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class PI_GPU {

    class KernelRnd extends Kernel {

        final boolean[] result = new boolean[1280 * (1024 * 32 * 8 * 2)];

        @Override
        public void run() {
            int iii = getGlobalId();
            int resU = 0, resU2 = 0;
            int random_seed = iii; // Each generator has a unique seed, but later iterations are not guaranteed to stay unique.
            int ranR = 32768;      // this must be the resolution of the LCG, just testing

            random_seed = (random_seed * 1103515245 + 12345);
            resU = (abs(random_seed / (ranR * 2))) % ranR;
            random_seed = (random_seed * 1103515245 + 12345);
            resU2 = (abs(random_seed / (ranR * 2))) % ranR;

            // If the random point (resU, resU2) is inside the circle of radius ranR-1, write true to memory.
            if ((resU * resU + resU2 * resU2) <= ((ranR - 1) * (ranR - 1))) {
                result[iii] = true;
            } else {
                result[iii] = false;
            } // only a single memory write
        }

        public void outPI() {
            long t1 = 0, t2 = 0;

            t1 = System.nanoTime();
            this.get(result);      // okay, it takes some time here to copy through PCIe
            t2 = System.nanoTime();
            System.out.println(((double) (t2 - t1)) / 1000000000.0d);

            int ctr1 = 0, ctr2 = 0;
            for (boolean fl : result) {
                if (fl) { ctr1++; } else { ctr2++; }
            }

            System.out.println(4.0f * (float) ctr1 / (float) (ctr1 + ctr2));
        }
    }

    public static void main(String[] args) {
        PI_GPU pgu = new PI_GPU();
        KernelRnd kernel = pgu.new KernelRnd();
        kernel.setExplicit(true); // I don't want it to copy anything yet; copying happens in outPI().

        Range range = Range.create(1280 * (1024 * 32 * 8 * 2), 256); // 1280 to fully utilize the Pitcairn GPU, the rest to keep it busy
        long t1 = 0, t2 = 0;
        System.out.println(range.getLocalSize(0));    // gives 256 ----> is this for @Local annotated arrays?
        System.out.println(range.getGlobalSize(0));   // gives 671088640
        System.out.println(range.getWorkGroupSize()); // gives 256
        System.out.println(range.getDims());          // gives 1
        System.out.println(range.getNumGroups(0));    // gives 2621440 (is there a limit for this?)

        for (int i = 0; i < 40; i++) { // 1 warm-up + 39 tests
            t1 = System.nanoTime();
            kernel.execute(range);
            t2 = System.nanoTime(); // but this takes 0.13 seconds; it could have been ~0.01 second
                                    // boolean is mapped as char, so 1 byte per element:
                                    // 671088640 elements in 0.13 seconds => 671088640 bytes in 0.13 seconds => 4.8 GB/s
                                    // but I expected ~0.005 seconds as a completion time, so around ~120 GB/s
            System.out.println(((double) (t2 - t1)) / 1000000000.0d);
        }

        kernel.outPI(); // gives 3.1415 as expected (plus some wrong digits, probably because of my non-continuous low-resolution generator)
    }
}

[/code]

Original issue reported on code.google.com by huseyin....@gmail.com on 14 Jan 2014 at 12:14

GoogleCodeExporter commented 8 years ago
It produces at least 5-10 billion random numbers per second when writing to memory, and 150+ billion per second when the results are used in place. If I could use the full bandwidth of video memory, it would be 150+ billion per second even with writing to memory.

Original comment by huseyin....@gmail.com on 14 Jan 2014 at 12:28

GoogleCodeExporter commented 8 years ago
If there is always a copy from the GPU to unmanaged main memory, and the "setExplicit" instruction only controls the copy from the unmanaged to the managed side, then this is OK. Maybe adding a version of "setExplicit" where not even the unmanaged copy is done automatically would be nice.

Original comment by huseyin....@gmail.com on 14 Jan 2014 at 12:49

GoogleCodeExporter commented 8 years ago
Thanks for posting this benchmark. 

A few observations. 

1) There is a lot of data transfer and not much compute in this kernel, so it is hard to extract all of the potential performance.
2) Access to chars (as you correctly noted, booleans map to chars in Aparapi) can be slow due to unaligned access. You might consider using the int type to store results. Of course this forces you to transfer more data to the GPU, but int accesses are faster.
3) Ideally the following should be faster (assuming the bytecode does not create a conditional for you):
       out[iii] = (resU*resU+resU2*resU2) <= ((ranR-1)*(ranR-1));
   because this removes the wave divergence resulting from the conditional. (A combined sketch of points 2 and 3 follows this list.)
4) It might be better to find another stride pattern. At present all group members are writing to the same cache line. Instead of using getGlobalId() directly for each work item, you might find it better to map to another stride pattern to avoid bank/cache write conflicts.
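
To make points 2 and 3 concrete, the store in the kernel above could look roughly like this (a sketch only, untested; it assumes the result array is changed from boolean[] to int[]):

[code]
// Sketch: int[] result instead of boolean[] result (point 2),
// and the in/out test folded into the stored value (point 3).
// In Java the comparison must go through ?: to become an int, but this
// typically lowers to a select rather than a divergent branch.
result[iii] = ((resU * resU + resU2 * resU2) <= ((ranR - 1) * (ranR - 1))) ? 1 : 0;
[/code]

On the host side, outPI() would then count entries equal to 1 instead of true.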

At this point an OpenCL developer would use local mem and barrier hacks to 
minimize cache line contention. 
We can try some of this with Aparapi, but truthfully I am not a fan of trying 
to do this from Java as it creates unnecessary copies if OpenCL is unavailable. 

Gary

Original comment by frost.g...@gmail.com on 14 Jan 2014 at 1:24

GoogleCodeExporter commented 8 years ago
Thanks, Mr. Gary,

I changed the necessary parts as you suggested.

Even with the halved number of total threads, the integer version took 0.25 seconds, and the non-branching version only decreased that to 0.24 seconds (but there is still a gain from non-branching).

Then I changed the non-branching version into a pure-computation version:

result[iii]=abs((resU*resU+resU2*resU2)-((ranR-1)*(ranR-1)))/((resU*resU+resU2*resU2)-((ranR-1)*(ranR-1)));

which gives -1 or 0 if the point is inside the circle and 1 if it is outside; I then check those from the host side.

It is still 0.24 seconds.

Jumping from 0.13 seconds to 0.25 seconds shows the memory access time doubling: ints quadruple the bytes per element, and I halved the element count compared to before because the Java heap size is not enough for now (quadrupling the total bytes is bad for my home computer; maybe I need to play with JVM arguments).

Basically, this integer version of the generator algorithm is no different from an array-sum example as far as memory access goes. Every thread uses its own cell, which neighbours the cells of neighbouring threads.

How can I solve the cache-line overlapping issue? I tried using iii*4 instead of iii, but that was many times slower. Should I put everything in local memory first and then upload the local contents to global memory?

Tugrul.

Original comment by huseyin....@gmail.com on 14 Jan 2014 at 2:29

GoogleCodeExporter commented 8 years ago
Of course, 150 GB/s only really matters when the bandwidth is shared with other parts of the program such as OpenGL, DirectX, or Mantle. In the real world this is OK; let me draw what I understand and what I need in a flowchart picture in the attachment. I don't have computer science or programming training, so I'm sorry if I mix things up.

Tugrul

Original comment by huseyin....@gmail.com on 14 Jan 2014 at 4:25

Attachments:

GoogleCodeExporter commented 8 years ago
To avoid cache collisions you need to make the writes from each group go to a different cache line.

So for each value of id in {0..max} you need a function which yields a new int in 0..max which is unique and more than a cache line away from all the others.

You should be able to use getGroupSize(), getGlobalSize(), and getGroupId() to help.

Something like this seems to work. 

 int gid = getGlobalId();  // sequential 0, 1, 2, ...
 int groupId = getGroupId();
 int mappedGid = (gid + groupId * getGroupSize()) % getGlobalSize();

 // Use array[mappedGid] to store to.
 // Each gid maps to a unique mappedGid (in range 0..getGlobalSize())
 // which is > groupSize away from the others in its group,
 // assuming the number of groups > groupSize.

I think ;) 

here is the test code I used to come to this mapping

 int size = 256;                // must be a multiple of cacheline
 int cacheline = 64;
 int groups = size / cacheline;

 int[] data = new int[size];
 for (int v = 0; v < groups * cacheline; v++) {
     int groupId = v % cacheline;
     int idx = v + groupId * cacheline;
     data[idx % size] = v;
 }
 for (int v = 0; v < size; v++) {
     System.out.println(v + " " + data[v]);
 }

Original comment by frost.g...@gmail.com on 14 Jan 2014 at 8:59

GoogleCodeExporter commented 8 years ago
4-5 GB/s is very good for PCIe speed anyway.

When I used CodeXL with MSVC C++ (another OpenCL wrapper I'm trying, single-threaded), it reports 1.4 GB/s for buffer transfers. When I disable the profiler, timings get better (around 2 GB/s) but nowhere near what Aparapi can do (4.8 GB/s). So does Aparapi use multithreaded copies?

Original comment by huseyin....@gmail.com on 8 Feb 2014 at 2:18