tigerneil / aparapi

Automatically exported from code.google.com/p/aparapi

Add support for multiple GPUs #23

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
We need Aparapi to support multiple GPUs instead of just the first GPU it finds available.

We have two use cases for this:

- A single workstation with multiple GPUs located on separate cards
- A cluster of computers with 2+ GPUs per node

It would be ideal if we did not have to specify details of our environment and Aparapi/OpenCL would automatically partition the work and distribute it as required.

I believe both CUDA 4.0+ and OpenCL 1.1+ support multi-threaded, multi-GPU environments.

I've attached a small presentation on the OpenCL 1.1 multi-GPU enhancements given at SIGGRAPH 2011.

Of course, it would be nice to see an AMD presentation of this same material 
highlighting Aparapi :)

Original issue reported on code.google.com by ryan.lam...@gmail.com on 25 Nov 2011 at 1:20


GoogleCodeExporter commented 8 years ago
Ryan,

Thanks for the pdf. There was some support for multi-device execution in the code. In fact issue #18 turned out to be centered on this and I ended up backing the multi-device code out. The problem is dissecting the execution arbitrarily. We can easily execute half of the threads on each device and can schedule memory transfers to both devices. However, sometimes this extra processing actually hampers performance: some workloads run faster on one device than if we take advantage of both.
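
To make that concrete, here is roughly what the "half the threads on each device" idea looks like from user code. This is only a sketch against the Device/Range API discussed later in this thread; the listDevices() enumeration and createRange() call are assumptions, Kernel.execute() is synchronous so overlapping the two calls would need one host thread per device, and Aparapi still transfers the whole of both arrays to each device, which is exactly the extra processing that can hamper performance:

   import java.util.List;
   import com.amd.aparapi.Kernel;
   import com.amd.aparapi.device.Device;
   import com.amd.aparapi.device.OpenCLDevice;

   public class SplitExample {
      public static void main(String[] args) {
         final int size = 1 << 20;
         final float[] in = new float[size];
         final float[] out = new float[size];
         final int half = size / 2;

         // Assumes at least two GPUs are present.
         List<OpenCLDevice> gpus = OpenCLDevice.listDevices(Device.TYPE.GPU);
         Device gpu0 = gpus.get(0);
         Device gpu1 = gpus.get(1);

         // One kernel per device, each covering half of the index space;
         // 'base' shifts the logical index so each half lands in the right
         // part of 'out'.
         Kernel k0 = makeKernel(in, out, 0);
         Kernel k1 = makeKernel(in, out, half);

         k0.execute(gpu0.createRange(half));
         k1.execute(gpu1.createRange(half));
      }

      static Kernel makeKernel(final float[] in, final float[] out, final int base) {
         return new Kernel() {
            @Override public void run() {
               final int i = base + getGlobalId();
               out[i] = in[i] * 2f;
            }
         };
      }
   }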

Also, the memory is not coherent between devices, so whether we execute on both or not also depends on the memory access pattern.

Sometimes I think the answer is to let the user select devices and control this explicitly; however, this seems incompatible with Aparapi's goal of "making things easier for the developer".

Do you have a workload that you think will benefit?

I would like to work on this but worry about the performance degradation.

Gary

Original comment by frost.g...@gmail.com on 25 Nov 2011 at 3:22

GoogleCodeExporter commented 8 years ago
Here are some excellent examples of how I think this could work initially:
http://developer.nvidia.com/opencl-sdk-code-samples#oclSimpleMultiGPU

My thoughts are similar to what that example code demonstrates, which is to give the user a number of options (my thoughts below; a sketch of how the switches might be consumed follows the list):

- Enable a single GPU by default (as is now)
- Enable multiple GPUs, with two options
  - User specifies particular GPUs (com.amd.aparapi.enableSpecifiedGPUs)
  - User lets Aparapi automatically use all available GPUs (com.amd.aparapi.enableAllGPUs)
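
Purely as a sketch of how those switches might be consumed (the two property names are the proposals above, not existing Aparapi options, and the listDevices() enumeration is an assumption borrowed from the Device API discussed below):

   // Hypothetical device selection driven by the proposed system properties.
   // (imports assumed: java.util.*, com.amd.aparapi.device.*)
   boolean useAll = Boolean.getBoolean("com.amd.aparapi.enableAllGPUs");
   String specified = System.getProperty("com.amd.aparapi.enableSpecifiedGPUs"); // e.g. "0,2"

   List<OpenCLDevice> gpus = OpenCLDevice.listDevices(Device.TYPE.GPU);
   List<OpenCLDevice> selected = new ArrayList<OpenCLDevice>();
   if (useAll) {
      selected.addAll(gpus);                      // use every available GPU
   } else if (specified != null) {
      for (String idx : specified.split(",")) {   // user-chosen device indices
         selected.add(gpus.get(Integer.parseInt(idx.trim())));
      }
   } else {
      selected.add(gpus.get(0));                  // current behavior: first GPU only
   }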

To answer your other questions:

- Very, very large datasets
- Experimentation on our high-end workstations for different workloads
- Experimentation on our multi-node clusters for different workloads
- Allow our code to execute two different kernels on two different GPUs in parallel

While the last one would be difficult with a simple approach, the first three could potentially benefit.

Original comment by ryan.lam...@gmail.com on 27 Nov 2011 at 7:19

GoogleCodeExporter commented 8 years ago
Thanks for the NVidia link. I had not seen their multi-GPU example before. Actually the NVidia example (if I am reading it correctly) partitions the data buffers between the available GPUs and then executes using separate command queues (essentially launching one kernel per device, with each processing globalSize/#devices work-items).

This will work for some algorithms, but not all. Some will need us to transfer all of the data to each device while executing only globalSize/#devices work-items on each.

For example, Mandelbrot would work using the suggested NVidia approach, whereas NBody (which requires each kernel to see all of the data) would require the second approach.

This is actually the problem that I could not work out how to solve. The code that I added prior to the open-source release actually mapped to the NBody-friendly approach (transfer all data to all GPUs but split the execution between devices).
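
For contrast, the NVidia-style (Mandelbrot-friendly) partitioning can already be approximated from user code by giving each device its own half-sized buffer, so nothing needs to be duplicated. A sketch, with gpu0/gpu1 and createRange() assumed as in the earlier snippet:

   final int half = size / 2;
   final float[] outLo = new float[half];
   final float[] outHi = new float[half];

   Kernel lo = new Kernel() {
      @Override public void run() {
         // device 0 computes indices [0, half)
         outLo[getGlobalId()] = getGlobalId() * 0.5f;
      }
   };
   Kernel hi = new Kernel() {
      @Override public void run() {
         // device 1 computes indices [half, size); note the offset
         outHi[getGlobalId()] = (half + getGlobalId()) * 0.5f;
      }
   };
   lo.execute(gpu0.createRange(half));
   hi.execute(gpu1.createRange(half));
   // stitch outLo and outHi back together on the host

Because each kernel only touches its own half-sized array, only half of the data moves to each device.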

However, if the Kernel writer could tell Aparapi (maybe via an annotation on the Kernel, or maybe a multi-device version of Kernel.execute()), then we might be able to pass this choice on to the developer, with the current mode as the default.
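
Neither mechanism exists today, but the annotation flavor might look something like this (all names purely hypothetical):

   import java.lang.annotation.Retention;
   import java.lang.annotation.RetentionPolicy;
   import com.amd.aparapi.Kernel;

   // Hypothetical hint telling Aparapi how buffers relate to devices.
   @Retention(RetentionPolicy.RUNTIME)
   @interface MultiDevice {
      Policy value();
      enum Policy {
         PARTITION_BUFFERS,  // Mandelbrot-style: each device gets a slice of the data
         REPLICATE_BUFFERS   // NBody-style: every device sees all of the data
      }
   }

   @MultiDevice(MultiDevice.Policy.REPLICATE_BUFFERS)
   class NBodyKernel extends Kernel {
      @Override public void run() {
         // position data read by every work-item on every device
      }
   }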

BTW this gets really interesting with global barriers ;) and with non-uniform group sizes (i.e. when the devices are different).

Gary 

Original comment by frost.g...@gmail.com on 28 Nov 2011 at 5:14

GoogleCodeExporter commented 8 years ago
A few extra use cases:
- The first OpenCL device is not necessarily the fastest one in a system, so the ability to choose would be beneficial in some situations.
- Some people have devices with unstable drivers, in which case it would benefit end-users if they could tell apps to blacklist malfunctioning OpenCL devices.

Original comment by oakw...@minousoft.com on 18 Sep 2012 at 7:28

GoogleCodeExporter commented 8 years ago
Oakwhiz, 

Have you tried the new Device API for targeting a specific device?

Will this cover your needs?

You can either select a device yourself (the devices.getAll() enumeration here is schematic; see the DeviceProposal link below for the actual API):

  Device chosen = null;
  // Walk the available devices and pick the first AMD GPU.
  for (Device device : devices.getAll()) {
     if (device.getVendor().contains("AMD") && device.isGPU()) {
        chosen = device;
        break;
     }
  }
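
The same loop also covers the blacklist case from the previous comment; a rough sketch (the vendor string is illustrative only, and the enumeration is the same schematic call as above):

  // Hypothetical: skip devices whose vendor string is known to misbehave.
  // (imports assumed: java.util.*)
  Set<String> blacklistedVendors = new HashSet<String>(Arrays.asList("SomeBuggyVendor"));
  Device chosen = null;
  for (Device device : devices.getAll()) {
     if (device.isGPU() && !blacklistedVendors.contains(device.getVendor())) {
        chosen = device;
        break;
     }
  }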

Or use one of the shortcuts:

   Device device = Device.best();

Then when you create a Range using the selected device, this forces your kernel 
to execute on that device. 

Device device = Device.firstGPU();
// sized relative to the device's reported max memory; the cast narrows the long result
final char input[] = new char[(int) (((OpenCLDevice) device).getMaxMemory() / 4)];
Kernel kernel = new Kernel(){
    @Override public void run(){
      // uses input[]
    }
};
Range range = device.createRange2D(1024, 1024);
kernel.execute(range);

See the proposal (which is actually implemented in the trunk). 

https://code.google.com/p/aparapi/wiki/DeviceProposal

Original comment by frost.g...@gmail.com on 18 Sep 2012 at 6:35

GoogleCodeExporter commented 8 years ago
Perhaps you should edit the wiki page to make it more clear that this has been 
implemented.

Original comment by oakw...@minousoft.com on 18 Sep 2012 at 9:08

GoogleCodeExporter commented 8 years ago
Good point ;) 

I just added this. 

https://code.google.com/p/aparapi/wiki/ChoosingSpecificDevicesForExecution

Hopefully it will help.

Gary

Original comment by frost.g...@gmail.com on 18 Sep 2012 at 10:19