Ryan,
Thanks for the PDF. There was some support for multiple devices in the code. In
fact, issue #18 turned out to be centered on this, and I ended up backing the
multi-device code out. The problem is dissecting the execution arbitrarily. We
can easily execute half of the threads on each device and can schedule memory
transfers to both devices. However, this extra processing sometimes hampers
performance: some workloads run faster on one device than they do when we take
advantage of both.
Also, memory is not coherent between devices, so whether we should execute on
both also depends on the memory access patterns.
Sometimes I think the answer is to let the user select devices and control this
explicitly, but that seems incompatible with Aparapi's goal of making things
easier for the developer.
Do you have a workload that you think will benefit?
I would like to work on this but worry about the performance degradation.
Gary
Original comment by frost.g...@gmail.com
on 25 Nov 2011 at 3:22
Here is an excellent example of how I think this could work initially:
http://developer.nvidia.com/opencl-sdk-code-samples#oclSimpleMultiGPU
My thinking is similar to what that sample code demonstrates, which is to give
the user a number of options (listed below):
- Enable a single GPU by default (as is now)
- Enable multiple GPUs, with two options:
  - User specifies particular GPUs (com.amd.aparapi.enableSpecifiedGPUs)
  - User lets Aparapi automatically use all available GPUs (com.amd.aparapi.enableAllGPUs); see the sketch after this list
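To make that concrete, here is a minimal sketch of how such properties might be consulted at startup. Neither property exists in Aparapi today; the names are simply the suggestions above.

// Hypothetical sketch: neither property below exists in Aparapi;
// the names come from the proposal in this comment.
String specified = System.getProperty("com.amd.aparapi.enableSpecifiedGPUs"); // e.g. "0,2"
boolean useAllGpus = Boolean.getBoolean("com.amd.aparapi.enableAllGPUs");
if (specified != null) {
    for (String index : specified.split(",")) {
        int gpuIndex = Integer.parseInt(index.trim());
        // ...enable the GPU at gpuIndex...
    }
} else if (useAllGpus) {
    // ...enable every available GPU...
}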
To answer your other questions:
- Very, very large datasets
- Experimentation on our high-end workstations for different workloads
- Experimentation on our multi-node clusters for different workloads
- Allow our code to execute two different kernels on two different GPUs in
parallel (sketched below)
While the last one would be difficult given a simple approach, the first three
could potentially benefit.
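For that last item, a minimal sketch under stated assumptions: kernelA/kernelB, gpu0/gpu1, and the sizes are supplied by the caller (imports of com.amd.aparapi.Kernel and Device assumed), and device.createRange comes from the Device API that appears later in this thread. Two unrelated kernels can then be launched concurrently from two host threads, one per device.

// Sketch only: runs two different kernels on two different GPUs in parallel.
// All parameters are assumed to be supplied by the caller.
static void runBoth(final Kernel kernelA, final Kernel kernelB,
                    final Device gpu0, final Device gpu1,
                    final int sizeA, final int sizeB) throws InterruptedException {
    Thread t0 = new Thread(new Runnable() {
        @Override public void run() { kernelA.execute(gpu0.createRange(sizeA)); }
    });
    Thread t1 = new Thread(new Runnable() {
        @Override public void run() { kernelB.execute(gpu1.createRange(sizeB)); }
    });
    t0.start(); t1.start();
    t0.join(); t1.join();
}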
Original comment by ryan.lam...@gmail.com
on 27 Nov 2011 at 7:19
Thanks for the NVidia link. I had not seen their multi-GPU example before.
Actually the NVidia example (if I am reading it correctly) partitions the data
buffers between the available GPUs and then executes using separate command
queues (essentially launching one kernel per device with each processing
globalSize/#devices).
This will work for some algorithms, but not all. Some will need us to transfer
all of the data to each device whilst still executing only globalSize/#devices
work items per device.
For example, Mandelbrot would work using the suggested NVidia approach, whereas
NBody (which requires each kernel to see all of the data) would require the
second.
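To illustrate the difference, here is a rough sketch of the first (partition-the-data, Mandelbrot-friendly) strategy: one Kernel per device, each handed only its own slice. This is not Aparapi API; the device list, the per-element work, and the result stitching are placeholders, the input length is assumed to divide evenly, and device.createRange assumes the Device proposal that comes up later in this thread.

import java.util.Arrays;

import com.amd.aparapi.Device; // package names per the 2012 trunk
import com.amd.aparapi.Kernel;
import com.amd.aparapi.Range;

public class PartitionSketch {
    // One kernel per device, each executing over its own slice of the input.
    static float[][] runPartitioned(final Device[] gpus, final float[] input)
            throws InterruptedException {
        final int chunk = input.length / gpus.length; // assumes even division
        final float[][] slices = new float[gpus.length][];
        final Thread[] workers = new Thread[gpus.length];

        for (int d = 0; d < gpus.length; d++) {
            // Each device sees only its slice, so no buffer is shared.
            final float[] slice = Arrays.copyOfRange(input, d * chunk, (d + 1) * chunk);
            slices[d] = slice;
            final Kernel kernel = new Kernel() {
                @Override public void run() {
                    final int i = getGlobalId();
                    slice[i] = slice[i] * 2f; // placeholder per-element work
                }
            };
            final Range range = gpus[d].createRange(chunk);
            workers[d] = new Thread(new Runnable() {
                @Override public void run() {
                    kernel.execute(range);
                }
            });
            workers[d].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        return slices; // caller stitches these back into one result
    }
}

The NBody-friendly alternative would instead pass the whole input[] to every kernel and offset getGlobalId() by d * chunk, paying the cost of replicating all buffers on every device.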
This is actually the problem that I could not work out how to solve. The code
that I added prior to the open-source release actually mapped to the
NBody-friendly approach (transfer all data to all GPUs but split the execution
between devices).
However, if the Kernel writer could tell Aparapi which approach to use, maybe
via an annotation on the Kernel or a multi-device version of Kernel.execute(),
then we might be able to pass this decision on to the developer, with the
current mode as the default.
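Purely as an illustration of that annotation idea, a hypothetical sketch follows. None of this exists in Aparapi; the annotation, the enum, and their names are invented here.

import com.amd.aparapi.Kernel;

// Hypothetical only: lets the Kernel writer declare the multi-device data policy.
enum DataPolicy { PARTITION, REPLICATE_ALL }

@interface MultiDevice {
    DataPolicy dataPolicy();
}

@MultiDevice(dataPolicy = DataPolicy.REPLICATE_ALL) // NBody-style
class NBodyKernel extends Kernel {
    @Override public void run() {
        // every device sees all buffers; only the execution is split
    }
}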
BTW this gets really interesting with global barriers ;) and with non-uniform
group sizes (i.e., when the devices are different).
Gary
Original comment by frost.g...@gmail.com
on 28 Nov 2011 at 5:14
A few extra use cases:
- The first OpenCL device is not necessarily the fastest one in a system, so the
ability to choose would be beneficial in some situations.
- Some people have devices with unstable drivers, in which case it would benefit
end-users if they could tell apps to blacklist malfunctioning OpenCL devices.
Original comment by oakw...@minousoft.com
on 18 Sep 2012 at 7:28
Oakwhiz,
Have you tried the new Device API for targeting a specific device?
Will this cover your needs?
You can either select a device yourself:
Device chosen = null;
for (Device device : devices.getAll()) { // 'devices' is assumed to hold the available devices
    if (device.getVendor().contains("AMD") && device.isGPU()) {
        chosen = device;
        break;
    }
}
Or use one of the shortcuts.
Device device = Device.best();
Then when you create a Range using the selected device, this forces your kernel
to execute on that device.
Device device = Device.firstGPU();
final char input[] = new char[(int) (((OpenCLDevice) device).getMaxMemory() / 4)];
Kernel kernel = new Kernel(){
    @Override public void run(){
        // uses input[];
    }
};
Range range = device.createRange2D(1024, 1024);
kernel.execute(range);
See the proposal (which is actually implemented in the trunk).
https://code.google.com/p/aparapi/wiki/DeviceProposal
Original comment by frost.g...@gmail.com
on 18 Sep 2012 at 6:35
Perhaps you should edit the wiki page to make it clearer that this has been
implemented.
Original comment by oakw...@minousoft.com
on 18 Sep 2012 at 9:08
Good point ;)
I just added this.
https://code.google.com/p/aparapi/wiki/ChoosingSpecificDevicesForExecution?ts=1348006720&updated=ChoosingSpecificDevicesForExecution
Hopefully it will help.
Gary
Original comment by frost.g...@gmail.com
on 18 Sep 2012 at 10:19
Original issue reported on code.google.com by
ryan.lam...@gmail.com
on 25 Nov 2011 at 1:20