tugrul512bit / Cekirdekler

Multi-device OpenCL kernel load balancer and pipeliner API for C#. Uses a shared-distributed memory model to keep GPUs updated quickly while running the same kernel on all devices (for simplicity).
GNU General Public License v3.0
batch-processing dynamic gpgpu gpu gpu-acceleration gpu-computing iterative load-balancer multi multi-device multi-gpu opencl opencl-kernels parallelism pipelining pool zero-copy

Cekirdekler

C# multi-device GPGPU (OpenCL) compute API with an iterative inter-device load-balancing feature that pipelines read/write/compute operations for developers' custom OpenCL kernels. The main idea is to treat N devices as a single device whenever possible, so the entire platform can be used easily through a shared-distributed memory model under the hood.

64-bit only: set "project settings -> build -> platform target -> x64". The configuration manager also needs to look like this:

Needs an extra C++ DLL, built for 64-bit (x86_64) from https://github.com/tugrul512bit/CekirdeklerCPP, which must be named KutuphaneCL.dll.

The other required DLL is Microsoft's System.Threading.dll, together with its XML helper, for .Net 2.0. Alternatively, you can adjust the "using" directives and target .Net 3.5+ in your own project, in which case System.Threading.dll is not needed.

In total, Cekirdekler.dll, KutuphaneCL.dll and .Net 3.5 should be enough.

Usage: add only Cekirdekler.dll and System.Threading.dll as references to your C# projects. The other files need to be in the same folder as Cekirdekler.dll or the executable of the main project.

This project is being enhanced using ZenHub:

Features


Documentation

Details and a tutorial are available in the Cekirdekler-wiki.


Known Issues


Example that computes 1000 workitems across all GPUs in a PC. Each GPU gets a contiguous slice of the global id range: GPU1 computes global ids 0 to M, GPU2 computes M+1 to K, and so on up to GPU_N, which computes global ids Y to Z.

        Cekirdekler.ClNumberCruncher cr = new Cekirdekler.ClNumberCruncher(
            Cekirdekler.AcceleratorType.GPU, @"
                __kernel void hello(__global char * arr)
                {
                    int threadId=get_global_id(0);
                    printf(""hello world"");
                }
            ");

        Cekirdekler.ClArrays.ClArray<byte> array = new Cekirdekler.ClArrays.ClArray<byte>(1000);
        // Cekirdekler.ClArrays.ClArray<byte> array = new byte[1000]; // host arrays are usable too!
        array.compute(cr, 1, "hello", 1000, 100); 
        // The local range is 100 here, so this example spawns 10 workgroups that all GPUs share.
        // For example, GPU1 computes 2 groups, GPU2 computes 5 groups and another GPU computes 3 groups.
        // Global id values are continuous across all global workitems, and local id values are also safe to use.
        // Faster GPUs get a larger share of the work over iterations: the balancer is performance-aware
        // across repetitions of the same work.

        // No need to dispose anything at the end: the objects clean up themselves when out of scope or collected by the GC.
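
The workgroup sharing described in the comments above can be pictured with a small standalone sketch. It is not Cekirdekler's actual balancer: the class `LoadBalanceSketch` and its methods `SplitGroups`, `ToGlobalRanges` and `WeightsFromTimings` are hypothetical names introduced only for illustration. The sketch assumes that each device gets whole workgroups in proportion to a speed weight, and that weights are re-estimated from measured run times.

```csharp
using System;
using System.Linq;

// Hypothetical helper, NOT part of the Cekirdekler API. It only illustrates
// one plausible way to share whole workgroups among devices by speed.
public static class LoadBalanceSketch
{
    // Splits groupCount workgroups among devices in proportion to weights.
    // Largest-remainder rounding keeps the counts summing to groupCount.
    public static int[] SplitGroups(int groupCount, double[] weights)
    {
        double total = weights.Sum();
        double[] exact = weights.Select(w => groupCount * w / total).ToArray();
        int[] counts = exact.Select(e => (int)Math.Floor(e)).ToArray();
        int remaining = groupCount - counts.Sum();

        // Hand leftover groups to the devices with the largest fractional share.
        int[] order = Enumerable.Range(0, weights.Length)
            .OrderByDescending(i => exact[i] - counts[i])
            .ToArray();
        for (int k = 0; k < remaining; k++)
            counts[order[k % order.Length]]++;
        return counts;
    }

    // Turns per-device group counts into contiguous global id ranges.
    // Each entry is { firstGlobalId, endGlobalIdExclusive }.
    public static int[][] ToGlobalRanges(int[] groupCounts, int localSize)
    {
        int[][] ranges = new int[groupCounts.Length][];
        int offset = 0;
        for (int i = 0; i < groupCounts.Length; i++)
        {
            int length = groupCounts[i] * localSize;
            ranges[i] = new int[] { offset, offset + length };
            offset += length;
        }
        return ranges;
    }

    // Re-estimates weights as throughput (groups per millisecond) from the
    // last run. Simplification: a device that received zero groups gets
    // weight zero, and elapsed times are assumed to be positive.
    public static double[] WeightsFromTimings(int[] groupCounts, double[] elapsedMs)
    {
        return groupCounts.Select((g, i) => g / elapsedMs[i]).ToArray();
    }
}
```

With 10 workgroups of 100 workitems and weights 2 : 5 : 3, `SplitGroups` yields 2, 5 and 3 groups, and `ToGlobalRanges` maps them to the global id ranges [0, 200), [200, 700) and [700, 1000), matching the split mentioned in the comments. Feeding the measured times of one run into `WeightsFromTimings` and calling `SplitGroups` again gives the faster devices a larger share on the next repetition.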