add built-in matrix multiplication with sizes between 2x2 and 8192x8192

tugrul512bit / Cekirdekler

Multi-device OpenCL kernel load balancer and pipeliner API for C#. Uses a shared-distributed memory model to keep GPUs updated quickly while running the same kernel on all devices (for simplicity).

GNU General Public License v3.0


add built-in matrix multiplication with sizes between 2x2 and 8192x8192 #27

Open tugrul512bit opened 7 years ago

tugrul512bit commented 7 years ago

Batched multiplication for small matrices (2x2, 4x4, 16x16, 32x32), and single multiplication at 8k x 8k with sub-matrix partitioning to improve load balancing.
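To illustrate the batched case, here is a minimal sketch in plain Python. It is not Cekirdekler's API. It multiplies many small matrices in one call and splits the batch into contiguous ranges, one per device, sized by a per-device weight. The names `split_batch` and `batched_matmul` and the weights are hypothetical, chosen only for this example.

```python
def matmul(a, b):
    """Naive product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def split_batch(batch_size, weights):
    """Split range(batch_size) into contiguous (start, end) ranges,
    one per device, sized in proportion to the (hypothetical) weights."""
    total = sum(weights)
    ranges, start, acc = [], 0, 0
    for w in weights:
        acc += w
        # Rounding the cumulative share keeps the ranges gap-free
        # and makes the last range end exactly at batch_size.
        end = batch_size * acc // total
        ranges.append((start, end))
        start = end
    return ranges


def batched_matmul(a_batch, b_batch, weights):
    """Compute a_batch[i] * b_batch[i] for every i.
    The loop over ranges stands in for work issued to separate devices."""
    out = [None] * len(a_batch)
    for start, end in split_batch(len(a_batch), weights):
        for i in range(start, end):
            out[i] = matmul(a_batch[i], b_batch[i])
    return out
```

With weights `[3, 1]` and a batch of 10 matrices, the first device would get items 0 to 6 and the second items 7 to 9. A real load balancer would presumably adjust the weights from measured kernel times.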

N levels of partitioning (4, 16, 64, 256 sub-matrices) or M levels of batching.
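The sub-matrix counts 4, 16, 64, 256 match 4^N for N = 1..4, which is what you get if each level halves both sides of the output matrix. Under that assumption, the partitioning could be sketched as follows, in plain Python rather than OpenCL. The function names are hypothetical. Each output tile depends only on a row band of A and a column band of B, so tiles can be handed to different devices independently.

```python
def tiles(n, levels):
    """Return (row, col, size) for each of the 4**levels output tiles
    of an n x n matrix, assuming each level halves both sides."""
    parts = 2 ** levels  # tiles per side
    if n % parts:
        raise ValueError("n must be divisible by 2**levels")
    size = n // parts
    return [(r * size, c * size, size)
            for r in range(parts) for c in range(parts)]


def multiply_tile(a, b, row, col, size):
    """Compute one size x size tile of C = A * B.
    It reads rows row..row+size of A and columns col..col+size of B only."""
    n = len(a)
    return [[sum(a[row + i][k] * b[k][col + j] for k in range(n))
             for j in range(size)]
            for i in range(size)]


def partitioned_matmul(a, b, levels):
    """Assemble C from independently computed tiles."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    # Each iteration is an independent work item that a load balancer
    # could assign to any device.
    for row, col, size in tiles(n, levels):
        block = multiply_tile(a, b, row, col, size)
        for i in range(size):
            c[row + i][col:col + size] = block[i]
    return c
```

For an 8192x8192 product, `tiles(8192, 4)` yields 256 tiles of 512x512 each. More tiles give the balancer finer granularity, at the cost of more per-tile overhead.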