GajowyJ opened 1 year ago
Hi,
Before starting this topic: I found a bug when using float64 and have just fixed it, so please update with "composer update".
Sorry for the lack of documentation for developers using only the rindow-math libraries; very few people have done that so far.
The manual does not describe how to program the GPU directly. The libraries are designed so that you can easily switch to the GPU when using rindow-neuralnetworks, and there is a manual for that. But you want to use only the rindow-math libraries.
The "composer update" itself is very fast. If execution time is what bothers you, there is more you can do.
1) About linear algebra functions and raw-mode LA
The cross() function trades speed for convenience; use the gemm() function from the linear algebra functions instead. The la() method returns the linear algebra object, and laRawMode() returns a raw-mode version that omits emulation and eliminates even more time overhead. Use gemm() to compute matrix products.
2) Consider using the GPU
Your source code does not use the GPU yet. If you don't need the GPU, you don't need rindow_opencl or rindow_clblast; there is no need to install them or add them to php.ini.
However, the GPU speeds things up when handling large data, or when you program asynchronous operations. Since the CPU and GPU work asynchronously, the more often you wait for calculation results, the slower the overall processing becomes compared to the CPU alone. Speed also depends on the quality of the GPU vendor's OpenCL driver.
Here is an example that runs on the CPU, and on the GPU in blocking mode.
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\NDArray;
use Interop\Polite\Math\Matrix\OpenCL;

$mo = new MatrixOperator;
//$mode = 'CPU-NORMAL';
//$mode = 'CPU-RAW';
$mode = 'GPU';
$size = 1000;
$epochs = 100;
//$dtype = NDArray::float32;
$dtype = NDArray::float64;

switch($mode) {
    case 'CPU-NORMAL': {
        $la = $mo->la();
        break;
    }
    case 'CPU-RAW': {
        $la = $mo->laRawMode();
        break;
    }
    case 'GPU': {
        $la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
        echo "blocking mode...\n";
        $la->blocking(true);
        break;
    }
    default: {
        throw new Exception('Invalid mode');
    }
}

$accel = $la->accelerated() ? 'GPU' : 'CPU';
echo "Mode:$accel($mode)\n";
$fp64 = $la->fp64() ? 'TRUE' : 'FALSE';
echo "Supports float64 on this device: $fp64\n";
if($dtype==NDArray::float64 && $fp64=='FALSE') {
    $dtype = NDArray::float32;
}
if($accel=='CPU') {
    $name = $la->getBlas()->getCorename();
    echo "CPU core name :$name\n";
    $theads = $la->getBlas()->getNumThreads();
    echo "CPU theads:$theads\n";
} else {
    $i = 0;
    $devices = $la->getContext()->getInfo(OpenCL::CL_CONTEXT_DEVICES);
    $name = $devices->getInfo($i,OpenCL::CL_DEVICE_NAME);
    echo "GPU device name :$name\n";
    $cu = $devices->getInfo($i,OpenCL::CL_DEVICE_MAX_COMPUTE_UNITS);
    echo "GPU Compute units :$cu\n";
}
$strdtype = $mo->dtypetostring($dtype);
echo "data type: $strdtype\n";
echo "computing size: [$size,$size]\n";
echo "epochs: $epochs\n";

$a = $mo->arange($size*$size,dtype:$dtype)->reshape([$size,$size]);
$b = $mo->arange($size*$size,dtype:$dtype)->reshape([$size,$size]);
$a = $la->array($a);
$b = $la->array($b);

$start = microtime(true);
for($i=0;$i<$epochs;$i++) {
    $c = $la->gemm($a,$b);
}
echo "elapsed time:".(microtime(true)-$start)."\n";
3) About parallel computing
Even in CPU mode, rindow-math runs multithreaded via the OpenBLAS library.
In GPU mode with non-blocking mode, the CPU and GPU work asynchronously, so even after a function call returns, the processing is not finished. Therefore, while the GPU is busy, the CPU can work in parallel. To receive the GPU's calculation result, you must wait for completion with the finish() function. If you are familiar with parallel programming, you should try this method.
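A minimal sketch of that non-blocking pattern, assuming the same clblast backend as in the example above (the key points are simply not calling blocking(true), and calling finish() before reading the result; this is an untested sketch, not verified against the library):

```php
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\OpenCL;

$mo = new MatrixOperator();
$la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
// Note: no $la->blocking(true) here, so calls may return before the GPU finishes.

$a = $la->array($mo->arange(1000*1000)->reshape([1000,1000]));
$b = $la->array($mo->arange(1000*1000)->reshape([1000,1000]));

$c = $la->gemm($a,$b);   // queued on the GPU; returns without waiting

// ... do unrelated CPU work here while the GPU computes ...

$la->finish();           // synchronization point: wait for the GPU result
$result = $la->toNDArray($c);
```

The benefit appears only if you actually have CPU work to overlap with the GPU computation; otherwise blocking mode is simpler.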
Good luck!
Thank you! So, the huge speed improvement so far was not related to the GPU at all. Well, that's promising.
I ran your code and the results are a bit strange:
Mode: CPU (CPU-NORMAL)
Supports float64 on this device: TRUE
CPU core name: Sandybridge
CPU theads: 8
data type: float64
computing size: [1000,1000]
epochs: 100
elapsed time: 2.787933
Mode: CPU (CPU-RAW)
Supports float64 on this device: TRUE
CPU core name: Sandybridge
CPU theads: 8
data type: float64
computing size: [1000,1000]
epochs: 100
elapsed time: 2.819374
blocking mode...
Mode: GPU (GPU)
Supports float64 on this device: TRUE
GPU device name: GeForce GTX 1050
GPU Compute units: 5
data type: float64
computing size: [1000,1000]
epochs: 100
elapsed time: 3.617565
Shouldn't GPU mode be faster than the CPU? And I don't understand the reported number of GPU compute units. My GPU card has 640 cores, not 5. Maybe this is the reason why there is no improvement in speed? Or is a "compute unit" not the same as a "core"?
I also don't understand how you create the matrices ($a, $b) - I can't find anything about the arange or reshape methods. Could you please show me an example of how to convert a PHP array into one accepted by the code?
Best wishes, G.
That's a good question.
The GPU is faster when doing large matrix operations.
** == 1000x1000 ==
Mode:CPU(CPU-RAW)
Supports float64 on this device: TRUE
CPU core name :Sandybridge
CPU theads:4
data type: float32
computing size: [1000,1000]
epochs: 100
elapsed time:3.3692150115967
blocking mode...
Mode:GPU(GPU)
Supports float64 on this device: FALSE
GPU device name :Intel(R) HD Graphics 4000
GPU Compute units :16
data type: float32
computing size: [1000,1000]
epochs: 100
elapsed time:6.1611762046814
** == 5000x5000 ==
Mode:CPU(CPU-RAW)
Supports float64 on this device: TRUE
CPU core name :Sandybridge
CPU theads:4
data type: float32
computing size: [5000,5000]
epochs: 10
elapsed time:40.186738014221
blocking mode...
Mode:GPU(GPU)
Supports float64 on this device: FALSE
GPU device name :Intel(R) HD Graphics 4000
GPU Compute units :16
data type: float32
computing size: [5000,5000]
epochs: 10
elapsed time:27.804306983948
I don't know how NVIDIA configures compute units.
To illustrate the concept, here is a somewhat imprecise explanation:
Compute units are the number of threads that can run independently and in parallel; cores are the number of arithmetic units. The cores controlled by one compute unit can only perform the same operation at the same time.
For example, if 8 compute units each control 16 cores, 128 cores run simultaneously.
The GPU is built on a completely different concept than the CPU.
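The arithmetic in that example, as a trivial stand-alone sketch (the numbers are the illustrative ones from the paragraph above, not real device values):

```php
<?php
// Toy arithmetic from the explanation above: how many cores run at once.
$computeUnits = 8;   // independent, parallel schedulers
$coresPerUnit = 16;  // lock-step lanes controlled by one compute unit
echo $computeUnits * $coresPerUnit, " cores running simultaneously\n"; // prints 128
```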
See this page:
They support only floating point; integers are not supported. Also, although single precision and double precision are separate functions in BLAS (for example, sgemm and dgemm), rindow automatically switches between them depending on the data type and calls the right one, so the underlying function name changes.
We have selected and implemented frequently used functions from among these.
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\NDArray;
use Interop\Polite\Math\Matrix\OpenCL;
$mo = new MatrixOperator();
$dtype = NDArray::float32;
//$dtype = NDArray::float64;
//// CPU
$la = $mo->la();
$a = $la->array([[1,2],[3,4]],dtype:$dtype);
$b = $la->array([[5,6],[7,8]],dtype:$dtype);
$c = $la->array([9,10],dtype:$dtype);
$y = $la->gemm($a,$b); // y = matrix-matrix-multiply(a,b)
$z = $la->gemv($a,$c); // z = matrix-vector-multiply(a,c)
$la->axpy($a,$b); // b = a + b
echo "y=".$mo->toString($y)."\n";
echo "z=".$mo->toString($z)."\n";
echo "b=".$mo->toString($b)."\n";
print_r($y->toArray());
//// GPU
$la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$la->blocking(true);
$a = $mo->la()->array([[1,2],[3,4]],dtype:$dtype);
$b = $mo->la()->array([[5,6],[7,8]],dtype:$dtype);
$c = $mo->la()->array([9,10],dtype:$dtype);
$a = $la->array($a);
$b = $la->array($b);
$c = $la->array($c);
$y = $la->gemm($a,$b); // y = matrix-matrix-multiply(a,b)
$z = $la->gemv($a,$c); // z = matrix-vector-multiply(a,c)
$la->axpy($a,$b); // b = a + b
echo "y=".$mo->toString($y)."\n";
echo "z=".$mo->toString($z)."\n";
echo "b=".$mo->toString($b)."\n";
print_r($y->toArray());
Thanks, ;-)
Thank you again! The example clarified a lot. I will dig deeper into the compute-unit concept with my NVIDIA card. From other tools using GPUs (like TychoTracker) I learned that computation speed sometimes increases 100x versus the CPU.
As for the example - it helped a lot, especially the array() function. I'm still not sure whether reshape will be of any use to me.
BTW, I found an issue:
For the code:
$la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
...
$a = $la->array($A,dtype:$dtype);
I experience the error:
PHP Fatal error: Uncaught Error: Unknown named parameter $dtype
For $la = $mo->la();
and $la = $mo->laRawMode();
it works without any error message.
And another issue (probably related): the computation results differ when using CPU or CPU-RAW vs. GPU.
And here are results of the same multiplication done in core PHP code:
EDIT:
Indeed, when changed $a = $la->array($A,dtype:$dtype);
to $a = $la->array($A,$dtype);
GPU results are the same like the rest:
It's because the second argument of array() function is called $flags in LinearAlgebraCL.php module.
Anyway, BLAS is magic! I tested matrix multiplication with my real-life example, and - even without any GPU support - the results are amazing:
COREPHP : 81.778 sec.
CPU-NORMAL: 1.571 sec. (1.528 + 0.020 + 0.023)
CPU-RAW: 1.521 sec. (1.478 + 0.019 + 0.023)
GPU: 1.548 sec. (1.467 + 0.058 + 0.023)
The first number in brackets is the conversion time of both input arrays from PHP, e.g. $a = $la->array($A,$dtype); $b = $la->array($B,$dtype);
the second is the calculation time, e.g. $c = $la->gemm($a,$b);
and the third is the time to convert the result back to a PHP array, e.g. $C = $c->toArray();
Do you think there is any possibility to decrease the time of the first operation? It is now about 97% of the whole execution time.
Actually, I shouldn't have given the GPU version of array() the same name, because its meaning is completely different from the CPU version. I didn't want to write a lot of code, so I later added the ability to write array([0,1,2]) directly. However, this method is not recommended.
When computing with the GPU, data is transferred as follows:
PHP world memory => CPU world memory => GPU world memory
The first copy uses $cpuLA->array(); the second copy uses $gpuLA->array().
When using the GPU, be sure to write it as follows:
$cpuArray = $cpuLA->array($phpArray);
$gpuArray = $gpuLA->array($cpuArray);
I have a very uneasy feeling about GPUs: the calculation results all differ depending on the GPU hardware and drivers.
The reason PHP's internal calculations and the CPU results are identical is that both are computed on the same CPU.
This may be hard to believe for those unfamiliar with scientific computing, but we usually accept such differences in calculation results as a matter of course, and we always process data with robust algorithms that work despite them.
I don't think any single change to the matrix product will improve it further.
If your entire application is built on PHP's native arrays, you would need to rewrite it all on top of a matrix math library, and also restructure the program to be optimized for the GPU. The GPU's benefit may appear only then. But if your application performs computations the GPU isn't good at, it won't help.
Thank you again for your explanation. As rewriting the code is my long-term plan, for now I have to focus on optimizing the existing one. Probably I can manipulate my matrices to pass them to MatrixOperator in a way that saves conversion time.
First findings on the construction of a new NDArrayPhp object. I measured the execution time of these operations:
A: $dummyBuffer = new ArrayObject();
B: $this->array2Flat($array,$dummyBuffer,$idx,$prepare=true);
C: $this->_buffer = $this->newBuffer($idx,$dtype);
D: $this->array2Flat($array,$this->_buffer,$idx,$prepare=false);
E: $shape = $this->genShape($array);
F: $this->assertShape($shape);
G: $size = (int)array_product($shape);
Exec time of A = 0.0000071 sec.
Exec time of B = 0.2654539 sec.
Exec time of C = 0.0169711 sec.
Exec time of D = 0.6417275 sec.
Exec time of E = 0.0000082 sec.
Exec time of F = 0.0000044 sec.
Exec time of G = 0.0000135 sec.
Total exec time: 0.931286
So, the most costly parts are B & D, i.e. the array2Flat function.
Looking into NDArrayPhp:
protected function array2Flat($A, $F, &$idx, $prepare)
{
    if(is_array($A)) {
        ksort($A);
    } elseif($A instanceof ArrayObject) {
        $A->ksort();
    }
In particular cases the sorting is probably not necessary. When the above fragment is commented out, some improvement is achieved:
Exec time of A = 0.0000069 sec.
Exec time of B = 0.1097541 sec.
Exec time of C = 0.0170492 sec.
Exec time of D = 0.4596682 sec.
Exec time of E = 0.0000084 sec.
Exec time of F = 0.0000053 sec.
Exec time of G = 0.0000156 sec.
Exec time: 0.593568
As you may have noticed, the most time-consuming part is converting between PHP arrays and CPU memory.
NDArrayPhp does the heavy lifting so that a loosely structured PHP array can easily be loaded into CPU memory.
However, if the data structure is known in advance, the developer can populate the memory directly.
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\NDArray;
use Interop\Polite\Math\Matrix\OpenCL;

function php2array($data,$a) {
    foreach($data as $i=>$row) {
        foreach($row as $j=>$d) {
            $a[$i][$j] = $d;
        }
    }
}

$mo = new MatrixOperator();
$dtype = NDArray::float32;
//$dtype = NDArray::float64;
$dataA = [[1,2],[3,4]];
$dataB = [[5,6],[7,8]];

$la = $mo->la();
$g_la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$g_la->blocking(true);

$a = $la->alloc([2,2],dtype:$dtype);
$b = $la->alloc([2,2],dtype:$dtype);
php2array($dataA,$a);
php2array($dataB,$b);

$g_a = $g_la->array($a);
$g_b = $g_la->array($b);
$g_y = $g_la->gemm($g_a,$g_b);
$y = $g_la->toNDArray($g_y);

foreach($y as $row) {
    foreach($row as $d) {
        echo $d."\n";
    }
}
In other words, the rindow-math-matrix library does not assume PHP arrays as input.
In fact, when rindow-math-matrix is used for machine learning, PHP arrays are not used as input; it is used with the input data already copied into CPU memory.
This is the design concept, because an application-specific algorithm is best suited to the task of loading application input data into the CPU memory space.
Well, I have added some code, but I have a problem with PHP objects - I cannot find the constructor of the OpenBlasBuffer class (called from the newBuffer method of the NDArrayPhp class). May I ask for your kind support once again?
new OpenBlasBuffer($size,$dtype);
I noticed that creating the buffer by adding elements one by one is slow. I'd like to test copying the whole array at once, but I feel a bit lost with class extensions, interfaces, and implements :(.
OpenBlasBuffer.php
use Rindow\OpenBLAS\Buffer as BufferImplement;
class OpenBlasBuffer extends BufferImplement implements LinearBuffer
{....}
Rindow\OpenBLAS\Buffer is implemented in Buffer.c;
OpenBlasBuffer is written entirely in C.
The only way to pass PHP data to the CPU memory space is via this Buffer.c.
The array2Flat function in NDArrayPhp also uses this Buffer's offsetSet() to store values one by one. That's why it is so slow!
In PHP,
$a[0] = 1.0;
and
$a->offsetSet(0, 1.0);
have the same meaning.
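That equivalence is just PHP's ArrayAccess interface at work - the bracket syntax is sugar for the offsetSet()/offsetGet() methods. A minimal stand-alone illustration (not rindow code; DemoBuffer is a made-up class for this demo):

```php
<?php
// Minimal ArrayAccess demo: $buf[0] = 1.0 is sugar for $buf->offsetSet(0, 1.0).
class DemoBuffer implements ArrayAccess {
    private array $data = [];
    public function offsetSet(mixed $i, mixed $v): void { $this->data[$i] = $v; }
    public function offsetGet(mixed $i): mixed { return $this->data[$i]; }
    public function offsetExists(mixed $i): bool { return isset($this->data[$i]); }
    public function offsetUnset(mixed $i): void { unset($this->data[$i]); }
}

$buf = new DemoBuffer();
$buf[0] = 1.0;            // implicitly calls offsetSet(0, 1.0)
$buf->offsetSet(1, 2.0);  // same mechanism, called explicitly
echo $buf[0] + $buf[1], "\n"; // prints 3
```

Because each element assignment is a full method call crossing the PHP/C boundary, filling a large buffer this way incurs per-element overhead, which is exactly the slowness observed above.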
Buffer also implements load() and dump() using binary data. However, it did not work well when floating-point values were converted to a string and loaded with the PHP standard function pack(), so I'm not using it for floating-point input.
The very slow conversion of PHP-world data into data that the CPU can handle directly is a big problem.
So far I haven't been able to get past this issue.
Finally I found a way to decrease the array-loading time from 0.809 sec to just 0.097 sec. It's based on pack() and works pretty well. As you can see, I dropped many checks, assuming the data are checked/prepared at an earlier stage of the code. The main trick is to pack a whole row of the array at once (call_user_func_array costs some time as well, so I asked the PHP team to consider something like vpack(string $format, array $values)). I verified the results by converting back to a PHP array with the toArray() method and comparing each cell.
I just copied the NDArrayPhp class and added my own constructor. This is probably not the optimal way, but it works ;-).
class NDArrayPhpTKL implements NDArray,Countable,Serializable,IteratorAggregate
...
    public function __construct(array $arr)
    {
        $buf = '';
        foreach( $arr as $row ) {
            array_unshift($row,'d*');
            $buf .= call_user_func_array('pack',$row);
        }
        $this->_buffer = $this->newBuffer(strlen($buf)/8,NDArray::float64);
        $this->_buffer->load($buf);
        $this->_offset = 0;
        $shape = $this->genShape($arr);
        $this->assertShape($shape);
        $this->_shape = $shape;
        $this->_dtype = NDArray::float64;
    }
Files, which I changed/added:
EDIT: somebody proposed a simpler solution with pack(), which works and is even faster:
    public function __construct(array $arr)
    {
        $buf = '';
        foreach( $arr as $row ) {
            $buf .= pack('d*',...$row);
        }
        $this->_buffer = $this->newBuffer(strlen($buf)/8,NDArray::float64);
        $this->_buffer->load($buf);
        $this->_offset = 0;
        $shape = $this->genShape($arr);
        $this->assertShape($shape);
        $this->_shape = $shape;
        $this->_dtype = NDArray::float64;
    }
Wonderful! If it works in your environment, you should use that method!
More elegantly, you can also write, for example:
function php2array($php)
{
    $buf = '';
    foreach($php as $row) {
        $buf .= pack('d*',...$row);
    }
    return $buf;
}
$a = $la->alloc([??,??],NDArray::float64);
$a->buffer()->load(php2array($php));
No need to modify the existing code.
Hi! Excellent! It can be done even more simply, inside a single function:
function myArray(object $la, array $arr) : object
{
    $buf = '';
    foreach($arr as $row) {
        $buf .= pack('d*',...$row);
    }
    $a = $la->alloc([count($arr),count(reset($arr))],NDArray::float64);
    $a->buffer()->load($buf);
    return $a;
}
However, this doesn't work in GPU mode:
$mo = new MatrixOperator;
$la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$la->blocking(true);
$a = myArray($la,$A);
PHP Fatal error: Uncaught Error: Call to undefined method Rindow\Math\Matrix\OpenCLBuffer::load()
Still don't know why. I will investigate.
Please remember the memory space.
PHP world memory => CPU world memory => GPU world memory
$mo = new MatrixOperator();
$la = $mo->la();
$gla = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$gla->blocking(true);
$a = myArray($la,$A);
$ga = $gla->array($a);
### something ####
$y = $gla->toNDArray($gy);
Hi. I'm testing different ideas for the GPU, still without significant improvement versus the CPU. I'm wondering about some things. Can I ask for your advice again, please?
Should I call
$la = $mo->laRawMode();
and
$gla = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
only once and then reuse them for all subsequent matrix operations? I mean, should I call them for every new pair of matrices that needs to be gemm'ed, or is it enough to set them up once and then reuse them? In the code below, $la and $gla would be global variables set up in the main routine and then used each time regMultiply is called.
function regMultiply(array $A, array $B) : array
{
    global $_USE_BLAS,$_USE_GPU;
    if( $_USE_BLAS )
    {
        $mo = new MatrixOperator;
        $la = $mo->laRawMode();
        $A = regMyArray($la,$A);
        $B = regMyArray($la,$B);
        if( $_USE_GPU )
        {
            $gla = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
            $gla->blocking(true);
            $A = $gla->array($A);
            $B = $gla->array($B);
            $C = $gla->gemm($A,$B);
            $C = $gla->toNDArray($C);
        }
        else
            $C = $la->gemm($A,$B);
        return $C->toArray();
    }
    else
    {
        $rows = count($A);
        $cols = count($B[0]);
        $m = count($A[0]);
        $C = [];
        for( $i=0; $i<$rows; $i++ )
            for( $j=0; $j<$cols; $j++ )
            {
                $C[$i][$j] = 0;
                for( $r=0; $r<$m; $r++ )
                    $C[$i][$j] += $A[$i][$r]*$B[$r][$j];
            }
        return $C;
    }
}
Hi ;-)
1) $mo, $la, and $gla can be used any number of times if they are created only once in the application. But $mo, $la, and $gla are a set: they are assumed to be made from the same $mo. If you make $gla a global variable, you may also need to make $mo and $la global variables. I never use global variables, so I haven't tried it.
2) CPU and GPU memory are automatically freed when the NDArray and NDArrayCL PHP objects are freed. There is no way to close them explicitly. If PHP objects are not freed due to reference memory leaks, the CPU or GPU memory is not freed either. PHP's "Collecting Cycles" is UNRELIABLE.
3) If you want parameter performance tuning, you need to tune the CLBlast library. I don't know any detailed tuning information for CLBlast, so please ask the CLBlast team. If you rebuild the CLBlast library, you also need to rebuild the rindow-clblast extension against your DLL. Here's how:
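A sketch of point 1 - creating $mo, $la, and $gla once at startup and passing them in, instead of recreating them per call. This is an untested sketch based on the API shown earlier in this thread; passing the objects as parameters avoids globals entirely (regMyArray here is the user's helper from the earlier post):

```php
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\OpenCL;

// Create the operator and both LA objects once, from the same MatrixOperator.
$mo  = new MatrixOperator();
$la  = $mo->laRawMode();
$gla = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$gla->blocking(true);

// Reuse the same objects for every multiplication.
function regMultiply(object $la, object $gla, array $A, array $B) : array
{
    $a = regMyArray($la,$A);       // PHP -> CPU memory
    $b = regMyArray($la,$B);
    $ga = $gla->array($a);         // CPU -> GPU memory
    $gb = $gla->array($b);
    $gc = $gla->gemm($ga,$gb);
    return $gla->toNDArray($gc)->toArray(); // GPU -> CPU -> PHP
}
```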
I have discovered* that
$buf = $buf.$var;
is approx. 23 times faster than
$buf .= $var;
Unbelievable!
Ech... strange.
And the next saving comes from replacing
foreach($A as $row)
with
for( $r=0; $r<count($A); $r++ )
With these changes, the myArray function saves approx. 70% of execution time.
*EDIT: it depends on the available memory. The first form is faster only if there is enough memory available for the script (for me, 8GB was not enough); otherwise the second form (with ".=") works much, much faster. I'm adding this info because I think it's worth knowing, though maybe it's not useful for you.
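For reference, a minimal pure-PHP micro-benchmark for comparing the two forms yourself. Timings vary a lot with PHP version, opcache, and available memory (as the EDIT above notes), so no particular winner is asserted here:

```php
<?php
// Compare '$s = $s . $x' (copy-then-assign) with '$s .= $x' (append).
function bench(callable $fn, int $n): float {
    $t0 = microtime(true);
    $fn($n);
    return microtime(true) - $t0;
}

$n = 10000;
$chunk = str_repeat('x', 64);

$copyConcat = function (int $n) use ($chunk): string {
    $s = '';
    for ($i = 0; $i < $n; $i++) { $s = $s . $chunk; }
    return $s;
};
$inPlace = function (int $n) use ($chunk): string {
    $s = '';
    for ($i = 0; $i < $n; $i++) { $s .= $chunk; }
    return $s;
};

printf("\$s = \$s . \$x : %.4f sec\n", bench($copyConcat, $n));
printf("\$s .= \$x     : %.4f sec\n", bench($inPlace, $n));
```

Both forms build the identical string; only the timing differs, so run it on your own machine before committing to either style.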
Hi @yuichiis , I have to say that the speed improvement is impressive! I got 10x acceleration, which means that calculations that used to take 7 days are now done in just a few hours. I've found that it can be even more productive when using threads from the PHP parallel extension (to be honest, this is a bit surprising to me - does it mean the GPU units still have some free time when using the rindow extension?). And it was enough to migrate only one function (matrix multiplication = cross product). Thank you for your excellent work on parallelism in PHP! Since time is crucial for my purposes, I must drop any unnecessary code. That's why I avoid using composer/autoloader (checking the PHP version against 5.6 and the cascade of includes and function calls each time is time-wasting). So I found a minimal configuration that works:
And in php.ini I added:
As you noticed, there is nothing about
rindow_clblast
. Does this mean that this extension is not for matrix manipulation? If I add it, will I get any extra speed-up?