GajowyJ opened 1 year ago
Hi,
Before starting this topic: I found a bug when using float64 and have just fixed it, so please update with "composer update".
Sorry for the lack of documentation for developers using only the rindow-math libraries; very few people have done that so far.
The manual does not describe how to program the GPU directly. The libraries are designed so that you can easily switch to the GPU when using rindow-neuralnetworks, and there is a manual for that. But you want to use only the rindow-math libraries.
The "composer update" itself is very fast. If execution time is what bothers you, there is more you can do.
1) About linear algebra functions and raw-mode LA
The cross() function trades speed for convenience; use the gemm() function from the linear algebra functions instead. The la() method returns the linear algebra object, and laRawMode() returns a raw-mode version that omits emulation and eliminates even more time overhead. Use gemm() to compute matrix products.
2) Consider using the GPU
Your source code does not use the GPU yet. If you don't need the GPU, you don't need rindow_opencl or rindow_clblast; there is no need to install them or add them to php.ini.
However, the GPU speeds things up when handling large data, or when you program asynchronous operations. Since the CPU and GPU work asynchronously, the more often you wait for calculation results, the slower the overall processing becomes compared to the CPU alone. Speed also depends on the quality of the GPU vendor's OpenCL driver.
Here is an example that runs on the CPU, and on the GPU in blocking mode.
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\NDArray;
use Interop\Polite\Math\Matrix\OpenCL;

$mo = new MatrixOperator;
//$mode = 'CPU-NORMAL';
//$mode = 'CPU-RAW';
$mode = 'GPU';
$size = 1000;
$epochs = 100;
//$dtype = NDArray::float32;
$dtype = NDArray::float64;

switch($mode) {
    case 'CPU-NORMAL': {
        $la = $mo->la();
        break;
    }
    case 'CPU-RAW': {
        $la = $mo->laRawMode();
        break;
    }
    case 'GPU': {
        $la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
        echo "blocking mode...\n";
        $la->blocking(true);
        break;
    }
    default: {
        throw new Exception('Invalid mode');
    }
}

$accel = $la->accelerated() ? 'GPU' : 'CPU';
echo "Mode:$accel($mode)\n";
$fp64 = $la->fp64() ? 'TRUE' : 'FALSE';
echo "Supports float64 on this device: $fp64\n";
if($dtype==NDArray::float64 && $fp64=='FALSE') {
    $dtype = NDArray::float32;
}
if($accel=='CPU') {
    $name = $la->getBlas()->getCorename();
    echo "CPU core name :$name\n";
    $theads = $la->getBlas()->getNumThreads();
    echo "CPU theads:$theads\n";
} else {
    $i = 0;
    $devices = $la->getContext()->getInfo(OpenCL::CL_CONTEXT_DEVICES);
    $name = $devices->getInfo($i,OpenCL::CL_DEVICE_NAME);
    echo "GPU device name :$name\n";
    $cu = $devices->getInfo($i,OpenCL::CL_DEVICE_MAX_COMPUTE_UNITS);
    echo "GPU Compute units :$cu\n";
}
$strdtype = $mo->dtypetostring($dtype);
echo "data type: $strdtype\n";
echo "computing size: [$size,$size]\n";
echo "epochs: $epochs\n";

$a = $mo->arange($size*$size,dtype:$dtype)->reshape([$size,$size]);
$b = $mo->arange($size*$size,dtype:$dtype)->reshape([$size,$size]);
$a = $la->array($a);
$b = $la->array($b);

$start = microtime(true);
for($i=0;$i<$epochs;$i++) {
    $c = $la->gemm($a,$b);
}
echo "elapsed time:".(microtime(true)-$start)."\n";
3) About parallel computing
Even in CPU mode, rindow-math runs multithreaded via the OpenBLAS library.
In GPU mode with non-blocking mode, the CPU and GPU work asynchronously, so even after a function call returns, the processing is not finished. Therefore, while the GPU is busy, the CPU can work in parallel. To receive the GPU's calculation result, you must wait for completion with the finish() function. If you are familiar with parallel programming, you should try this method.
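A minimal sketch of that non-blocking pattern, assuming the same clblast backend as in the example above (the key points are simply not calling blocking(true), and calling finish() before reading the result; this is an untested sketch, not verified against the library):

```php
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\OpenCL;

$mo = new MatrixOperator();
$la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
// Note: no $la->blocking(true) here, so calls may return before the GPU finishes.

$a = $la->array($mo->arange(1000*1000)->reshape([1000,1000]));
$b = $la->array($mo->arange(1000*1000)->reshape([1000,1000]));

$c = $la->gemm($a,$b);   // queued on the GPU; returns without waiting

// ... do unrelated CPU work here while the GPU computes ...

$la->finish();           // synchronization point: wait for the GPU result
$result = $la->toNDArray($c);
```

The benefit appears only if you actually have CPU work to overlap with the GPU computation; otherwise blocking mode is simpler.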
Good luck!
Thank you! So, the huge speed improvement so far was not related to the GPU at all. Well, that's promising.
I ran your code and the results are a bit strange:
Mode: CPU (CPU-NORMAL)
Supports float64 on this device: TRUE
CPU core name: Sandybridge
CPU theads: 8
data type: float64
computing size: [1000,1000]
epochs: 100
elapsed time: 2.787933
Mode: CPU (CPU-RAW)
Supports float64 on this device: TRUE
CPU core name: Sandybridge
CPU theads: 8
data type: float64
computing size: [1000,1000]
epochs: 100
elapsed time: 2.819374
blocking mode...
Mode: GPU (GPU)
Supports float64 on this device: TRUE
GPU device name: GeForce GTX 1050
GPU Compute units: 5
data type: float64
computing size: [1000,1000]
epochs: 100
elapsed time: 3.617565
Shouldn't GPU mode be faster than the CPU? And I don't understand the reported number of GPU compute units. My GPU card has 640 cores, not 5. Maybe this is the reason why there is no improvement in speed? Or is a "compute unit" not the same as a "core"?
I also don't understand how you create the matrices ($a, $b) - I can't find anything about the arange or reshape methods. Could you please show me an example of how to convert a PHP array into one accepted by the code?
Best wishes, G.
That's a good question.
The GPU is faster when doing large matrix operations.
** == 1000x1000 ==
Mode:CPU(CPU-RAW)
Supports float64 on this device: TRUE
CPU core name :Sandybridge
CPU theads:4
data type: float32
computing size: [1000,1000]
epochs: 100
elapsed time:3.3692150115967
blocking mode...
Mode:GPU(GPU)
Supports float64 on this device: FALSE
GPU device name :Intel(R) HD Graphics 4000
GPU Compute units :16
data type: float32
computing size: [1000,1000]
epochs: 100
elapsed time:6.1611762046814
** == 5000x5000 ==
Mode:CPU(CPU-RAW)
Supports float64 on this device: TRUE
CPU core name :Sandybridge
CPU theads:4
data type: float32
computing size: [5000,5000]
epochs: 10
elapsed time:40.186738014221
blocking mode...
Mode:GPU(GPU)
Supports float64 on this device: FALSE
GPU device name :Intel(R) HD Graphics 4000
GPU Compute units :16
data type: float32
computing size: [5000,5000]
epochs: 10
elapsed time:27.804306983948
I don't know how NVIDIA configures compute units.
To illustrate the concept, here is a somewhat imprecise explanation:
Compute units are the number of threads that can run independently and in parallel; cores are the number of arithmetic units. The cores controlled by one compute unit can only perform the same operation at the same time.
For example, if 8 compute units each control 16 cores, 128 cores run simultaneously.
The GPU is built on a completely different concept than the CPU.
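The arithmetic in that example, as a trivial stand-alone sketch (the numbers are the illustrative ones from the paragraph above, not real device values):

```php
<?php
// Toy arithmetic from the explanation above: how many cores run at once.
$computeUnits = 8;   // independent, parallel schedulers
$coresPerUnit = 16;  // lock-step lanes controlled by one compute unit
echo $computeUnits * $coresPerUnit, " cores running simultaneously\n"; // prints 128
```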
See this page:
They support only floating point; integers are not supported. Also, although single precision and double precision are separate functions in BLAS (for example, sgemm and dgemm), rindow automatically switches between them depending on the data type and calls the right one, so the underlying function name changes.
We have selected and implemented frequently used functions from among these.
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\NDArray;
use Interop\Polite\Math\Matrix\OpenCL;
$mo = new MatrixOperator();
$dtype = NDArray::float32;
//$dtype = NDArray::float64;
//// CPU
$la = $mo->la();
$a = $la->array([[1,2],[3,4]],dtype:$dtype);
$b = $la->array([[5,6],[7,8]],dtype:$dtype);
$c = $la->array([9,10],dtype:$dtype);
$y = $la->gemm($a,$b); // y = matrix-matrix-multiply(a,b)
$z = $la->gemv($a,$c); // z = matrix-vector-multiply(a,c)
$la->axpy($a,$b); // b = a + b
echo "y=".$mo->toString($y)."\n";
echo "z=".$mo->toString($z)."\n";
echo "b=".$mo->toString($b)."\n";
print_r($y->toArray());
//// GPU
$la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$la->blocking(true);
$a = $mo->la()->array([[1,2],[3,4]],dtype:$dtype);
$b = $mo->la()->array([[5,6],[7,8]],dtype:$dtype);
$c = $mo->la()->array([9,10],dtype:$dtype);
$a = $la->array($a);
$b = $la->array($b);
$c = $la->array($c);
$y = $la->gemm($a,$b); // y = matrix-matrix-multiply(a,b)
$z = $la->gemv($a,$c); // z = matrix-vector-multiply(a,c)
$la->axpy($a,$b); // b = a + b
echo "y=".$mo->toString($y)."\n";
echo "z=".$mo->toString($z)."\n";
echo "b=".$mo->toString($b)."\n";
print_r($y->toArray());
Thanks, ;-)
Thank you again! The example clarified a lot. I will dig deeper into the compute-unit concept with my NVIDIA card. From other tools using GPUs (like TychoTracker) I learned that computation speed sometimes increases 100x versus the CPU.
As for the example - it helped a lot, especially the array() function. I'm still not sure whether reshape will be of any use to me.
BTW, I found an issue:
For the code:
$la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
...
$a = $la->array($A,dtype:$dtype);
I experience the error:
PHP Fatal error: Uncaught Error: Unknown named parameter $dtype
For $la = $mo->la();
and $la = $mo->laRawMode();
it works without any error message.
And another issue (probably related): the computation results differ when using CPU or CPU-RAW vs. GPU.
And here are results of the same multiplication done in core PHP code:
EDIT:
Indeed, when changed $a = $la->array($A,dtype:$dtype);
to $a = $la->array($A,$dtype);
GPU results are the same like the rest:
It's because the second argument of array() function is called $flags in LinearAlgebraCL.php module.
Anyway, BLAS is magic! I tested matrix multiplication with my real-life example, and - even without any GPU support - the results are amazing:
COREPHP : 81.778 sec.
CPU-NORMAL: 1.571 sec. (1.528 + 0.020 + 0.023)
CPU-RAW: 1.521 sec. (1.478 + 0.019 + 0.023)
GPU: 1.548 sec. (1.467 + 0.058 + 0.023)
The first number in brackets is the conversion time of both input arrays from PHP, e.g. $a = $la->array($A,$dtype); $b = $la->array($B,$dtype);
the second is the calculation time, e.g. $c = $la->gemm($a,$b);
and the third is the time to convert the result back to a PHP array, e.g. $C = $c->toArray();
Do you think there is any possibility to decrease the time of the first operation? It is now about 97% of the whole execution time.
Actually, I shouldn't have given the GPU version of array() the same name, because its meaning is completely different from the CPU version. I didn't want to write a lot of code, so I later added the ability to write array([0,1,2]) directly. However, this method is not recommended.
When computing with the GPU, data is transferred as follows:
PHP world memory => CPU world memory => GPU world memory
The first copy uses $cpuLA->array(); the second copy uses $gpuLA->array().
When using the GPU, be sure to write it as follows:
$cpuArray = $cpuLA->array($phpArray);
$gpuArray = $gpuLA->array($cpuArray);
I have a very uneasy feeling about GPUs: the calculation results all differ depending on the GPU hardware and drivers.
The reason PHP's internal calculations and the CPU results are identical is that both are computed on the same CPU.
This may be hard to believe for those unfamiliar with scientific computing, but we usually accept such differences in calculation results as a matter of course, and we always process data with robust algorithms that work despite them.
I don't think any single change to the matrix product will improve it further.
If your entire application is built on PHP's native arrays, you would need to rewrite it all on top of a matrix math library, and also restructure the program to be optimized for the GPU. The GPU's benefit may appear only then. But if your application performs computations the GPU isn't good at, it won't help.
Thank you again for your explanation. As rewriting the code is my long-term plan, for now I have to focus on optimizing the existing one. Probably I can manipulate my matrices to pass them to MatrixOperator in a way that saves conversion time.
First findings on the construction of a new NDArrayPhp object. I measured the execution time of these operations:
A: $dummyBuffer = new ArrayObject();
B: $this->array2Flat($array,$dummyBuffer,$idx,$prepare=true);
C: $this->_buffer = $this->newBuffer($idx,$dtype);
D: $this->array2Flat($array,$this->_buffer,$idx,$prepare=false);
E: $shape = $this->genShape($array);
F: $this->assertShape($shape);
G: $size = (int)array_product($shape);
Exec time of A = 0.0000071 sec.
Exec time of B = 0.2654539 sec.
Exec time of C = 0.0169711 sec.
Exec time of D = 0.6417275 sec.
Exec time of E = 0.0000082 sec.
Exec time of F = 0.0000044 sec.
Exec time of G = 0.0000135 sec.
Total exec time: 0.931286
So, the most costly parts are B & D, i.e. the array2Flat function.
Looking into NDArrayPhp:
protected function array2Flat($A, $F, &$idx, $prepare)
{
    if(is_array($A)) {
        ksort($A);
    } elseif($A instanceof ArrayObject) {
        $A->ksort();
    }
In particular cases the sorting is probably not necessary. When the above fragment is commented out, some improvement is achieved:
Exec time of A = 0.0000069 sec.
Exec time of B = 0.1097541 sec.
Exec time of C = 0.0170492 sec.
Exec time of D = 0.4596682 sec.
Exec time of E = 0.0000084 sec.
Exec time of F = 0.0000053 sec.
Exec time of G = 0.0000156 sec.
Exec time: 0.593568
As you may have noticed, the most time-consuming part is converting between PHP arrays and CPU memory.
NDArrayPhp does the heavy lifting so that a loosely structured PHP array can easily be loaded into CPU memory.
However, if the data structure is known in advance, the developer can populate the memory directly.
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\NDArray;
use Interop\Polite\Math\Matrix\OpenCL;

function php2array($data,$a) {
    foreach($data as $i=>$row) {
        foreach($row as $j=>$d) {
            $a[$i][$j] = $d;
        }
    }
}

$mo = new MatrixOperator();
$dtype = NDArray::float32;
//$dtype = NDArray::float64;
$dataA = [[1,2],[3,4]];
$dataB = [[5,6],[7,8]];

$la = $mo->la();
$g_la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$g_la->blocking(true);

$a = $la->alloc([2,2],dtype:$dtype);
$b = $la->alloc([2,2],dtype:$dtype);
php2array($dataA,$a);
php2array($dataB,$b);

$g_a = $g_la->array($a);
$g_b = $g_la->array($b);
$g_y = $g_la->gemm($g_a,$g_b);
$y = $g_la->toNDArray($g_y);

foreach($y as $row) {
    foreach($row as $d) {
        echo $d."\n";
    }
}
In other words, the rindow-math-matrix library does not assume PHP arrays as input.
In fact, when rindow-math-matrix is used for machine learning, PHP arrays are not used as input; it is used with the input data already copied into CPU memory.
This is the design concept, because an application-specific algorithm is best suited to the task of loading application input data into the CPU memory space.
Well, I have added some code, but I have a problem with PHP objects - I cannot find the constructor of the OpenBlasBuffer class (called from the newBuffer method of the NDArrayPhp class). May I ask for your kind support once again?
new OpenBlasBuffer($size,$dtype);
I noticed that creating the buffer by adding elements one by one is slow. I'd like to test copying the whole array at once, but I feel a bit lost with class extensions, interfaces, and implements :(.
OpenBlasBuffer.php
use Rindow\OpenBLAS\Buffer as BufferImplement;
class OpenBlasBuffer extends BufferImplement implements LinearBuffer
{....}
Rindow\OpenBLAS\Buffer is implemented in Buffer.c;
OpenBlasBuffer is written entirely in C.
The only way to pass PHP data to the CPU memory space is via this Buffer.c.
The array2Flat function in NDArrayPhp also uses this Buffer's offsetSet() to store values one by one. That's why it is so slow!
In PHP,
$a[0] = 1.0;
and
$a->offsetSet(0, 1.0);
have the same meaning.
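That equivalence is just PHP's ArrayAccess interface at work - the bracket syntax is sugar for the offsetSet()/offsetGet() methods. A minimal stand-alone illustration (not rindow code; DemoBuffer is a made-up class for this demo):

```php
<?php
// Minimal ArrayAccess demo: $buf[0] = 1.0 is sugar for $buf->offsetSet(0, 1.0).
class DemoBuffer implements ArrayAccess {
    private array $data = [];
    public function offsetSet(mixed $i, mixed $v): void { $this->data[$i] = $v; }
    public function offsetGet(mixed $i): mixed { return $this->data[$i]; }
    public function offsetExists(mixed $i): bool { return isset($this->data[$i]); }
    public function offsetUnset(mixed $i): void { unset($this->data[$i]); }
}

$buf = new DemoBuffer();
$buf[0] = 1.0;            // implicitly calls offsetSet(0, 1.0)
$buf->offsetSet(1, 2.0);  // same mechanism, called explicitly
echo $buf[0] + $buf[1], "\n"; // prints 3
```

Because each element assignment is a full method call crossing the PHP/C boundary, filling a large buffer this way incurs per-element overhead, which is exactly the slowness observed above.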
Buffer also implements load() and dump() using binary data. However, it did not work well when floating-point values were converted to a string and loaded with the PHP standard function pack(), so I'm not using it for floating-point input.
The very slow conversion of PHP-world data into data that the CPU can handle directly is a big problem.
So far I haven't been able to get past this issue.
Finally I found a way to decrease the array-loading time from 0.809 sec to just 0.097 sec. It's based on pack() and works pretty well. As you can see, I dropped many checks, assuming the data are checked/prepared at an earlier stage of the code. The main trick is to pack a whole row of the array at once (call_user_func_array costs some time as well, so I asked the PHP team to consider something like vpack(string $format, array $values)). I verified the results by converting back to a PHP array with the toArray() method and comparing each cell.
I just copied the NDArrayPhp class and added my own constructor. This is probably not the optimal way, but it works ;-).
class NDArrayPhpTKL implements NDArray,Countable,Serializable,IteratorAggregate
...
    public function __construct(array $arr)
    {
        $buf = '';
        foreach( $arr as $row ) {
            array_unshift($row,'d*');
            $buf .= call_user_func_array('pack',$row);
        }
        $this->_buffer = $this->newBuffer(strlen($buf)/8,NDArray::float64);
        $this->_buffer->load($buf);
        $this->_offset = 0;
        $shape = $this->genShape($arr);
        $this->assertShape($shape);
        $this->_shape = $shape;
        $this->_dtype = NDArray::float64;
    }
Files, which I changed/added:
EDIT: somebody proposed a simpler solution with pack(), which works and is even faster:
    public function __construct(array $arr)
    {
        $buf = '';
        foreach( $arr as $row ) {
            $buf .= pack('d*',...$row);
        }
        $this->_buffer = $this->newBuffer(strlen($buf)/8,NDArray::float64);
        $this->_buffer->load($buf);
        $this->_offset = 0;
        $shape = $this->genShape($arr);
        $this->assertShape($shape);
        $this->_shape = $shape;
        $this->_dtype = NDArray::float64;
    }
Wonderful! If it works in your environment, you should use that method!
More elegantly, you can also write, for example:
function php2array($php)
{
    $buf = '';
    foreach($php as $row) {
        $buf .= pack('d*',...$row);
    }
    return $buf;
}
$a = $la->alloc([??,??],NDArray::float64);
$a->buffer()->load(php2array($php));
No need to modify the existing code.
Hi! Excellent! It can be done even more simply, inside a single function:
function myArray(object $la, array $arr) : object
{
    $buf = '';
    foreach($arr as $row) {
        $buf .= pack('d*',...$row);
    }
    $a = $la->alloc([count($arr),count(reset($arr))],NDArray::float64);
    $a->buffer()->load($buf);
    return $a;
}
However, this doesn't work in GPU mode:
$mo = new MatrixOperator;
$la = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$la->blocking(true);
$a = myArray($la,$A);
PHP Fatal error: Uncaught Error: Call to undefined method Rindow\Math\Matrix\OpenCLBuffer::load()
Still don't know why. I will investigate.
Please remember the memory space.
PHP world memory => CPU world memory => GPU world memory
$mo = new MatrixOperator();
$la = $mo->la();
$gla = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$gla->blocking(true);
$a = myArray($la,$A);
$ga = $gla->array($a);
### something ####
$y = $gla->toNDArray($gy);
Hi. I'm testing different ideas for the GPU, still without significant improvement versus the CPU. I'm wondering about some things. Can I ask for your advice again, please?
Should I call
$la = $mo->laRawMode();
and
$gla = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
only once and then reuse them for all subsequent matrix operations? I mean, should I call them for every new pair of matrices that needs to be gemm'ed, or is it enough to set them up once and then reuse them? In the code below, $la and $gla would be global variables set up in the main routine and then used each time regMultiply is called.
function regMultiply(array $A, array $B) : array
{
    global $_USE_BLAS,$_USE_GPU;
    if( $_USE_BLAS )
    {
        $mo = new MatrixOperator;
        $la = $mo->laRawMode();
        $A = regMyArray($la,$A);
        $B = regMyArray($la,$B);
        if( $_USE_GPU )
        {
            $gla = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
            $gla->blocking(true);
            $A = $gla->array($A);
            $B = $gla->array($B);
            $C = $gla->gemm($A,$B);
            $C = $gla->toNDArray($C);
        }
        else
            $C = $la->gemm($A,$B);
        return $C->toArray();
    }
    else
    {
        $rows = count($A);
        $cols = count($B[0]);
        $m = count($A[0]);
        $C = [];
        for( $i=0; $i<$rows; $i++ )
            for( $j=0; $j<$cols; $j++ )
            {
                $C[$i][$j] = 0;
                for( $r=0; $r<$m; $r++ )
                    $C[$i][$j] += $A[$i][$r]*$B[$r][$j];
            }
        return $C;
    }
}
Hi ;-)
1) $mo, $la, and $gla can be used any number of times if they are created only once in the application. But $mo, $la, and $gla are a set: they are assumed to be made from the same $mo. If you make $gla a global variable, you may also need to make $mo and $la global variables. I never use global variables, so I haven't tried it.
2) CPU and GPU memory are automatically freed when the NDArray and NDArrayCL PHP objects are freed. There is no way to close them explicitly. If PHP objects are not freed due to reference memory leaks, the CPU or GPU memory is not freed either. PHP's "Collecting Cycles" is UNRELIABLE.
3) If you want parameter performance tuning, you need to tune the CLBlast library. I don't know any detailed tuning information for CLBlast, so please ask the CLBlast team. If you rebuild the CLBlast library, you also need to rebuild the rindow-clblast extension against your DLL. Here's how:
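A sketch of point 1 - creating $mo, $la, and $gla once at startup and passing them in, instead of recreating them per call. This is an untested sketch based on the API shown earlier in this thread; passing the objects as parameters avoids globals entirely (regMyArray here is the user's helper from the earlier post):

```php
<?php
include('vendor/autoload.php');
use Rindow\Math\Matrix\MatrixOperator;
use Interop\Polite\Math\Matrix\OpenCL;

// Create the operator and both LA objects once, from the same MatrixOperator.
$mo  = new MatrixOperator();
$la  = $mo->laRawMode();
$gla = $mo->laAccelerated('clblast',['deviceType'=>OpenCL::CL_DEVICE_TYPE_GPU]);
$gla->blocking(true);

// Reuse the same objects for every multiplication.
function regMultiply(object $la, object $gla, array $A, array $B) : array
{
    $a = regMyArray($la,$A);       // PHP -> CPU memory
    $b = regMyArray($la,$B);
    $ga = $gla->array($a);         // CPU -> GPU memory
    $gb = $gla->array($b);
    $gc = $gla->gemm($ga,$gb);
    return $gla->toNDArray($gc)->toArray(); // GPU -> CPU -> PHP
}
```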
I have discovered* that
$buf = $buf.$var;
is approx. 23 times faster than
$buf .= $var;
Unbelievable!
Ech... strange.
And the next saving comes from replacing
foreach($A as $row)
with
for( $r=0; $r<count($A); $r++ )
With these changes, the myArray function saves approx. 70% of execution time.
*EDIT: it depends on the available memory. The first form is faster only if there is enough memory available for the script (for me, 8GB was not enough); otherwise the second form (with ".=") works much, much faster. I'm adding this info because I think it's worth knowing, though maybe it's not useful for you.
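For reference, a minimal pure-PHP micro-benchmark for comparing the two forms yourself. Timings vary a lot with PHP version, opcache, and available memory (as the EDIT above notes), so no particular winner is asserted here:

```php
<?php
// Compare '$s = $s . $x' (copy-then-assign) with '$s .= $x' (append).
function bench(callable $fn, int $n): float {
    $t0 = microtime(true);
    $fn($n);
    return microtime(true) - $t0;
}

$n = 10000;
$chunk = str_repeat('x', 64);

$copyConcat = function (int $n) use ($chunk): string {
    $s = '';
    for ($i = 0; $i < $n; $i++) { $s = $s . $chunk; }
    return $s;
};
$inPlace = function (int $n) use ($chunk): string {
    $s = '';
    for ($i = 0; $i < $n; $i++) { $s .= $chunk; }
    return $s;
};

printf("\$s = \$s . \$x : %.4f sec\n", bench($copyConcat, $n));
printf("\$s .= \$x     : %.4f sec\n", bench($inPlace, $n));
```

Both forms build the identical string; only the timing differs, so run it on your own machine before committing to either style.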
Hi @yuichiis , I have to say that the speed improvement is impressive! I got 10x acceleration, which means that calculations that used to take 7 days are now done in just a few hours. I've found that it can be even more productive when using threads from the PHP parallel extension (to be honest, this is a bit surprising to me - does it mean the GPU units still have some free time when using the rindow extension?). And it was enough to migrate only one function (matrix multiplication = cross product). Thank you for your excellent work on parallelism in PHP! Since time is crucial for my purposes, I must drop any unnecessary code. That's why I avoid using composer/autoloader (checking the PHP version against 5.6 and the cascade of includes and function calls each time is time-wasting). So I found a minimal configuration that works:
And in php.ini I added:
As you noticed, there is nothing about
rindow_clblast
. Does this mean that this extension is not for matrix manipulation? If I add it, will I get any extra speed-up?