Add single-precision support for CPU (1.5x speedup) and GPU (16x speedup); fix missing transpose in GPU ar_irls & glm; lower GPU memory requirements

@huppertt Please merge and test this PR when you can. The GPU versions of ar_irls and glm were missing a transpose, and this PR corrects that error. It also adds single-precision support for the DFE calculations, which improves performance substantially. In my testing, results differ only slightly, in the 5th significant digit, between CPU/GPU and single/double precision, while single precision on the GPU is ~45x faster than double precision on the CPU.
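As a rough illustration of why a single-precision run can drift in the later significant digits, the sketch below emulates float32 accumulation in plain Python. It is not code from this PR or from the toolbox; the helper names `to_f32` and `sum_f32` are hypothetical and exist only for this example.

```python
import struct


def to_f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate values while rounding to single precision after every add."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc


values = [0.1] * 100_000
double_sum = sum(values)      # accumulated in double precision
single_sum = sum_f32(values)  # accumulated in emulated single precision

# Relative gap between the two accumulations.
rel_gap = abs(single_sum - double_sum) / double_sum
print(f"double: {double_sum:.6f}  single: {single_sum:.6f}  rel gap: {rel_gap:.2e}")
```

A single-precision value carries roughly 7 significant decimal digits, so rounding error that builds up over many operations can plausibly surface around the 5th digit, which is in line with the variation reported above.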

On a test dataset:
- CPU (M1 Max), double precision: ~97.7 s per channel (dfe result 2.1183e4)
- CPU (M1 Max), single precision: ~54.3 s per channel (dfe result 2.1182e4)
- GPU (RTX3080), double precision: ~31.2 s per channel (dfe result 2.1183e4)
- GPU (RTX3080), single precision: ~1.8 s per channel (dfe result 2.1184e4)
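For reference, the speedup ratios and the relative spread of the dfe results can be recomputed directly from the per-channel figures listed above. This is only arithmetic on those four measurements. The dictionary layout and function names are made up for this sketch, and the ratios from this one dataset need not match headline figures taken from other runs.

```python
# Per-channel timings (seconds) and dfe results as reported above.
timings = {
    ("CPU", "double"): 97.7,
    ("CPU", "single"): 54.3,
    ("GPU", "double"): 31.2,
    ("GPU", "single"): 1.8,
}
dfe = {
    ("CPU", "double"): 2.1183e4,
    ("CPU", "single"): 2.1182e4,
    ("GPU", "double"): 2.1183e4,
    ("GPU", "single"): 2.1184e4,
}


def speedup(baseline, candidate):
    """How many times faster `candidate` is than `baseline`."""
    return timings[baseline] / timings[candidate]


def rel_diff(config, reference=("CPU", "double")):
    """Relative difference of a dfe result from the reference configuration."""
    return abs(dfe[config] - dfe[reference]) / dfe[reference]


print(f"CPU single vs CPU double: {speedup(('CPU', 'double'), ('CPU', 'single')):.1f}x")
print(f"GPU single vs GPU double: {speedup(('GPU', 'double'), ('GPU', 'single')):.1f}x")
print(f"GPU single vs CPU double: {speedup(('CPU', 'double'), ('GPU', 'single')):.1f}x")
print(f"max relative dfe difference: {max(rel_diff(c) for c in dfe):.1e}")
```

On these numbers the largest relative dfe difference is below 1e-4, which matches the reported variation in the 5th significant digit.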

huppertt / nirs-toolbox #22