Open calebwin opened 4 years ago
Hi,

This looks neat! I was curious about how hard it would be to implement a GPU-accelerated backend with Emu. Would it amount to implementing the API in linalg?

Hi @calebwin, glad to get in touch and thanks for looking into SmartCore!
In theory yes, but in practice it depends on how well the methods from BaseMatrix map to the idea of parallelizing your logic, and on the type constraints imposed by Emu.

Also, there is a question of the trade-off between the performance gain you get by shifting computation to the GPU and the slowdown caused by copying the arrays to the GPU. From the little I know about GPUs, it seems to me that by targeting complex methods that can be easily parallelized and that span many CPU cycles, we have a better chance of improving the performance of the library than if we implement every single method from BaseMatrix. Is this correct? If yes, take a look at the matrix decomposition routines that I've copied from the Numerical Recipes book:
For example, if you manage to improve performance of SVD and QR decomposition you automatically improve performance of linear regression and PCA methods.
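To make that trade-off concrete: copying an n×n f32 matrix to the GPU and back moves O(n²) bytes, while routines like matmul, QR, and SVD perform O(n³) arithmetic, so the copy cost is amortized as n grows. Here is a back-of-the-envelope sketch in Rust; all bandwidth and throughput constants are made-up placeholders, not measurements of any real hardware:

```rust
// Rough model of when GPU offload pays off. All constants below are
// illustrative placeholders, not benchmarks.
fn main() {
    let bus_bytes_per_sec = 12e9; // assumed host<->GPU copy bandwidth
    let gpu_flops_per_sec = 5e12; // assumed GPU arithmetic throughput
    let cpu_flops_per_sec = 1e11; // assumed CPU arithmetic throughput

    for n in [64u64, 512, 4096] {
        let copy_bytes = 2.0 * (n * n) as f64 * 4.0; // upload + download, f32
        let flops = 2.0 * (n as f64).powi(3);        // O(n^3) work, e.g. matmul
        let gpu_secs = copy_bytes / bus_bytes_per_sec + flops / gpu_flops_per_sec;
        let cpu_secs = flops / cpu_flops_per_sec;
        println!("n = {n:4}: cpu {cpu_secs:.2e}s vs gpu+copy {gpu_secs:.2e}s");
    }
}
```

Under these assumptions the copy dominates at small n, but by n = 4096 the O(n³) arithmetic dwarfs the O(n²) transfer, which is why the decomposition routines are good first targets.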
In any case, I like your idea. Feel free to experiment and let me know if you run into any problems caused by the API structure and method signatures. In the worst case we can always change the methods in the linalg module if needed.
In theory yes, but in practice it depends on how well the methods from BaseMatrix map to the idea of parallelizing your logic, and on the type constraints imposed by Emu.
I see. Functions like matmul map easily, but if functions like get are used heavily (and it looks like they are) then the API doesn't work so well.
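To illustrate the difference in plain Rust (no Emu API here, just the shape of the computation): matmul can be written so that every output element is computed independently, which is exactly the one-GPU-thread-per-element form, while get/set-driven code serializes on single-element access.

```rust
// Illustrative only: a matmul written in the "one invocation per output
// element" shape that GPU kernels use. Each (i, j) body is independent,
// so a GPU can run them all in parallel. Code built around scalar
// get()/set() calls has no such structure to exploit.
fn matmul_kernel_shape(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    for i in 0..n {
        for j in 0..n {
            // this inner body is what a single GPU thread would execute
            let mut acc = 0.0f32;
            for k in 0..n {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

fn main() {
    let (a, b) = (vec![1.0, 2.0, 3.0, 4.0], vec![5.0, 6.0, 7.0, 8.0]);
    let mut c = vec![0.0f32; 4];
    matmul_kernel_shape(&a, &b, &mut c, 2);
    assert_eq!(c, vec![19.0, 22.0, 43.0, 50.0]);
}
```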
Also, there is a question of the trade-off between the performance gain you get by shifting computation to the GPU and the slowdown caused by copying the arrays to the GPU. From the little I know about GPUs, it seems to me that by targeting complex methods that can be easily parallelized and that span many CPU cycles, we have a better chance of improving the performance of the library than if we implement every single method from BaseMatrix. Is this correct? If yes, take a look at the matrix decomposition routines that I've copied from the Numerical Recipes book:
This is a valid concern, but I think having DeviceDenseMatrix, etc. that contain a DeviceBox<[f32]> would allow data to persist on the GPU, eliminating unnecessary transfers.
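A minimal sketch of that idea. `DeviceBuffer` below is a hypothetical stand-in for Emu's `DeviceBox<[f32]>` (the real type lives on the GPU; this mock just marks where the copies would happen):

```rust
// Hypothetical sketch: `DeviceBuffer` stands in for Emu's DeviceBox<[f32]>.
// upload() and download() mark the only host<->GPU copy points; everything
// in between can stay resident on the device.
struct DeviceBuffer {
    data: Vec<f32>, // placeholder for memory that would live on the GPU
}

impl DeviceBuffer {
    fn upload(host: &[f32]) -> Self {
        DeviceBuffer { data: host.to_vec() } // one host -> device copy
    }
    fn download(&self) -> Vec<f32> {
        self.data.clone() // one device -> host copy
    }
}

// A matrix whose storage persists on the GPU between operations, so a
// pipeline like "matmul, then QR" pays for transfers only at the ends.
struct DeviceDenseMatrix {
    buf: DeviceBuffer,
    nrows: usize,
    ncols: usize,
}

impl DeviceDenseMatrix {
    fn from_slice(host: &[f32], nrows: usize, ncols: usize) -> Self {
        assert_eq!(host.len(), nrows * ncols);
        DeviceDenseMatrix { buf: DeviceBuffer::upload(host), nrows, ncols }
    }
    fn to_vec(&self) -> Vec<f32> {
        self.buf.download()
    }
}

fn main() {
    let m = DeviceDenseMatrix::from_slice(&[1.0, 2.0, 3.0, 4.0], 2, 2);
    // ... GPU kernels would operate on m.buf here, with no copies ...
    assert_eq!(m.to_vec(), vec![1.0, 2.0, 3.0, 4.0]);
    let _ = (m.nrows, m.ncols); // silence dead-code warnings in this sketch
}
```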
For example, if you manage to improve performance of SVD and QR decomposition you automatically improve performance of linear regression and PCA methods.
Right, I think this would be the way to implement a GPU backend: rewrite algorithms so they are either specialized for Emu or utilize common functions in BaseMatrix that have GPU-accelerated versions.
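A sketch of that second option, with hypothetical names throughout (BulkOps, axpy, CpuMatrix, GpuMatrix are not SmartCore's actual API): algorithms call a small set of bulk operations through a trait, the CPU matrix implements them with plain loops, and a GPU matrix would override them with Emu kernel launches.

```rust
// Hypothetical sketch, not SmartCore's API: decompositions are written
// against a small trait of bulk operations, so a GPU-backed matrix can
// swap in accelerated versions without touching the algorithm itself.
trait BulkOps {
    fn axpy(&mut self, alpha: f32, other: &[f32]); // self += alpha * other
}

struct CpuMatrix {
    data: Vec<f32>,
}

impl BulkOps for CpuMatrix {
    fn axpy(&mut self, alpha: f32, other: &[f32]) {
        // plain CPU loop
        for (x, y) in self.data.iter_mut().zip(other) {
            *x += alpha * *y;
        }
    }
}

struct GpuMatrix {
    data: Vec<f32>, // would be a DeviceBox<[f32]> in a real backend
}

impl BulkOps for GpuMatrix {
    fn axpy(&mut self, alpha: f32, other: &[f32]) {
        // a real backend would launch an Emu kernel here instead
        for (x, y) in self.data.iter_mut().zip(other) {
            *x += alpha * *y;
        }
    }
}

// An algorithm written once against BulkOps runs on either backend.
fn elimination_step<M: BulkOps>(m: &mut M, pivot_row: &[f32], factor: f32) {
    m.axpy(-factor, pivot_row);
}

fn main() {
    let mut cpu = CpuMatrix { data: vec![4.0, 6.0] };
    elimination_step(&mut cpu, &[2.0, 3.0], 2.0);
    assert_eq!(cpu.data, vec![0.0, 0.0]);

    let mut gpu = GpuMatrix { data: vec![4.0, 6.0] };
    elimination_step(&mut gpu, &[2.0, 3.0], 2.0);
    assert_eq!(gpu.data, vec![0.0, 0.0]);
}
```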
I also want to add that we can iteratively improve the performance of the library by reducing reliance on methods like get and set once the initial integration with Emu is done. I was going to do that anyway after the initial "expansion" phase of the project, where I try to cover as many ML methods as I can in a short period of time. During this phase I am not focusing on low-level optimization, while at the same time trying not to close the door on future improvements by making bad architecture decisions. But I am open to starting to move code around to improve performance now if you are willing to help me with that.
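As one example of what that migration could look like (plain Rust with hypothetical helpers, not code from the library): an inner loop that touches one element at a time through get/set becomes a single bulk operation over a row, which a GPU backend can then map to one kernel launch instead of n round trips.

```rust
// Sketch of the kind of incremental rewrite meant here.

// get/set style: one scalar read and one scalar write per element. On a
// GPU-backed matrix each call could mean a host<->device round trip.
fn scale_row_elementwise(m: &mut [Vec<f32>], row: usize, factor: f32) {
    for j in 0..m[row].len() {
        let v = m[row][j];      // "get"
        m[row][j] = v * factor; // "set"
    }
}

// bulk style: one operation over the whole row; a GPU backend can
// implement this as a single kernel launch.
fn scale_row_bulk(m: &mut [Vec<f32>], row: usize, factor: f32) {
    m[row].iter_mut().for_each(|v| *v *= factor);
}

fn main() {
    let mut a = vec![vec![1.0, 2.0], vec![3.0, 4.0]];
    let mut b = a.clone();
    scale_row_elementwise(&mut a, 0, 2.0);
    scale_row_bulk(&mut b, 0, 2.0);
    assert_eq!(a, b); // same result, very different access pattern
}
```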
Can the AI developed in smartcore learn to code this backend alone???