hikettei / cl-waffe2

[Experimental] Graph and Tensor Abstraction for Deep Learning all in Common Lisp
https://hikettei.github.io/cl-waffe2/
MIT License

A survey of improving performance #89

Closed: hikettei closed this issue 10 months ago

hikettei commented 10 months ago

In terms of training time and memory usage, cl-waffe2 still faces a lot of challenges. In fact, even when training a simple MLP, cl-waffe2 is about 1.5 times slower than the same operations in PyTorch. However, this is partly because cl-waffe2 is a JIT-compilation-based framework and the project is only a few months old; it still has a large amount of untapped optimization potential. The next goal is to reduce training time, so here is a list of things to be optimized:

cl-waffe2 IR

Graph-level optimization is still insufficient. In particular, the number of MoveTensorNode instructions should be reduced.
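As a rough illustration of the kind of pass this implies, here is a copy-elision sketch over a toy instruction list. This is not cl-waffe2's actual IR: the `(op out args...)` representation, the `:move` opcode, and `elide-moves` are all hypothetical stand-ins.

```lisp
;; A minimal sketch (NOT cl-waffe2's real IR): each instruction is a
;; (op out in1 in2 ...) list. A :move whose destination is read exactly
;; once can be elided by rewriting its reader to use the source directly.
(defun elide-moves (instructions)
  "Remove :move instructions whose destination is read exactly once."
  (let ((uses (make-hash-table :test #'eq)))
    ;; Count how many times each variable is read as an argument.
    (dolist (inst instructions)
      (dolist (arg (cddr inst))
        (incf (gethash arg uses 0))))
    (let ((subst (make-hash-table :test #'eq))
          (result '()))
      (dolist (inst instructions (nreverse result))
        (destructuring-bind (op out &rest args) inst
          ;; Apply pending substitutions to the arguments.
          (let ((args (mapcar (lambda (a) (or (gethash a subst) a)) args)))
            (if (and (eq op :move) (= (gethash out uses 0) 1))
                ;; Elide the copy: readers of OUT use the source instead.
                (setf (gethash out subst) (first args))
                (push (list* op out args) result))))))))

;; (elide-moves '((:move t1 x) (:add y t1 b)))  ;; => ((:add y x b))
```

A real pass would additionally have to respect in-place semantics and aliasing, which is exactly why MoveTensorNode elimination is nontrivial.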

FuseOps

Support for FuseOps is still poor. In the future, I want to implement search-based instruction fusion: for example, users declare a sequence of IR nodes to be replaced via a (defpath ...) macro, and the compiler reads and applies it.
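Since (defpath ...) does not exist yet, the following is only a hypothetical sketch of what such a rule-registration macro and its rewriter could look like; `*paths*`, `defpath`, and `apply-paths` are all made-up names for illustration.

```lisp
;; Hypothetical sketch of a pattern-based fusion mechanism.
(defparameter *paths* '()
  "Registered fusion rules: (name pattern replacement).")

(defmacro defpath (name pattern replacement)
  "Register a rule: the op-name sequence PATTERN becomes REPLACEMENT."
  `(push (list ',name ',pattern ',replacement) *paths*))

(defun apply-paths (ops)
  "Greedily replace matching op-name subsequences with their fused form."
  (dolist (rule *paths* ops)
    (destructuring-bind (name pattern replacement) rule
      (declare (ignore name))
      (loop for pos = (search pattern ops) while pos do
        (setf ops (append (subseq ops 0 pos)
                          replacement
                          (subseq ops (+ pos (length pattern)))))))))

;; Example rule: fuse a multiply followed by an add into one FMA op.
(defpath mul-add-fusion (:mul :add) (:fused-mul-add))

;; (apply-paths '(:mul :add :sin))  ;; => (:fused-mul-add :sin)
```

A search-based version would match on full node structure (shapes, dtypes, data dependencies) rather than on a flat name sequence, but the rule-as-data shape stays the same.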

The full use of SIMD Ops

- Use SLEEF
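For reference, SLEEF's scalar entry points can be reached from Common Lisp through plain CFFI, as sketched below; the vector entry points (e.g. `Sleef_sinf4_u10`) take SIMD register types and would need deeper compiler support. The library name and load path here are assumptions about the local install.

```lisp
;; Hedged sketch: binding SLEEF's scalar double-precision sine via CFFI.
;; Assumes (ql:quickload :cffi) and a libsleef visible to the loader.
(cffi:define-foreign-library libsleef
  (t (:default "libsleef")))
(cffi:use-foreign-library libsleef)

;; Sleef_sin_u10: sine with 1.0-ULP accuracy bound.
(cffi:defcfun ("Sleef_sin_u10" sleef-sin) :double
  (x :double))

;; (sleef-sin 0.0d0)  ;; => 0.0d0
```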

The full use of lparallel

Maximum speed-up can be achieved by keeping all data in SIMD registers and then parallelising across cores with lparallel.
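A minimal sketch of what the lparallel side of this could look like: a kernel of worker threads plus an element-wise `pmap`. The `parallel-saxpy` function is a made-up example, and the inner lambda is plain Lisp where a real kernel would be a SIMD-vectorized loop over chunks.

```lisp
;; Assumes (ql:quickload :lparallel). Create a pool of 4 worker threads;
;; in practice this would match the host's core count.
(setf lparallel:*kernel* (lparallel:make-kernel 4))

(defun parallel-saxpy (alpha x y)
  "Compute alpha*x + y element-wise, distributed over worker threads."
  (lparallel:pmap 'vector
                  (lambda (xi yi) (+ (* alpha xi) yi))
                  x y))

;; (parallel-saxpy 2.0 #(1.0 2.0) #(10.0 20.0))  ;; => #(12.0 24.0)
```

`lparallel:pmap` mirrors `cl:map` (result type, function, sequences), so it slots into existing element-wise code with minimal changes; the per-element closure overhead is why chunked SIMD loops inside each task would be the real target.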