In terms of training time and memory usage, cl-waffe2 still faces many challenges. In fact, even when training a simple MLP, cl-waffe2 can be 1.5 times slower than the equivalent operations in PyTorch. However, this is because cl-waffe2 is a JIT-compilation-based framework and the project started only a few months ago; there is still a large amount of room for optimization. The next goal is to reduce training time, so here is a list of things to be optimized:
cl-waffe2 IR
Graph-level optimization is still insufficient. In particular, the number of MoveTensorNode instructions should be reduced.
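To make the idea concrete, here is a minimal sketch (a toy IR, not cl-waffe2's actual representation, written in Python for brevity) of a graph-level pass that elides a redundant Move when its destination is written exactly once, so later nodes read the source tensor directly:

```python
from dataclasses import dataclass

@dataclass
class Node:
    op: str        # e.g. "Move", "Add"
    inputs: list   # names of input tensors
    output: str    # name of the output tensor

def elide_moves(ir):
    """Rewrite `out = Move(src)` into a rename when `out` is written
    only once, dropping the copy from the instruction list."""
    writes = {}
    for node in ir:
        writes[node.output] = writes.get(node.output, 0) + 1
    renames, kept = {}, []
    for node in ir:
        # apply pending renames to this node's inputs
        node.inputs = [renames.get(name, name) for name in node.inputs]
        if node.op == "Move" and writes[node.output] == 1:
            renames[node.output] = node.inputs[0]  # read src directly
        else:
            kept.append(node)
    return kept

ir = [
    Node("Move", ["x"], "tmp"),
    Node("Add",  ["tmp", "y"], "z"),
]
optimized = elide_moves(ir)
# The Move disappears and Add reads "x" directly.
```

A real pass would also have to check that the source tensor is not mutated between the copy and its use, which this sketch omits.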
FuseOps
FuseOps support is still limited. In the future, I want to implement search-based instruction fusion: for example, users declare a sequence of IR nodes to be replaced via a (defpath ...) macro, and the compiler applies it.
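A hedged sketch of what such a (defpath ...)-style rule could boil down to, written in Python for brevity: the user declares a sequence of op names, and the compiler replaces every occurrence in a flat instruction list with a single fused op.

```python
def fuse(ir, pattern, fused_op):
    """Replace each occurrence of `pattern` (a list of op names) in the
    flat instruction list `ir` with the single op `fused_op`."""
    out, i = [], 0
    while i < len(ir):
        if ir[i:i + len(pattern)] == pattern:
            out.append(fused_op)   # one fused kernel instead of N ops
            i += len(pattern)
        else:
            out.append(ir[i])
            i += 1
    return out

# e.g. fuse a Mul followed by an Add into one multiply-add kernel
program = ["Load", "Mul", "Add", "Store"]
fused = fuse(program, ["Mul", "Add"], "MulAdd")
# fused == ["Load", "MulAdd", "Store"]
```

The "search-based" part would sit on top of this: the compiler tries the declared patterns against the compiled IR and keeps the rewriting that minimizes some cost, rather than applying rules in a fixed order.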
The full use of SIMD Ops
・Use SLEEF for vectorized math functions
The full use of lparallel
Maximum speed-up can be achieved by keeping all data in SIMD registers and then parallelising across cores with lparallel.
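A rough sketch of the two-level strategy, with Python threads standing in for lparallel workers purely for illustration: the outer loop splits the work into per-core chunks, and each chunk's inner loop runs over contiguous memory, which is exactly the shape a native backend can vectorize with SIMD.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_axpy(alpha, xs, ys, workers=4):
    """Compute alpha*x + y elementwise, one contiguous chunk per worker."""
    n = len(xs)
    step = (n + workers - 1) // workers
    out = [0.0] * n

    def kernel(start):
        # contiguous inner loop: SIMD-friendly in a native implementation
        for i in range(start, min(start + step, n)):
            out[i] = alpha * xs[i] + ys[i]

    # outer parallelism, analogous to lparallel workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(kernel, range(0, n, step)))
    return out

result = parallel_axpy(2.0, [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
# result == [3.0, 5.0, 7.0, 9.0]
```

In Python the threads bring no real speed-up because of the GIL; the point is only the chunking structure that lets SIMD (inner loop) and multicore parallelism (outer loop) compose.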