In terms of training time and memory usage, cl-waffe2 still faces many challenges. In fact, even when training a simple MLP, cl-waffe2 can be 1.5 times slower than the equivalent operations in PyTorch. However, this is because cl-waffe2 is a JIT-compilation-based framework and the project started only a few months ago; there is still a large amount of room for optimization. The next goal is to reduce training time, so here is a list of things to be optimized:
cl-waffe2 IR
Graph-level optimization is still insufficient. In particular, the number of MoveTensorNode instructions should be reduced.
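To make the idea concrete, here is a minimal sketch (a toy IR, not cl-waffe2's actual representation, written in Python for brevity) of a graph-level pass that elides a redundant Move when its destination is written exactly once, so later nodes read the source tensor directly:

```python
from dataclasses import dataclass

@dataclass
class Node:
    op: str        # e.g. "Move", "Add"
    inputs: list   # names of input tensors
    output: str    # name of the output tensor

def elide_moves(ir):
    """Rewrite `out = Move(src)` into a rename when `out` is written
    only once, dropping the copy from the instruction list."""
    writes = {}
    for node in ir:
        writes[node.output] = writes.get(node.output, 0) + 1
    renames, kept = {}, []
    for node in ir:
        # apply pending renames to this node's inputs
        node.inputs = [renames.get(name, name) for name in node.inputs]
        if node.op == "Move" and writes[node.output] == 1:
            renames[node.output] = node.inputs[0]  # read src directly
        else:
            kept.append(node)
    return kept

ir = [
    Node("Move", ["x"], "tmp"),
    Node("Add",  ["tmp", "y"], "z"),
]
optimized = elide_moves(ir)
# The Move disappears and Add reads "x" directly.
```

A real pass would also have to check that the source tensor is not mutated between the copy and its use, which this sketch omits.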
FuseOps
FuseOps support is still limited. In the future, I want to implement search-based instruction fusion: for example, users declare a sequence of IR nodes to be replaced via a (defpath ...) macro, and the compiler applies it.
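A hedged sketch of what such a (defpath ...)-style rule could boil down to, written in Python for brevity: the user declares a sequence of op names, and the compiler replaces every occurrence in a flat instruction list with a single fused op.

```python
def fuse(ir, pattern, fused_op):
    """Replace each occurrence of `pattern` (a list of op names) in the
    flat instruction list `ir` with the single op `fused_op`."""
    out, i = [], 0
    while i < len(ir):
        if ir[i:i + len(pattern)] == pattern:
            out.append(fused_op)   # one fused kernel instead of N ops
            i += len(pattern)
        else:
            out.append(ir[i])
            i += 1
    return out

# e.g. fuse a Mul followed by an Add into one multiply-add kernel
program = ["Load", "Mul", "Add", "Store"]
fused = fuse(program, ["Mul", "Add"], "MulAdd")
# fused == ["Load", "MulAdd", "Store"]
```

The "search-based" part would sit on top of this: the compiler tries the declared patterns against the compiled IR and keeps the rewriting that minimizes some cost, rather than applying rules in a fixed order.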
The full use of SIMD Ops
・Use SLEEF for vectorized math functions
The full use of lparallel
Maximum speed-up can be achieved by keeping all data in SIMD registers and then parallelising across cores with lparallel.
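A rough sketch of the two-level strategy, with Python threads standing in for lparallel workers purely for illustration: the outer loop splits the work into per-core chunks, and each chunk's inner loop runs over contiguous memory, which is exactly the shape a native backend can vectorize with SIMD.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_axpy(alpha, xs, ys, workers=4):
    """Compute alpha*x + y elementwise, one contiguous chunk per worker."""
    n = len(xs)
    step = (n + workers - 1) // workers
    out = [0.0] * n

    def kernel(start):
        # contiguous inner loop: SIMD-friendly in a native implementation
        for i in range(start, min(start + step, n)):
            out[i] = alpha * xs[i] + ys[i]

    # outer parallelism, analogous to lparallel workers
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(kernel, range(0, n, step)))
    return out

result = parallel_axpy(2.0, [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0])
# result == [3.0, 5.0, 7.0, 9.0]
```

In Python the threads bring no real speed-up because of the GIL; the point is only the chunking structure that lets SIMD (inner loop) and multicore parallelism (outer loop) compose.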