hikettei / cl-waffe2

[Experimental] Graph and Tensor Abstraction for Deep Learning all in Common Lisp
https://hikettei.github.io/cl-waffe2/
MIT License
122 stars · 5 forks

Introducing cl-waffe2 IR #75

Closed hikettei closed 11 months ago

hikettei commented 11 months ago

Changes

Added cl-waffe2 IR

A brand new cl-waffe2 VM makes compilation roughly 100x faster by compiling networks created by defnode into the cl-waffe2 IR.

Compiling a CNN completes in well under 0.1 seconds:

(let ((out (ctrain (CNN) (randn `(10 3 32 32)) (randn `(10 10)))))
    (time (build out)))

Evaluation took:
  0.068 seconds of real time
  0.068376 seconds of total run time (0.068081 user, 0.000295 system)
  100.00% CPU
  706 lambdas converted
  157,765,176 processor cycles
  36,377,168 bytes consed

cl-waffe2 IR

The cl-waffe2 IR is a simple data structure of the form A <- f(B C D), where f is an operation represented by a lambda function.
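The A <- f(B C D) form can be sketched as a plain struct holding the compiled lambda, its output tensor, and its operands. This is a hypothetical illustration only; the names and slots below are not cl-waffe2's actual WfInst representation:

```lisp
;; Hypothetical sketch of a single IR instruction: OUT <- f(ARGS...).
;; cl-waffe2's real WfInst structure differs; this only illustrates the idea.
(defstruct (wf-inst (:constructor make-wf-inst (op out args)))
  (op   nil :type function) ; compiled lambda implementing f
  (out  nil)                ; tensor receiving the result (A)
  (args nil :type list))    ; operand tensors (B C D)

;; Executing a topologically sorted sequence of such instructions
;; is then a simple loop, with no graph traversal at runtime.
(defun run-ir (instructions)
  (dolist (inst instructions instructions)
    (setf (wf-inst-out inst)
          (apply (wf-inst-op inst) (wf-inst-args inst)))))
```

Because each instruction is just a funcall of a precompiled lambda, executing the IR avoids re-walking the node graph on every forward pass.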

CL-WAFFE2> (disassemble-waffe2-ir
        (cl-waffe2/nn:!relu (parameter (randn `(10 10)))))

== [disassemble-waffe2-ir: Forward] ======
<WfInst[Compiled: MOVETENSORNODE-CPUTENSOR] : TID7463.state <= apply( TID7463(10 10) <Param>TID7450(10 10) )>
<WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID7479.state <= apply( TID7479(10 10) TID7463(10 10) )>
<WfInst[Compiled: WHERE-OPERATION-NODE-LISPTENSOR] : TID7455.state <= apply( <Param>TID7450(10 10) TID7455(10 10) )>
<WfInst[Compiled: <DELETED>] : TID7471.state <= apply( TID7471(10 10) TID7455(10 10) )>
<WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID7487.state <= apply( TID7487(10 10) TID7471(10 10) )>
<WfInst[Compiled: MULNODE-LISPTENSOR] : TID7479.state <= apply( TID7479(10 10) TID7487(10 10) )>

== [disassemble-waffe2-ir: Backward] ======
<WfInst[Compiled: Block -> MULNODE-LISPTENSOR-BACKWARD {
        <WfInst[Compiled: MOVETENSORNODE-CPUTENSOR] : TID7590.state <= apply( TID7590(10 10) TID7587(10 10) )>
        <WfInst[Compiled: MULNODE-LISPTENSOR] : TID7590.state <= apply( TID7590(10 10) TID7471(10 10) )>
        <WfInst[Compiled: MOVETENSORNODE-CPUTENSOR] : TID7616.state <= apply( TID7616(10 10) TID7590(10 10) )>
    }
  ] : TID7528.state <= apply( TID7499(10 10) )>
<WfInst[Compiled: Block -> MOVETENSORNODE-CPUTENSOR-BACKWARD {
        <WfInst[Compiled: MOVETENSORNODE-CPUTENSOR] : TID7579.state <= apply( TID7579(10 10) TID7576(10 10) )>
    }
  ] : TID7544.state <= apply( TID7528(10 10) )>
<WfInst[Compiled: Block -> MOVETENSORNODE-CPUTENSOR-BACKWARD {
        <WfInst[Compiled: MOVETENSORNODE-CPUTENSOR] : TID7568.state <= apply( TID7568(10 10) TID7565(10 10) )>
    }
  ] : TID7552.state <= apply( TID7544(10 10) )>
<WfInst[Compiled: ADDNODE-CPUTENSOR] : TID7452.state <= apply( TID7452(10 10) TID7552(10 10) )>

cl-waffe2 takes the following steps to achieve DAG-specific acceleration.

1. [Constructing the DAG by defnode/call/forward]

-> when called with build:

2. [Applying a topological sort to the given forward/backward networks, then applying in-place mutation]

3. [Generating cl-waffe2 IR for forward/reverse mode]

4. [Compiling each node using cache.lisp, specialized to each rank/type/layout of matrices]

5. [If any, applying JITxxTensor devices] (not yet implemented)

-> when called with proceed:

2. [Applying in-place mutation with minimal overhead]

3. [Evaluating the computation node directly]
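The topological sort in step 2 can be sketched as a post-order depth-first traversal. This is a generic sketch, not cl-waffe2's implementation; `dependencies-fn` is a hypothetical accessor returning the tensors a node reads from:

```lisp
;; Generic DFS topological sort over a DAG. NODES is a list of sink
;; nodes; DEPENDENCIES-FN maps a node to the nodes it depends on.
;; Returns nodes ordered so every dependency precedes its consumer,
;; which is the order the IR instructions are emitted in.
(defun topological-sort (nodes dependencies-fn)
  (let ((visited (make-hash-table :test #'eq))
        (order '()))
    (labels ((visit (node)
               (unless (gethash node visited)
                 (setf (gethash node visited) t)
                 (dolist (dep (funcall dependencies-fn node))
                   (visit dep))
                 (push node order))))
      (mapc #'visit nodes)
      (nreverse order))))
```

Once the nodes are linearized this way, in-place mutation becomes a pass over a flat list rather than a graph rewrite.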

Parallelising call-with-view and supporting FuseOps without JIT devices remain future tasks. I am considering parallelising with lparallel rather than OpenMP.
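For reference, a minimal sketch of what element-wise parallelisation with lparallel could look like. This is not cl-waffe2 code; `lparallel:make-kernel` and `lparallel:pmap` are the library's real entry points, but how they would attach to call-with-view is an open design question:

```lisp
(ql:quickload :lparallel)

;; A kernel holds the worker threads; created once per session.
(setf lparallel:*kernel* (lparallel:make-kernel 4))

;; Apply a function to each element of a vector in parallel,
;; roughly how a call-with-view body could be dispatched per region.
(defun parallel-relu (vec)
  (lparallel:pmap 'vector (lambda (x) (max x 0.0)) vec))
```

Unlike OpenMP, this keeps the whole pipeline in portable Common Lisp, at the cost of lparallel's per-task scheduling overhead on very small regions.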