Changes

Memory-Locality Optimizing

Since an InputTensor is assumed to possibly hold non-zero values, an optimisation was implemented that reconnects the computation nodes to reduce extra cache tensors. Composing several !softmax calls now requires only one additional buffer:
```
;; Memory-Size is 2x times smaller!
CL-WAFFE2-REPL> (disassemble-waffe2-ir (!softmax (!softmax (randn `(10 10)))))

disassemble-waffe2-ir:
[Forward]:
<WfInst[op=MOVETENSORNODE-CPUTENSOR] : TID6503 <= op(TID6503{float, (10 10)} <Input>TID6415{float, (10 10)})>
<WfInst[op=VIEWTENSORNODE-T] : TID6497 <= op(TID6455{float, (10 1)} TID6497{float, (10 1)})>
<WfInst[op=SCALARMUL-CPUTENSOR] : TID6497 <= op(TID6497{float, (10 1)} <Input>TID6424{float, (1)})>
<WfInst[op=VIEWTENSORNODE-T] : TID6497 <= op(TID6497{float, (10 10)} TID6497{float, (10 1)})>
<WfInst[op=ADDNODE-CPUTENSOR] : TID6497 <= op(TID6497{float, (10 10)} <Input>TID6415{float, (10 10)})>
<WfInst[op=VIEWTENSORNODE-T] : TID6497 <= op(TID6497{float, (10 1)} TID6497{float, (10 10)})>
<WfInst[op=SCALARDIV-CPUTENSOR] : TID6497 <= op(TID6497{float, (10 1)} <Input>TID6419{float, (1)})>
<WfInst[op=VIEWTENSORNODE-T] : TID6497 <= op(TID6497{float, (10 10)} TID6497{float, (10 1)})>
<WfInst[op=SUBNODE-CPUTENSOR] : TID6503 <= op(TID6503{float, (10 10)} TID6497{float, (10 10)})>
<WfInst[op=MOVETENSORNODE-CPUTENSOR] : TID6584 <= op(TID6584{float, (10 10)} TID6503{float, (10 10)})>
<WfInst[op=EXPNODE-CPUTENSOR] : TID6584 <= op(TID6503{float, (10 10)} TID6584{float, (10 10)})>
<WfInst[op=SCALARMUL-CPUTENSOR] : TID6497 <= op(TID6497{float, (10 1)} <Input>TID6552{float, (1)})>
<WfInst[op=VIEWTENSORNODE-T] : TID6497 <= op(TID6497{float, (10 10)} TID6497{float, (10 1)})>
<WfInst[op=EXPNODE-CPUTENSOR] : TID6503 <= op(TID6503{float, (10 10)} TID6503{float, (10 10)})>
<WfInst[op=ADDNODE-CPUTENSOR] : TID6497 <= op(TID6497{float, (10 10)} TID6503{float, (10 10)})>
<WfInst[op=DIVNODE-CPUTENSOR] : TID6584 <= op(TID6584{float, (10 10)} TID6497{float, (10 10)})>
<WfInst[op=VIEWTENSORNODE-T] : TID6497 <= op(TID6668{float, (10 1)} TID6497{float, (10 1)})>
<WfInst[op=SCALARMUL-CPUTENSOR] : TID6497 <= op(TID6497{float, (10 1)} <Input>TID6637{float, (1)})>
<WfInst[op=VIEWTENSORNODE-T] : TID6497 <= op(TID6497{float, (10 10)} TID6497{float, (10 1)})>
<WfInst[op=ADDNODE-CPUTENSOR] : TID6497 <= op(TID6497{float, (10 10)} TID6584{float, (10 10)})>
<WfInst[op=VIEWTENSORNODE-T] : TID6497 <= op(TID6497{float, (10 1)} TID6497{float, (10 10)})>
<WfInst[op=SCALARDIV-CPUTENSOR] : TID6497 <= op(TID6497{float, (10 1)} <Input>TID6632{float, (1)})>
<WfInst[op=VIEWTENSORNODE-T] : TID6497 <= op(TID6497{float, (10 10)} TID6497{float, (10 1)})>
<WfInst[op=SUBNODE-CPUTENSOR] : TID6584 <= op(TID6584{float, (10 10)} TID6497{float, (10 10)})>
<WfInst[op=MOVETENSORNODE-CPUTENSOR] : TID6503 <= op(TID6503{float, (10 10)} TID6584{float, (10 10)})>
<WfInst[op=EXPNODE-CPUTENSOR] : TID6503 <= op(TID6584{float, (10 10)} TID6503{float, (10 10)})>
<WfInst[op=SCALARMUL-CPUTENSOR] : TID6497 <= op(TID6497{float, (10 1)} <Input>TID6765{float, (1)})>
<WfInst[op=VIEWTENSORNODE-T] : TID6497 <= op(TID6497{float, (10 10)} TID6497{float, (10 1)})>
<WfInst[op=EXPNODE-CPUTENSOR] : TID6584 <= op(TID6584{float, (10 10)} TID6584{float, (10 10)})>
<WfInst[op=ADDNODE-CPUTENSOR] : TID6497 <= op(TID6497{float, (10 10)} TID6584{float, (10 10)})>
<WfInst[op=DIVNODE-CPUTENSOR] : TID6503 <= op(TID6503{float, (10 10)} TID6497{float, (10 10)})>

31 Instructions | 6 Tensors | 6 Scalars
```
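For comparison, the baseline can be checked by disassembling a single !softmax over the same shape; the call below is only a suggested way to inspect the difference in the reported tensor count (output omitted):

```lisp
;; Suggested comparison (output omitted): the "N Tensors" summary printed
;; here can be set against the 6 Tensors reported for the composed version.
(disassemble-waffe2-ir (!softmax (randn `(10 10))))
```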
[Update] Static-Allocation

This update adds VMAllocation and with-static-allocation. A compiled composite reports its memory-pool:

```
CL-WAFFE2-REPL> (build (!sin (randn `(3 3))))
<Compiled-Composite(allocated-p=NIL)
    forward     : forward(model) -> CPUTENSOR{FLOAT}(3 3)
    backward    : backward(model) -> t
    memory-pool : one tensor(s)
                   L {3.6e-5}MB
>
```
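As a minimal sketch of how the static allocation is exercised, assuming forward and backward may be called repeatedly on the same Compiled-Composite (as the printed signatures above suggest) and that later calls reuse the same memory-pool:

```lisp
;; Minimal sketch; the reuse of the memory-pool across calls is an assumption
;; based on the allocation notes above, not a verified trace.
(let ((model (build (!sin (randn `(3 3))))))
  (forward model)   ; first call populates the memory-pool
  (backward model)  ; gradients run against the same allocation
  (forward model))  ; expected to reuse the existing buffers
```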
[Update] Thread-Safe defmodel-as

defmodel-as with :asif=:node won't produce additional compiling overhead; the compiled result is cached!

```
CL-WAFFE2-REPL> (Adam (parameter (randn `(3 3))))
<AbstractOptimizer: ADAM(
    minimize   : toplevel
    subject to : <TID6870>CPUTENSOR{FLOAT}(3 3)
)>
CL-WAFFE2-REPL>
```
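One way to observe the caching, assuming the optimizer's update step is compiled through defmodel-as as the note above suggests, is to time repeated instantiations; only the first call should pay any compilation cost.

```lisp
;; Hypothetical check: if the compiled body is cached, the second TIME
;; report should show little work beyond allocating the new tensors.
(time (Adam (parameter (randn `(3 3)))))
(time (Adam (parameter (randn `(3 3)))))
```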
Minor changes