hikettei / cl-waffe2

[Experimental] Graph and Tensor Abstraction for Deep Learning all in Common Lisp
https://hikettei.github.io/cl-waffe2/
MIT License

[Enhancement] Easy to know the bottleneck #93

Closed hikettei closed 10 months ago

hikettei commented 10 months ago

Introducing proceed-bench

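proceed-bench is a function that profiles a computation node, timing each compiled instruction so you can see where the bottleneck is. Instructions that take longer than the average execution time are marked with *, and a summary at the end ranks instruction types by their share of the total time.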

CL-WAFFE2-REPL> (proceed-bench (!mean (call (Conv2D 3 5 `(2 2)) (randn `(10 5 10 10)))))
 Time(s)   |   Instruction ( * - Beyonds the average execution time)
6.0e-4     | <WfInst[Compiled: VIEWTENSORNODE-T]                  : TID1413118 <= op(TID1413118(1 1 1 1) TID1413116(1 1 1 1))>
0.001103   | <WfInst[Compiled: SCALARMUL-CPUTENSOR]               : TID1412997 <= op(TID1412997(1 1 1 1) <Input>TID1412999(1))>
6.2e-4     | <WfInst[Compiled: VIEWTENSORNODE-T]                  : TID1413008 <= op(TID1413008(1 3 4 4) TID1412997(1 1 1 1))>
0.08184*   | <WfInst[Compiled: IM2COLNODE-LISPTENSOR]             : TID1412836 <= op(<Param>TID1412820(1 3 16 16) <Input>TID1412836(1 3 4 4 4 4))>
0.004949   | <WfInst[Compiled: PERMUTE-NODE-T]                    : TID1412847 <= op(<Input>TID1412836(1 3 4 4 4 4) <Input>TID1412847(1 4 4 3 4 4))>
0.083629*  | <WfInst[Compiled: MOVETENSORNODE-CPUTENSOR]          : TID1412856 <= op(TID1412856(1 4 4 3 4 4) <Input>TID1412847(1 4 4 3 4 4))>
0.00163    | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1412896 <= op(TID1412896(1 4 4 3 4 4) TID1412856(1 4 4 3 4 4))>
6.42e-4    | <WfInst[Compiled: RESHAPETENSORNODE-T]               : TID1412853 <= op(TID1412896(1 4 4 3 4 4) TID1412853(16 48))>
0.001543   | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1412913 <= op(TID1412913(16 48) TID1412853(16 48))>
5.84e-4    | <WfInst[Compiled: RESHAPETENSORNODE-T]               : TID1412910 <= op(TID1412913(16 48) TID1412910(48 16))>
0.001425   | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1412930 <= op(TID1412930(48 16) TID1412910(48 16))>
6.81e-4    | <WfInst[Compiled: RESHAPETENSORNODE-T]               : TID1412927 <= op(TID1412930(48 16) TID1412927(48 16))>
0.00113    | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1412947 <= op(TID1412947(48 16) TID1412927(48 16))>
0.019396*  | <WfInst[Compiled: MAXVALUE-NODE-CPUTENSOR]           : TID1412944 <= op(TID1412947(48 16) TID1412944(48 1))>
0.001512   | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1412978 <= op(TID1412978(48 1) TID1412944(48 1))>
6.04e-4    | <WfInst[Compiled: RESHAPETENSORNODE-T]               : TID1412975 <= op(TID1412978(48 1) TID1412975(1 4 4 3))>
0.005376   | <WfInst[Compiled: PERMUTE-NODE-T]                    : TID1412991 <= op(TID1412975(1 4 4 3) TID1412991(1 3 4 4))>
0.007209*  | <WfInst[Compiled: ADDNODE-CPUTENSOR]                 : TID1413008 <= op(TID1413008(1 3 4 4) TID1412991(1 3 4 4))>
6.42e-4    | <WfInst[Compiled: VIEWTENSORNODE-T]                  : TID1413040 <= op(TID1413040(1 1 1 1) TID1413008(1 3 4 4))>
0.002236   | <WfInst[Compiled: MOVETENSORNODE-CPUTENSOR]          : TID1413118 <= op(TID1413118(1 1 1 1) TID1413040(1 1 1 1))>
0.001326   | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1413159 <= op(TID1413159(1 1 1 1) TID1413118(1 1 1 1))>
5.97e-4    | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413050 <= op(TID1413050(1) <Input>TID1413045(1))>
3.3e-4     | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413056 <= op(TID1413056(1) <Input>TID1413047(1))>
8.02e-4    | <WfInst[Compiled: SCALARANDSCALARMUL-SCALARTENSOR]   : TID1413050 <= op(TID1413050(1) TID1413056(1))>
3.71e-4    | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413067 <= op(TID1413067(1) TID1413050(1))>
2.91e-4    | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413073 <= op(TID1413073(1) <Input>TID1413064(1))>
4.55e-4    | <WfInst[Compiled: SCALARANDSCALARMUL-SCALARTENSOR]   : TID1413067 <= op(TID1413067(1) TID1413073(1))>
3.34e-4    | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413084 <= op(TID1413084(1) TID1413067(1))>
3.04e-4    | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413090 <= op(TID1413090(1) <Input>TID1413081(1))>
3.81e-4    | <WfInst[Compiled: SCALARANDSCALARMUL-SCALARTENSOR]   : TID1413084 <= op(TID1413084(1) TID1413090(1))>
3.15e-4    | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413101 <= op(TID1413101(1) TID1413084(1))>
3.04e-4    | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413107 <= op(TID1413107(1) <Input>TID1413098(1))>
3.88e-4    | <WfInst[Compiled: SCALARANDSCALARMUL-SCALARTENSOR]   : TID1413101 <= op(TID1413101(1) TID1413107(1))>
3.31e-4    | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413151 <= op(TID1413151(1) TID1413101(1))>
0.002199   | <WfInst[Compiled: SCALARDIV-CPUTENSOR]               : TID1413159 <= op(TID1413159(1 1 1 1) TID1413151(1))>

35 Instructions | 36 Tensors

 Total Time: 0.22607897 sec

 Instruction                                         | Total time (s) | Time/Total (n-sample=1000)
<WfInst[Compiled: MOVETENSORNODE-CPUTENSOR]          | 0.085865 | 37.98009%
<WfInst[Compiled: IM2COLNODE-LISPTENSOR]             | 0.08184  | 36.19974%
<WfInst[Compiled: MAXVALUE-NODE-CPUTENSOR]           | 0.019396 | 8.579303%
<WfInst[Compiled: PERMUTE-NODE-T]                    | 0.010325 | 4.566988%
<WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] | 0.008566 | 3.7889414%
<WfInst[Compiled: ADDNODE-CPUTENSOR]                 | 0.007209 | 3.1887088%
<WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] | 0.003177 | 1.4052612%
<WfInst[Compiled: RESHAPETENSORNODE-T]               | 0.002511 | 1.1106739%
<WfInst[Compiled: SCALARDIV-CPUTENSOR]               | 0.002199 | 0.97266895%
<WfInst[Compiled: SCALARANDSCALARMUL-SCALARTENSOR]   | 0.002026 | 0.8961471%
{CPUTENSOR[float] :shape (1 1 1 1) -> :view (<(BROADCAST 1)> <(BROADCAST 1)> <(BROADCAST 1)>
                           <(BROADCAST 1)>) -> :visible-shape (1 1 1 1) :named ChainTMP1413156 
  ((((1.7790796))))
  :facet :input
  :requires-grad NIL
  :backward NIL}
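
The summary table can be cross-checked against the per-instruction timings: each row's total is the sum of the times of all instructions with that name, and Time/Total is that sum divided by Total Time (for example, MOVETENSORNODE-CPUTENSOR is 0.083629 + 0.002236 = 0.085865 s, which is about 37.98% of 0.22607897 s). The following is only a sketch of that aggregation in plain Common Lisp. summarize-timings and its (name . seconds) input format are hypothetical, made up for illustration, and are not cl-waffe2's API or its actual implementation.

```lisp
;; Hypothetical helper, NOT part of cl-waffe2: groups per-instruction
;; timings by instruction name and computes each group's share of the total.
(defun summarize-timings (timings)
  "TIMINGS is a list of (name . seconds) pairs.
Returns a list of (name total-seconds percent) rows,
sorted by total-seconds in descending order."
  (let ((table (make-hash-table :test #'equal))
        (total (reduce #'+ timings :key #'cdr)))
    ;; Accumulate the time spent per instruction name.
    (dolist (entry timings)
      (incf (gethash (car entry) table 0.0) (cdr entry)))
    ;; Turn the table into rows carrying a percentage of the total.
    (let ((rows '()))
      (maphash (lambda (name seconds)
                 (push (list name seconds (* 100 (/ seconds total))) rows))
               table)
      (sort rows #'> :key #'second))))

;; Usage with a few timings copied from the output above.
;; Only a subset of the 35 instructions is passed, so the percentages
;; are relative to this subset, not to the full 0.226 s run.
(summarize-timings
 '(("IM2COLNODE-LISPTENSOR"    . 0.08184)
   ("MOVETENSORNODE-CPUTENSOR" . 0.083629)
   ("MOVETENSORNODE-CPUTENSOR" . 0.002236)
   ("MAXVALUE-NODE-CPUTENSOR"  . 0.019396)))
```

Passing all 35 timings would give percentages relative to the whole run, which is what the table reports. The table lists only the ten largest entries. VIEWTENSORNODE-T and SCALARMUL-CPUTENSOR fall below the cutoff, but their times are still part of Total Time.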

[Fix] apply-in-place-mutation!

apply-in-place-mutation! is disabled when *no-grad* is t, but the condition for deleting a MoveTensorNode has been changed, so it can now delete more unused nodes.