Closed hikettei closed 10 months ago
Introducing proceed-bench

proceed-bench is a powerful function for profiling a computation node: it times each compiled instruction so you can see where the bottleneck is.
CL-WAFFE2-REPL> (proceed-bench (!mean (call (Conv2D 3 5 `(2 2)) (randn `(10 5 10 10)))))
 Time(s)  |   Instruction ( * - Beyonds the average execution time)
6.0e-4    | <WfInst[Compiled: VIEWTENSORNODE-T] : TID1413118 <= op(TID1413118(1 1 1 1) TID1413116(1 1 1 1))>
0.001103  | <WfInst[Compiled: SCALARMUL-CPUTENSOR] : TID1412997 <= op(TID1412997(1 1 1 1) <Input>TID1412999(1))>
6.2e-4    | <WfInst[Compiled: VIEWTENSORNODE-T] : TID1413008 <= op(TID1413008(1 3 4 4) TID1412997(1 1 1 1))>
0.08184*  | <WfInst[Compiled: IM2COLNODE-LISPTENSOR] : TID1412836 <= op(<Param>TID1412820(1 3 16 16) <Input>TID1412836(1 3 4 4 4 4))>
0.004949  | <WfInst[Compiled: PERMUTE-NODE-T] : TID1412847 <= op(<Input>TID1412836(1 3 4 4 4 4) <Input>TID1412847(1 4 4 3 4 4))>
0.083629* | <WfInst[Compiled: MOVETENSORNODE-CPUTENSOR] : TID1412856 <= op(TID1412856(1 4 4 3 4 4) <Input>TID1412847(1 4 4 3 4 4))>
0.00163   | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1412896 <= op(TID1412896(1 4 4 3 4 4) TID1412856(1 4 4 3 4 4))>
6.42e-4   | <WfInst[Compiled: RESHAPETENSORNODE-T] : TID1412853 <= op(TID1412896(1 4 4 3 4 4) TID1412853(16 48))>
0.001543  | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1412913 <= op(TID1412913(16 48) TID1412853(16 48))>
5.84e-4   | <WfInst[Compiled: RESHAPETENSORNODE-T] : TID1412910 <= op(TID1412913(16 48) TID1412910(48 16))>
0.001425  | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1412930 <= op(TID1412930(48 16) TID1412910(48 16))>
6.81e-4   | <WfInst[Compiled: RESHAPETENSORNODE-T] : TID1412927 <= op(TID1412930(48 16) TID1412927(48 16))>
0.00113   | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1412947 <= op(TID1412947(48 16) TID1412927(48 16))>
0.019396* | <WfInst[Compiled: MAXVALUE-NODE-CPUTENSOR] : TID1412944 <= op(TID1412947(48 16) TID1412944(48 1))>
0.001512  | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1412978 <= op(TID1412978(48 1) TID1412944(48 1))>
6.04e-4   | <WfInst[Compiled: RESHAPETENSORNODE-T] : TID1412975 <= op(TID1412978(48 1) TID1412975(1 4 4 3))>
0.005376  | <WfInst[Compiled: PERMUTE-NODE-T] : TID1412991 <= op(TID1412975(1 4 4 3) TID1412991(1 3 4 4))>
0.007209* | <WfInst[Compiled: ADDNODE-CPUTENSOR] : TID1413008 <= op(TID1413008(1 3 4 4) TID1412991(1 3 4 4))>
6.42e-4   | <WfInst[Compiled: VIEWTENSORNODE-T] : TID1413040 <= op(TID1413040(1 1 1 1) TID1413008(1 3 4 4))>
0.002236  | <WfInst[Compiled: MOVETENSORNODE-CPUTENSOR] : TID1413118 <= op(TID1413118(1 1 1 1) TID1413040(1 1 1 1))>
0.001326  | <WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)] : TID1413159 <= op(TID1413159(1 1 1 1) TID1413118(1 1 1 1))>
5.97e-4   | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413050 <= op(TID1413050(1) <Input>TID1413045(1))>
3.3e-4    | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413056 <= op(TID1413056(1) <Input>TID1413047(1))>
8.02e-4   | <WfInst[Compiled: SCALARANDSCALARMUL-SCALARTENSOR] : TID1413050 <= op(TID1413050(1) TID1413056(1))>
3.71e-4   | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413067 <= op(TID1413067(1) TID1413050(1))>
2.91e-4   | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413073 <= op(TID1413073(1) <Input>TID1413064(1))>
4.55e-4   | <WfInst[Compiled: SCALARANDSCALARMUL-SCALARTENSOR] : TID1413067 <= op(TID1413067(1) TID1413073(1))>
3.34e-4   | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413084 <= op(TID1413084(1) TID1413067(1))>
3.04e-4   | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413090 <= op(TID1413090(1) <Input>TID1413081(1))>
3.81e-4   | <WfInst[Compiled: SCALARANDSCALARMUL-SCALARTENSOR] : TID1413084 <= op(TID1413084(1) TID1413090(1))>
3.15e-4   | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413101 <= op(TID1413101(1) TID1413084(1))>
3.04e-4   | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413107 <= op(TID1413107(1) <Input>TID1413098(1))>
3.88e-4   | <WfInst[Compiled: SCALARANDSCALARMUL-SCALARTENSOR] : TID1413101 <= op(TID1413101(1) TID1413107(1))>
3.31e-4   | <WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR] : TID1413151 <= op(TID1413151(1) TID1413101(1))>
0.002199  | <WfInst[Compiled: SCALARDIV-CPUTENSOR] : TID1413159 <= op(TID1413159(1 1 1 1) TID1413151(1))>

35 Instructions | 36 Tensors

 Total Time: 0.22607897 sec

 Instruction                                          | Total time (s) | Time/Total (n-sample=1000)
<WfInst[Compiled: MOVETENSORNODE-CPUTENSOR]           | 0.085865       | 37.98009%
<WfInst[Compiled: IM2COLNODE-LISPTENSOR]              | 0.08184        | 36.19974%
<WfInst[Compiled: MAXVALUE-NODE-CPUTENSOR]            | 0.019396       | 8.579303%
<WfInst[Compiled: PERMUTE-NODE-T]                     | 0.010325       | 4.566988%
<WfInst[Compiled: MoveTensorNode(SAVE_FOR_BACKWARD)]  | 0.008566       | 3.7889414%
<WfInst[Compiled: ADDNODE-CPUTENSOR]                  | 0.007209       | 3.1887088%
<WfInst[Compiled: MOVESCALARTENSORNODE-SCALARTENSOR]  | 0.003177       | 1.4052612%
<WfInst[Compiled: RESHAPETENSORNODE-T]                | 0.002511       | 1.1106739%
<WfInst[Compiled: SCALARDIV-CPUTENSOR]                | 0.002199       | 0.97266895%
<WfInst[Compiled: SCALARANDSCALARMUL-SCALARTENSOR]    | 0.002026       | 0.8961471%

{CPUTENSOR[float] :shape (1 1 1 1) -> :view (<(BROADCAST 1)> <(BROADCAST 1)> <(BROADCAST 1)> <(BROADCAST 1)>) -> :visible-shape (1 1 1 1) :named ChainTMP1413156
  ((((1.7790796))))
  :facet :input
  :requires-grad NIL
  :backward NIL}
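To make the summary table easier to read, here is a small sketch of how such a table can be derived from the per-instruction rows. It is written in Python purely for illustration and is not cl-waffe2's implementation; the names `ROWS`, `summarize`, and `above_average` are hypothetical. It assumes, as the output header suggests, that `*` marks a row whose time exceeds the mean per-instruction time (in the run above, 0.22607897 s / 35 instructions, roughly 0.0065 s, which matches the four starred rows). The rows are a hand-copied subset of the output above, so the percentages and flags it produces differ from the full table.

```python
from collections import defaultdict

# (instruction kind, seconds): a hand-copied SUBSET of the per-instruction
# rows printed above, so totals and shares differ from the full run.
ROWS = [
    ("IM2COLNODE-LISPTENSOR", 0.08184),
    ("PERMUTE-NODE-T", 0.004949),
    ("MOVETENSORNODE-CPUTENSOR", 0.083629),
    ("PERMUTE-NODE-T", 0.005376),
    ("MOVETENSORNODE-CPUTENSOR", 0.002236),
    ("MAXVALUE-NODE-CPUTENSOR", 0.019396),
]


def summarize(rows):
    """Sum time per instruction kind.

    Returns (kind, total_seconds, percent_of_total) tuples,
    sorted by total time in descending order.
    """
    totals = defaultdict(float)
    for kind, seconds in rows:
        totals[kind] += seconds
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(kind, t, 100.0 * t / grand_total) for kind, t in ranked]


def above_average(rows):
    """Flag each row whose time exceeds the mean time per row.

    Assumption: this mirrors what the '*' marker in the output means.
    """
    mean = sum(seconds for _, seconds in rows) / len(rows)
    return [(kind, seconds, seconds > mean) for kind, seconds in rows]


if __name__ == "__main__":
    for kind, total, pct in summarize(ROWS):
        print(f"{kind:<28} | {total:.6f} | {pct:.2f}%")
```

Grouping by instruction kind is what surfaces MOVETENSORNODE-CPUTENSOR and IM2COLNODE-LISPTENSOR as the dominant costs, even though each appears only once or twice in the instruction list.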
[Fix] apply-in-place-mutation!

Disabled when *no-grad*=t, but the condition for deleting a MoveTensorNode has been changed, so it can now delete more unused nodes.
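As a rough illustration of what deleting unused nodes means, the sketch below prunes instructions whose result is never read, using a backward liveness sweep over a linear instruction list. It is a generic technique written in Python for illustration, not the actual logic of apply-in-place-mutation!; the instruction tuples, the names `INSTRS` and `prune_unused`, and the assumption that instructions have no side effects other than writing their output are all hypothetical.

```python
# Each instruction is (output, op, inputs). The names are made up for this
# sketch and do not correspond to real cl-waffe2 tensors or nodes.
INSTRS = [
    ("a", "Conv", ["x"]),
    ("a_copy", "Move", ["a"]),  # copy that nothing reads afterwards
    ("b", "Move", ["a"]),
    ("y", "Mean", ["b"]),
]


def prune_unused(instructions, outputs):
    """Drop instructions whose output is never needed.

    Walks the list backwards, tracking which tensors are still needed
    ("live"). Assumes an instruction only matters through its output.
    """
    live = set(outputs)
    kept = []
    for out, op, inputs in reversed(instructions):
        if out in live:
            # This instruction produces something needed later:
            # keep it, and mark its inputs as needed in turn.
            live.discard(out)
            live.update(inputs)
            kept.append((out, op, inputs))
        # Otherwise the result is never read, so the instruction is dropped.
    kept.reverse()
    return kept


if __name__ == "__main__":
    for out, op, inputs in prune_unused(INSTRS, {"y"}):
        print(f"{out} <= {op}({', '.join(inputs)})")
```

In this toy list only the `a_copy` move is removed, since every other instruction feeds the requested output `y`.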