Closed: johndpope closed this issue 1 month ago
Thanks to @cli99, I get this view of computational complexity. See the full txt file.
There's a profiling branch: https://github.com/johndpope/MegaPortrait-hack/tree/feat/26-auditflops
I keep digging, but at first glance the Conv2d layers from the ResNets look like the big factor. Using depthwise-separable convolutions in the base ResNet models could plausibly give something like a 9x speedup in training / inference times. https://www.reddit.com/r/StableDiffusion/comments/1bh970h/claude_3_thinks_4_lines_of_code_changes_will/
Adding novel architectures on top of this is probably just lipstick on a pig.
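For a rough sanity check on that claimed speedup, here is the standard MAC arithmetic for swapping a dense KxK convolution for a depthwise-separable one (KxK depthwise + 1x1 pointwise). The layer shape below is hypothetical, just a representative ResNet-style 3x3 block; the ratio works out to roughly K^2 for wide layers, which is where the ~9x figure for 3x3 convs comes from. Actual wall-clock gains will be smaller, since depthwise convs tend to be memory-bound on GPU.

```python
# Back-of-envelope MAC comparison: standard KxK conv vs. a
# depthwise-separable replacement (KxK depthwise + 1x1 pointwise).
# Pure arithmetic, no framework needed; shapes are illustrative only.

def standard_conv_macs(cin, cout, k, h, w):
    # each output pixel needs cin * k * k MACs per output channel
    return cin * cout * k * k * h * w

def depthwise_separable_macs(cin, cout, k, h, w):
    depthwise = cin * k * k * h * w   # one KxK filter per input channel
    pointwise = cin * cout * h * w    # 1x1 conv to mix channels
    return depthwise + pointwise

# a representative 256 -> 256 3x3 layer at 64x64 resolution (hypothetical)
std = standard_conv_macs(256, 256, 3, 64, 64)
dws = depthwise_separable_macs(256, 256, 3, 64, 64)
print(f"standard: {std/1e9:.2f} GMACs, separable: {dws/1e9:.2f} GMACs, "
      f"ratio: {std/dws:.1f}x")
# -> standard: 2.42 GMACs, separable: 0.28 GMACs, ratio: 8.7x
```

In PyTorch the depthwise half is just `nn.Conv2d(cin, cin, k, groups=cin)` followed by `nn.Conv2d(cin, cout, 1)`.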
Top 1 modules in terms of params, flops, MACs or duration at different model depths:
depth 0: params - {'Gbase': '149.33 M'} flops - {'Gbase': '7079.63 G'} MACs - {'Gbase': '3532.31 GMACs'} fwd latency - {'Gbase': '260.03 ms'}
depth 1: params - {'G3d': '48.56 M'} flops - {'Eapp': '2707.29 G'} MACs - {'Eapp': '1351.3 GMACs'} fwd latency - {'Emtn': '171.0 ms'}
depth 2: params - {'Sequential': '99.76 M'} flops - {'Sequential': '2606.6 G'} MACs - {'Sequential': '1300.07 GMACs'} fwd latency - {'Sequential': '113.98 ms'}
depth 3: params - {'ResBlock3D': '48.32 M'} flops - {'Sequential': '1521.42 G'} MACs - {'Sequential': '758.2 GMACs'} fwd latency - {'Sequential': '117.84 ms'}
depth 4: params - {'Conv3d': '62.76 M'} flops - {'Conv2d': '1449.83 G'} MACs - {'Conv2d': '724.78 GMACs'} fwd latency - {'BasicBlock': '93.86 ms'}
depth 5: params - {'Conv2d': '29.97 M'} flops - {'Conv2d': '1530.98 G'} MACs - {'Conv2d': '765.44 GMACs'} fwd latency - {'Conv2d': '52.09 ms'}
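Quick arithmetic on the Gbase summary above, recomputed by hand from the printed totals: the implied device throughput is just flops divided by forward latency, and the depth-5 Conv2d entry alone accounts for roughly a fifth of both flops and latency. Note the table only shows the top module per depth, so this is a lower bound on total conv cost.

```python
# Recompute headline figures from the Gbase profile summary above.
total_flops = 7079.63e9    # Gbase fwd flops
fwd_latency_s = 260.03e-3  # Gbase fwd latency

# device throughput implied by the profile: flops / latency
tflops = total_flops / fwd_latency_s / 1e12
print(f"fwd throughput: {tflops:.2f} TFLOPS")  # profiler prints 27.23 TFLOPS

# share of the budget attributed to Conv2d at depth 5 (lower bound)
conv2d_flops = 1530.98e9
conv2d_latency_s = 52.09e-3
print(f"Conv2d flops share:   {conv2d_flops / total_flops:.1%}")    # ~21.6%
print(f"Conv2d latency share: {conv2d_latency_s / fwd_latency_s:.1%}")  # ~20.0%
```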
Loading processed tensors from file: junk/-2KGPYEFnsU_11/-2KGPYEFnsU_11_tensors.npz
Loading processed tensors from file: junk/-2KGPYEFnsU_8/-2KGPYEFnsU_8_tensors.npz
Weights already downloaded. Skipping download.
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
Loading model from: /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/lpips/weights/v0.1/vgg.pth
Epoch: 0
Loading processed tensors from file: junk/M2Ohb0FAaJU_1/M2Ohb0FAaJU_1_tensors.npz
Loading processed tensors from file: junk/-1eKufUP5XQ_4/-1eKufUP5XQ_4_tensors.npz

-------------------------- Flops Profiler --------------------------
Profile on Device: cuda:0
Profile Summary at step 5:
Notations: data parallel size (dp_size), model parallel size (mp_size), number of parameters (params), number of multiply-accumulate operations (MACs), number of floating-point operations (flops), floating-point operations per second (FLOPS), fwd latency (forward propagation latency), bwd latency (backward propagation latency), step (weights update latency), iter latency (sum of fwd, bwd and step latency)

params per device: 2.77 M
params of model = params per device * mp_size: 2.77 M
fwd MACs per device: 54.83 GMACs
fwd flops per device: 109.9 G
fwd flops of model = fwd flops per device * mp_size: 109.9 G
fwd latency: 7.16 ms
fwd FLOPS per device = fwd flops per device / fwd latency: 15.35 TFLOPS

----------------------------- Aggregated Profile per Device -----------------------------
Top 1 modules in terms of params, flops, MACs or duration at different model depths:
depth 0: params - {'Discriminator': '2.77 M'} flops - {'Discriminator': '109.9 G'} MACs - {'Discriminator': '54.83 GMACs'} fwd latency - {'Discriminator': '7.16 ms'}
depth 1: params - {'Sequential': '2.77 M'} flops - {'Sequential': '109.9 G'} MACs - {'Sequential': '54.83 GMACs'} fwd latency - {'Sequential': '6.92 ms'}

------------------------------ Detailed Profile per Device ------------------------------
Each module profile is listed after its name in the following order: params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS
Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs (or latency) and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch.

Discriminator(
  module = {'param': '2.77 M', 'flops': '109.9 G', 'macs': '54.83 GMACs', 'duration': '7.16 ms', 'FLOPS': '15.35 TFLOPS', 'params%': '100.00%', 'flops%': '100.00%', 'macs%': '100.00%', 'duration%': '100.00%'},
  functionals = {'conv2d': {'flops': '54.86 G', 'macs': '27.41 GMACs', 'duration': '2.95 ms', 'FLOPS': '18.58 TFLOPS', 'flops%': '49.92%', 'macs%': '50.00%', 'duration%/allfuncs': '34.02%', 'duration%/e2e': '41.24%'}, 'newFunc': {'flops': '54.95 G', 'macs': '27.41 GMACs', 'duration': '4.84 ms', 'FLOPS': '11.34 TFLOPS', 'flops%': '50.00%', 'macs%': '50.00%', 'duration%/allfuncs': '55.82%', 'duration%/e2e': '67.68%'}, 'leaky_relu': {'flops': '31.46 M', 'macs': '0 MACs', 'duration': '334.85 us', 'FLOPS': '93.94 GFLOPS', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%/allfuncs': '3.86%', 'duration%/e2e': '4.68%'}, 'instance_norm': {'flops': '58.72 M', 'macs': '0 MACs', 'duration': '546.82 us', 'FLOPS': '107.39 GFLOPS', 'flops%': '0.05%', 'macs%': '0.00%', 'duration%/allfuncs': '6.30%', 'duration%/e2e': '7.64%'}},
  functionals_duration = 8.68 ms,
  (model): Sequential(
    module = {'param': '2.77 M', 'flops': '109.9 G', 'macs': '54.83 GMACs', 'duration': '6.92 ms', 'FLOPS': '15.88 TFLOPS', 'params%': '100.00%', 'flops%': '100.00%', 'macs%': '100.00%', 'duration%': '96.68%'},
    functionals = {'conv2d': {'flops': '54.86 G', 'macs': '27.41 GMACs', 'duration': '2.95 ms', 'FLOPS': '18.58 TFLOPS', 'flops%': '49.92%', 'macs%': '50.00%', 'duration%/allfuncs': '34.02%', 'duration%/e2e': '41.24%'}, 'newFunc': {'flops': '54.95 G', 'macs': '27.41 GMACs', 'duration': '4.84 ms', 'FLOPS': '11.34 TFLOPS', 'flops%': '50.00%', 'macs%': '50.00%', 'duration%/allfuncs': '55.82%', 'duration%/e2e': '67.68%'}, 'leaky_relu': {'flops': '31.46 M', 'macs': '0 MACs', 'duration': '334.85 us', 'FLOPS': '93.94 GFLOPS', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%/allfuncs': '3.86%', 'duration%/e2e': '4.68%'}, 'instance_norm': {'flops': '58.72 M', 'macs': '0 MACs', 'duration': '546.82 us', 'FLOPS': '107.39 GFLOPS', 'flops%': '0.05%', 'macs%': '0.00%', 'duration%/allfuncs': '6.30%', 'duration%/e2e': '7.64%'}},
    functionals_duration = 8.68 ms,
    (0): Conv2d(module = {'param': '6.21 k', 'flops': '6.48 G', 'macs': '3.22 GMACs', 'duration': '826.84 us', 'FLOPS': '7.83 TFLOPS', 'params%': '0.22%', 'flops%': '5.89%', 'macs%': '5.88%', 'duration%': '11.55%'}, functionals = {'conv2d': {'flops': '3.24 G', 'macs': '1.61 GMACs', 'duration': '551.94 us', 'FLOPS': '5.87 TFLOPS', 'flops%': '2.95%', 'macs%': '2.94%', 'duration%/allfuncs': '6.36%', 'duration%/e2e': '7.71%'}, 'newFunc': {'flops': '3.24 G', 'macs': '1.61 GMACs', 'duration': '651.26 us', 'FLOPS': '4.97 TFLOPS', 'flops%': '2.95%', 'macs%': '2.94%', 'duration%/allfuncs': '7.50%', 'duration%/e2e': '9.10%'}}, functionals_duration = 1.2 ms, 6, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): LeakyReLU(module = {'param': '0', 'flops': '33.55 M', 'macs': '0 MACs', 'duration': '315.43 us', 'FLOPS': '106.38 GFLOPS', 'params%': '0.00%', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%': '4.41%'}, functionals = {'leaky_relu': {'flops': '16.78 M', 'macs': '0 MACs', 'duration': '134.14 us', 'FLOPS': '125.07 GFLOPS', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%/allfuncs': '1.55%', 'duration%/e2e': '1.87%'}, 'newFunc': {'flops': '16.78 M', 'macs': '0 MACs', 'duration': '214.02 us', 'FLOPS': '78.39 GFLOPS', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%/allfuncs': '2.47%', 'duration%/e2e': '2.99%'}}, functionals_duration = 348.16 us, negative_slope=0.2, inplace=True)
    (2): Conv2d(module = {'param': '131.2 k', 'flops': '34.38 G', 'macs': '17.18 GMACs', 'duration': '800.13 us', 'FLOPS': '42.96 TFLOPS', 'params%': '4.74%', 'flops%': '31.28%', 'macs%': '31.33%', 'duration%': '11.18%'}, functionals = {'conv2d': {'flops': '17.19 G', 'macs': '8.59 GMACs', 'duration': '587.78 us', 'FLOPS': '29.24 TFLOPS', 'flops%': '15.64%', 'macs%': '15.67%', 'duration%/allfuncs': '6.77%', 'duration%/e2e': '8.21%'}, 'newFunc': {'flops': '17.19 G', 'macs': '8.59 GMACs', 'duration': '675.84 us', 'FLOPS': '25.43 TFLOPS', 'flops%': '15.64%', 'macs%': '15.67%', 'duration%/allfuncs': '7.79%', 'duration%/e2e': '9.44%'}}, functionals_duration = 1.26 ms, 64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (3): InstanceNorm2d(module = {'param': '0', 'flops': '67.11 M', 'macs': '0 MACs', 'duration': '468.49 us', 'FLOPS': '143.24 GFLOPS', 'params%': '0.00%', 'flops%': '0.06%', 'macs%': '0.00%', 'duration%': '6.55%'}, functionals = {'instance_norm': {'flops': '33.55 M', 'macs': '0 MACs', 'duration': '244.74 us', 'FLOPS': '137.1 GFLOPS', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%/allfuncs': '2.82%', 'duration%/e2e': '3.42%'}, 'newFunc': {'flops': '33.55 M', 'macs': '0 MACs', 'duration': '334.85 us', 'FLOPS': '100.21 GFLOPS', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%/allfuncs': '3.86%', 'duration%/e2e': '4.68%'}}, functionals_duration = 579.58 us, 128, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
    (4): LeakyReLU(module = {'param': '0', 'flops': '16.78 M', 'macs': '0 MACs', 'duration': '239.61 us', 'FLOPS': '70.02 GFLOPS', 'params%': '0.00%', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%': '3.35%'}, functionals = {'leaky_relu': {'flops': '8.39 M', 'macs': '0 MACs', 'duration': '74.75 us', 'FLOPS': '112.22 GFLOPS', 'flops%': '0.01%', 'macs%': '0.00%', 'duration%/allfuncs': '0.86%', 'duration%/e2e': '1.04%'}, 'newFunc': {'flops': '8.39 M', 'macs': '0 MACs', 'duration': '148.48 us', 'FLOPS': '56.5 GFLOPS', 'flops%': '0.01%', 'macs%': '0.00%', 'duration%/allfuncs': '1.71%', 'duration%/e2e': '2.07%'}}, functionals_duration = 223.23 us, negative_slope=0.2, inplace=True)
    (5): Conv2d(module = {'param': '524.54 k', 'flops': '34.37 G', 'macs': '17.18 GMACs', 'duration': '713.83 us', 'FLOPS': '48.15 TFLOPS', 'params%': '18.95%', 'flops%': '31.27%', 'macs%': '31.33%', 'duration%': '9.97%'}, functionals = {'conv2d': {'flops': '17.18 G', 'macs': '8.59 GMACs', 'duration': '490.5 us', 'FLOPS': '35.03 TFLOPS', 'flops%': '15.64%', 'macs%': '15.67%', 'duration%/allfuncs': '5.65%', 'duration%/e2e': '6.85%'}, 'newFunc': {'flops': '17.18 G', 'macs': '8.59 GMACs', 'duration': '586.75 us', 'FLOPS': '29.29 TFLOPS', 'flops%': '15.64%', 'macs%': '15.67%', 'duration%/allfuncs': '6.76%', 'duration%/e2e': '8.20%'}}, functionals_duration = 1.08 ms, 128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (6): InstanceNorm2d(module = {'param': '0', 'flops': '33.55 M', 'macs': '0 MACs', 'duration': '348.09 us', 'FLOPS': '96.4 GFLOPS', 'params%': '0.00%', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%': '4.86%'}, functionals = {'instance_norm': {'flops': '16.78 M', 'macs': '0 MACs', 'duration': '152.58 us', 'FLOPS': '109.96 GFLOPS', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%/allfuncs': '1.76%', 'duration%/e2e': '2.13%'}, 'newFunc': {'flops': '16.78 M', 'macs': '0 MACs', 'duration': '229.38 us', 'FLOPS': '73.14 GFLOPS', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%/allfuncs': '2.64%', 'duration%/e2e': '3.20%'}}, functionals_duration = 381.95 us, 256, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
    (7): LeakyReLU(module = {'param': '0', 'flops': '8.39 M', 'macs': '0 MACs', 'duration': '226.97 us', 'FLOPS': '36.96 GFLOPS', 'params%': '0.00%', 'flops%': '0.01%', 'macs%': '0.00%', 'duration%': '3.17%'}, functionals = {'leaky_relu': {'flops': '4.19 M', 'macs': '0 MACs', 'duration': '61.44 us', 'FLOPS': '68.27 GFLOPS', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%/allfuncs': '0.71%', 'duration%/e2e': '0.86%'}, 'newFunc': {'flops': '4.19 M', 'macs': '0 MACs', 'duration': '134.14 us', 'FLOPS': '31.27 GFLOPS', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%/allfuncs': '1.55%', 'duration%/e2e': '1.87%'}}, functionals_duration = 195.58 us, negative_slope=0.2, inplace=True)
    (8): Conv2d(module = {'param': '2.1 M', 'flops': '34.36 G', 'macs': '17.18 GMACs', 'duration': '762.46 us', 'FLOPS': '45.07 TFLOPS', 'params%': '75.79%', 'flops%': '31.27%', 'macs%': '31.33%', 'duration%': '10.65%'}, functionals = {'conv2d': {'flops': '17.18 G', 'macs': '8.59 GMACs', 'duration': '519.17 us', 'FLOPS': '33.1 TFLOPS', 'flops%': '15.63%', 'macs%': '15.67%', 'duration%/allfuncs': '5.98%', 'duration%/e2e': '7.25%'}, 'newFunc': {'flops': '17.18 G', 'macs': '8.59 GMACs', 'duration': '612.35 us', 'FLOPS': '28.06 TFLOPS', 'flops%': '15.63%', 'macs%': '15.67%', 'duration%/allfuncs': '7.06%', 'duration%/e2e': '8.56%'}}, functionals_duration = 1.13 ms, 256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (9): InstanceNorm2d(module = {'param': '0', 'flops': '16.78 M', 'macs': '0 MACs', 'duration': '341.18 us', 'FLOPS': '49.17 GFLOPS', 'params%': '0.00%', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%': '4.77%'}, functionals = {'instance_norm': {'flops': '8.39 M', 'macs': '0 MACs', 'duration': '149.5 us', 'FLOPS': '56.11 GFLOPS', 'flops%': '0.01%', 'macs%': '0.00%', 'duration%/allfuncs': '1.72%', 'duration%/e2e': '2.09%'}, 'newFunc': {'flops': '8.39 M', 'macs': '0 MACs', 'duration': '223.23 us', 'FLOPS': '37.58 GFLOPS', 'flops%': '0.01%', 'macs%': '0.00%', 'duration%/allfuncs': '2.57%', 'duration%/e2e': '3.12%'}}, functionals_duration = 372.74 us, 512, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
    (10): LeakyReLU(module = {'param': '0', 'flops': '4.19 M', 'macs': '0 MACs', 'duration': '234.37 us', 'FLOPS': '17.9 GFLOPS', 'params%': '0.00%', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%': '3.27%'}, functionals = {'leaky_relu': {'flops': '2.1 M', 'macs': '0 MACs', 'duration': '64.51 us', 'FLOPS': '32.51 GFLOPS', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%/allfuncs': '0.74%', 'duration%/e2e': '0.90%'}, 'newFunc': {'flops': '2.1 M', 'macs': '0 MACs', 'duration': '138.24 us', 'FLOPS': '15.17 GFLOPS', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%/allfuncs': '1.59%', 'duration%/e2e': '1.93%'}}, functionals_duration = 202.75 us, negative_slope=0.2, inplace=True)
    (11): ZeroPad2d(module = {'param': '0', 'flops': '0', 'macs': '0 MACs', 'duration': '180.72 us', 'FLOPS': '0.0 FLOPS', 'params%': '0.00%', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%': '2.52%'}, functionals = {}, functionals_duration = 0.0, (1, 0, 1, 0))
    (12): Conv2d(module = {'param': '8.19 k', 'flops': '134.22 M', 'macs': '67.11 MMACs', 'duration': '1.03 ms', 'FLOPS': '129.95 GFLOPS', 'params%': '0.30%', 'flops%': '0.12%', 'macs%': '0.12%', 'duration%': '14.43%'}, functionals = {'conv2d': {'flops': '67.11 M', 'macs': '33.55 MMACs', 'duration': '802.82 us', 'FLOPS': '83.59 GFLOPS', 'flops%': '0.06%', 'macs%': '0.06%', 'duration%/allfuncs': '9.25%', 'duration%/e2e': '11.22%'}, 'newFunc': {'flops': '67.11 M', 'macs': '33.55 MMACs', 'duration': '896.0 us', 'FLOPS': '74.9 GFLOPS', 'flops%': '0.06%', 'macs%': '0.06%', 'duration%/allfuncs': '10.32%', 'duration%/e2e': '12.52%'}}, functionals_duration = 1.7 ms, 512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1), bias=False)
  )
)
------------------------------------------------------------------------------
Step 5: FLOPS - 109.9 G, MACs - 54.83 GMACs, Params - 2.77 M

-------------------------- Flops Profiler --------------------------
Profile on Device: cuda:0
Profile Summary at step 5:
params per device: 149.33 M
params of model = params per device * mp_size: 149.33 M
fwd MACs per device: 3532.31 GMACs
fwd flops per device: 7079.63 G
fwd flops of model = fwd flops per device * mp_size: 7079.63 G
fwd latency: 260.03 ms
fwd FLOPS per device = fwd flops per device / fwd latency: 27.23 TFLOPS

----------------------------- Aggregated Profile per Device -----------------------------
Top 1 modules in terms of params, flops, MACs or duration at different model depths:
depth 0: params - {'Gbase': '149.33 M'} flops - {'Gbase': '7079.63 G'} MACs - {'Gbase': '3532.31 GMACs'} fwd latency - {'Gbase': '260.03 ms'}
depth 1: params - {'G3d': '48.56 M'} flops - {'Eapp': '2707.29 G'} MACs - {'Eapp': '1351.3 GMACs'} fwd latency - {'Emtn': '171.0 ms'}
depth 2: params - {'Sequential': '99.76 M'} flops - {'Sequential': '2606.6 G'} MACs - {'Sequential': '1300.07 GMACs'} fwd latency - {'Sequential': '113.98 ms'}
depth 3: params - {'ResBlock3D': '48.32 M'} flops - {'Sequential': '1521.42 G'} MACs - {'Sequential': '758.2 GMACs'} fwd latency - {'Sequential': '117.84 ms'}
depth 4: params - {'Conv3d': '62.76 M'} flops - {'Conv2d': '1449.83 G'} MACs - {'Conv2d': '724.78 GMACs'} fwd latency - {'BasicBlock': '93.86 ms'}
depth 5: params - {'Conv2d': '29.97 M'} flops - {'Conv2d': '1530.98 G'} MACs - {'Conv2d': '765.44 GMACs'} fwd latency - {'Conv2d': '52.09 ms'}
------------------------------------------------------------------------------
Step 5: FLOPS - 109.9 G, MACs - 54.83 GMACs, Params - 2.77 M
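As a cross-check on the profiler, the discriminator's conv2d MACs can be reproduced by hand from the layer shapes in the detailed profile. The output spatial sizes below are an assumption, not something the profiler prints: they are whatever makes the first conv's 1.61 GMACs work out (512x512 maps after the first stride-2 conv), with each later stride-2 conv halving the resolution.

```python
# Hand-computed MACs for the discriminator's five Conv2d layers,
# with shapes taken from the detailed profile. Spatial sizes are
# inferred (assumed), not printed by the profiler.
def conv2d_macs(cin, cout, k, hout, wout):
    # each output pixel needs cin * k * k MACs per output channel
    return cin * cout * k * k * hout * wout

layers = [
    (6,   64,  4, 512, 512),  # (0): Conv2d 6 -> 64   -> 1.61 GMACs
    (64,  128, 4, 256, 256),  # (2): Conv2d 64 -> 128 -> 8.59 GMACs
    (128, 256, 4, 128, 128),  # (5): Conv2d 128 -> 256 -> 8.59 GMACs
    (256, 512, 4, 64,  64),   # (8): Conv2d 256 -> 512 -> 8.59 GMACs
    (512, 1,   4, 64,  64),   # (12): Conv2d 512 -> 1  -> 33.55 MMACs
]
total = sum(conv2d_macs(*l) for l in layers)
print(f"{total/1e9:.2f} GMACs")  # matches the profiler's 27.41 GMACs for conv2d
```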
out.txt