johndpope / MegaPortrait-hack

Using Claude Opus to reverse engineer code from MegaPortraits: One-shot Megapixel Neural Head Avatars
https://arxiv.org/abs/2207.07621

Audit Flops - generator / discriminator #39

Closed johndpope closed 1 month ago

johndpope commented 3 months ago

Thanks to @cli99, I now get this view of the computational complexity - see the full txt file (out.txt).

There's a profiling branch: https://github.com/johndpope/MegaPortrait-hack/tree/feat/26-auditflops

I'll keep digging, but at first glance the Conv2d layers from the ResNets seem to be a big factor. Using depthwise convolutions in the base ResNet models would probably give something like a 9x speed-up in training / inference times. https://www.reddit.com/r/StableDiffusion/comments/1bh970h/claude_3_thinks_4_lines_of_code_changes_will/

Adding novel architectures on top of that is probably just lipstick on a pig.
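For context on where a figure like 9x can come from, here is a rough sketch that counts multiply-accumulates (MACs) for a standard 3x3 convolution versus a depthwise separable one (depthwise 3x3 followed by a 1x1 pointwise conv). The channel counts and feature-map size are made-up illustration values, not taken from this repo, and the function names are hypothetical. It only counts arithmetic; depthwise convs are often memory-bound on GPUs, so the wall-clock gain may be well below the MAC ratio.

```python
def standard_conv_macs(c_in, c_out, k, out_h, out_w):
    """MACs for a dense k x k convolution (bias ignored)."""
    return c_in * c_out * k * k * out_h * out_w


def depthwise_separable_macs(c_in, c_out, k, out_h, out_w):
    """MACs for a depthwise k x k conv followed by a 1x1 pointwise conv."""
    depthwise = c_in * k * k * out_h * out_w   # one k x k filter per input channel
    pointwise = c_in * c_out * out_h * out_w   # 1x1 conv mixes the channels
    return depthwise + pointwise


# Illustrative values only (assumption): 256 -> 256 channels on a 64x64 map.
c_in, c_out, k, h, w = 256, 256, 3, 64, 64
dense = standard_conv_macs(c_in, c_out, k, h, w)
separable = depthwise_separable_macs(c_in, c_out, k, h, w)
ratio = dense / separable   # equals c_out * k^2 / (k^2 + c_out)

print(f"dense: {dense / 1e9:.2f} GMACs, separable: {separable / 1e9:.2f} GMACs")
print(f"MAC reduction: {ratio:.2f}x")
```

The ratio works out to `c_out * k^2 / (k^2 + c_out)`, so for 3x3 kernels it approaches 9 as the channel count grows but never quite reaches it. Swapping the layers also changes the model's capacity, so quality would need re-checking.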

Top 1 modules in terms of params, flops, MACs or duration at different model depths:
depth 0:
    params      - {'Gbase': '149.33 M'}
    flops       - {'Gbase': '7079.63 G'}
    MACs        - {'Gbase': '3532.31 GMACs'}
    fwd latency - {'Gbase': '260.03 ms'}
depth 1:
    params      - {'G3d': '48.56 M'}
    flops       - {'Eapp': '2707.29 G'}
    MACs        - {'Eapp': '1351.3 GMACs'}
    fwd latency - {'Emtn': '171.0 ms'}
depth 2:
    params      - {'Sequential': '99.76 M'}
    flops       - {'Sequential': '2606.6 G'}
    MACs        - {'Sequential': '1300.07 GMACs'}
    fwd latency - {'Sequential': '113.98 ms'}
depth 3:
    params      - {'ResBlock3D': '48.32 M'}
    flops       - {'Sequential': '1521.42 G'}
    MACs        - {'Sequential': '758.2 GMACs'}
    fwd latency - {'Sequential': '117.84 ms'}
depth 4:
    params      - {'Conv3d': '62.76 M'}
    flops       - {'Conv2d': '1449.83 G'}
    MACs        - {'Conv2d': '724.78 GMACs'}
    fwd latency - {'BasicBlock': '93.86 ms'}
depth 5:
    params      - {'Conv2d': '29.97 M'}
    flops       - {'Conv2d': '1530.98 G'}
    MACs        - {'Conv2d': '765.44 GMACs'}
    fwd latency - {'Conv2d': '52.09 ms'}

Loading processed tensors from file: junk/-2KGPYEFnsU_11/-2KGPYEFnsU_11_tensors.npz
Loading processed tensors from file: junk/-2KGPYEFnsU_8/-2KGPYEFnsU_8_tensors.npz
Weights already downloaded. Skipping download.
Setting up [LPIPS] perceptual loss: trunk [vgg], v[0.1], spatial [off]
Loading model from: /home/oem/miniconda3/envs/comfyui/lib/python3.11/site-packages/lpips/weights/v0.1/vgg.pth
Epoch: 0
Loading processed tensors from file: junk/M2Ohb0FAaJU_1/M2Ohb0FAaJU_1_tensors.npz
Loading processed tensors from file: junk/-1eKufUP5XQ_4/-1eKufUP5XQ_4_tensors.npz

-------------------------- Flops Profiler --------------------------
Profile on Device: cuda:0
Profile Summary at step 5:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

params per device:                                            2.77 M  
params of model = params per device * mp_size:                2.77 M  
fwd MACs per device:                                          54.83 GMACs
fwd flops per device:                                         109.9 G 
fwd flops of model = fwd flops per device * mp_size:          109.9 G 
fwd latency:                                                  7.16 ms 
fwd FLOPS per device = fwd flops per device / fwd latency:    15.35 TFLOPS

----------------------------- Aggregated Profile per Device -----------------------------
Top 1 modules in terms of params, flops, MACs or duration at different model depths:
depth 0:
    params      - {'Discriminator': '2.77 M'}
    flops       - {'Discriminator': '109.9 G'}
    MACs        - {'Discriminator': '54.83 GMACs'}
    fwd latency - {'Discriminator': '7.16 ms'}
depth 1:
    params      - {'Sequential': '2.77 M'}
    flops       - {'Sequential': '109.9 G'}
    MACs        - {'Sequential': '54.83 GMACs'}
    fwd latency - {'Sequential': '6.92 ms'}

------------------------------ Detailed Profile per Device ------------------------------
Each module profile is listed after its name in the following order: 
params, percentage of total params, MACs, percentage of total MACs, fwd latency, percentage of total fwd latency, fwd FLOPS

Note: 1. A module can have torch.nn.module or torch.nn.functional to compute logits (e.g. CrossEntropyLoss). They are not counted as submodules, thus not to be printed out. However they make up the difference between a parent's MACs (or latency) and the sum of its submodules'.
2. Number of floating-point operations is a theoretical estimation, thus FLOPS computed using that could be larger than the maximum system throughput.
3. The fwd latency listed in the top module's profile is directly captured at the module forward function in PyTorch.

Discriminator(
  module = {'param': '2.77 M', 'flops': '109.9 G', 'macs': '54.83 GMACs', 'duration': '7.16 ms', 'FLOPS': '15.35 TFLOPS', 'params%': '100.00%', 'flops%': '100.00%', 'macs%': '100.00%', 'duration%': '100.00%'}, functionals = {'conv2d': {'flops': '54.86 G', 'macs': '27.41 GMACs', 'duration': '2.95 ms', 'FLOPS': '18.58 TFLOPS', 'flops%': '49.92%', 'macs%': '50.00%', 'duration%/allfuncs': '34.02%', 'duration%/e2e': '41.24%'}, 'newFunc': {'flops': '54.95 G', 'macs': '27.41 GMACs', 'duration': '4.84 ms', 'FLOPS': '11.34 TFLOPS', 'flops%': '50.00%', 'macs%': '50.00%', 'duration%/allfuncs': '55.82%', 'duration%/e2e': '67.68%'}, 'leaky_relu': {'flops': '31.46 M', 'macs': '0 MACs', 'duration': '334.85 us', 'FLOPS': '93.94 GFLOPS', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%/allfuncs': '3.86%', 'duration%/e2e': '4.68%'}, 'instance_norm': {'flops': '58.72 M', 'macs': '0 MACs', 'duration': '546.82 us', 'FLOPS': '107.39 GFLOPS', 'flops%': '0.05%', 'macs%': '0.00%', 'duration%/allfuncs': '6.30%', 'duration%/e2e': '7.64%'}}, functionals_duration = 8.68 ms, 
  (model): Sequential(
    module = {'param': '2.77 M', 'flops': '109.9 G', 'macs': '54.83 GMACs', 'duration': '6.92 ms', 'FLOPS': '15.88 TFLOPS', 'params%': '100.00%', 'flops%': '100.00%', 'macs%': '100.00%', 'duration%': '96.68%'}, functionals = {'conv2d': {'flops': '54.86 G', 'macs': '27.41 GMACs', 'duration': '2.95 ms', 'FLOPS': '18.58 TFLOPS', 'flops%': '49.92%', 'macs%': '50.00%', 'duration%/allfuncs': '34.02%', 'duration%/e2e': '41.24%'}, 'newFunc': {'flops': '54.95 G', 'macs': '27.41 GMACs', 'duration': '4.84 ms', 'FLOPS': '11.34 TFLOPS', 'flops%': '50.00%', 'macs%': '50.00%', 'duration%/allfuncs': '55.82%', 'duration%/e2e': '67.68%'}, 'leaky_relu': {'flops': '31.46 M', 'macs': '0 MACs', 'duration': '334.85 us', 'FLOPS': '93.94 GFLOPS', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%/allfuncs': '3.86%', 'duration%/e2e': '4.68%'}, 'instance_norm': {'flops': '58.72 M', 'macs': '0 MACs', 'duration': '546.82 us', 'FLOPS': '107.39 GFLOPS', 'flops%': '0.05%', 'macs%': '0.00%', 'duration%/allfuncs': '6.30%', 'duration%/e2e': '7.64%'}}, functionals_duration = 8.68 ms, 
    (0): Conv2d(module = {'param': '6.21 k', 'flops': '6.48 G', 'macs': '3.22 GMACs', 'duration': '826.84 us', 'FLOPS': '7.83 TFLOPS', 'params%': '0.22%', 'flops%': '5.89%', 'macs%': '5.88%', 'duration%': '11.55%'}, functionals = {'conv2d': {'flops': '3.24 G', 'macs': '1.61 GMACs', 'duration': '551.94 us', 'FLOPS': '5.87 TFLOPS', 'flops%': '2.95%', 'macs%': '2.94%', 'duration%/allfuncs': '6.36%', 'duration%/e2e': '7.71%'}, 'newFunc': {'flops': '3.24 G', 'macs': '1.61 GMACs', 'duration': '651.26 us', 'FLOPS': '4.97 TFLOPS', 'flops%': '2.95%', 'macs%': '2.94%', 'duration%/allfuncs': '7.50%', 'duration%/e2e': '9.10%'}}, functionals_duration = 1.2 ms, 6, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (1): LeakyReLU(module = {'param': '0', 'flops': '33.55 M', 'macs': '0 MACs', 'duration': '315.43 us', 'FLOPS': '106.38 GFLOPS', 'params%': '0.00%', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%': '4.41%'}, functionals = {'leaky_relu': {'flops': '16.78 M', 'macs': '0 MACs', 'duration': '134.14 us', 'FLOPS': '125.07 GFLOPS', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%/allfuncs': '1.55%', 'duration%/e2e': '1.87%'}, 'newFunc': {'flops': '16.78 M', 'macs': '0 MACs', 'duration': '214.02 us', 'FLOPS': '78.39 GFLOPS', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%/allfuncs': '2.47%', 'duration%/e2e': '2.99%'}}, functionals_duration = 348.16 us, negative_slope=0.2, inplace=True)
    (2): Conv2d(module = {'param': '131.2 k', 'flops': '34.38 G', 'macs': '17.18 GMACs', 'duration': '800.13 us', 'FLOPS': '42.96 TFLOPS', 'params%': '4.74%', 'flops%': '31.28%', 'macs%': '31.33%', 'duration%': '11.18%'}, functionals = {'conv2d': {'flops': '17.19 G', 'macs': '8.59 GMACs', 'duration': '587.78 us', 'FLOPS': '29.24 TFLOPS', 'flops%': '15.64%', 'macs%': '15.67%', 'duration%/allfuncs': '6.77%', 'duration%/e2e': '8.21%'}, 'newFunc': {'flops': '17.19 G', 'macs': '8.59 GMACs', 'duration': '675.84 us', 'FLOPS': '25.43 TFLOPS', 'flops%': '15.64%', 'macs%': '15.67%', 'duration%/allfuncs': '7.79%', 'duration%/e2e': '9.44%'}}, functionals_duration = 1.26 ms, 64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (3): InstanceNorm2d(module = {'param': '0', 'flops': '67.11 M', 'macs': '0 MACs', 'duration': '468.49 us', 'FLOPS': '143.24 GFLOPS', 'params%': '0.00%', 'flops%': '0.06%', 'macs%': '0.00%', 'duration%': '6.55%'}, functionals = {'instance_norm': {'flops': '33.55 M', 'macs': '0 MACs', 'duration': '244.74 us', 'FLOPS': '137.1 GFLOPS', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%/allfuncs': '2.82%', 'duration%/e2e': '3.42%'}, 'newFunc': {'flops': '33.55 M', 'macs': '0 MACs', 'duration': '334.85 us', 'FLOPS': '100.21 GFLOPS', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%/allfuncs': '3.86%', 'duration%/e2e': '4.68%'}}, functionals_duration = 579.58 us, 128, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
    (4): LeakyReLU(module = {'param': '0', 'flops': '16.78 M', 'macs': '0 MACs', 'duration': '239.61 us', 'FLOPS': '70.02 GFLOPS', 'params%': '0.00%', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%': '3.35%'}, functionals = {'leaky_relu': {'flops': '8.39 M', 'macs': '0 MACs', 'duration': '74.75 us', 'FLOPS': '112.22 GFLOPS', 'flops%': '0.01%', 'macs%': '0.00%', 'duration%/allfuncs': '0.86%', 'duration%/e2e': '1.04%'}, 'newFunc': {'flops': '8.39 M', 'macs': '0 MACs', 'duration': '148.48 us', 'FLOPS': '56.5 GFLOPS', 'flops%': '0.01%', 'macs%': '0.00%', 'duration%/allfuncs': '1.71%', 'duration%/e2e': '2.07%'}}, functionals_duration = 223.23 us, negative_slope=0.2, inplace=True)
    (5): Conv2d(module = {'param': '524.54 k', 'flops': '34.37 G', 'macs': '17.18 GMACs', 'duration': '713.83 us', 'FLOPS': '48.15 TFLOPS', 'params%': '18.95%', 'flops%': '31.27%', 'macs%': '31.33%', 'duration%': '9.97%'}, functionals = {'conv2d': {'flops': '17.18 G', 'macs': '8.59 GMACs', 'duration': '490.5 us', 'FLOPS': '35.03 TFLOPS', 'flops%': '15.64%', 'macs%': '15.67%', 'duration%/allfuncs': '5.65%', 'duration%/e2e': '6.85%'}, 'newFunc': {'flops': '17.18 G', 'macs': '8.59 GMACs', 'duration': '586.75 us', 'FLOPS': '29.29 TFLOPS', 'flops%': '15.64%', 'macs%': '15.67%', 'duration%/allfuncs': '6.76%', 'duration%/e2e': '8.20%'}}, functionals_duration = 1.08 ms, 128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (6): InstanceNorm2d(module = {'param': '0', 'flops': '33.55 M', 'macs': '0 MACs', 'duration': '348.09 us', 'FLOPS': '96.4 GFLOPS', 'params%': '0.00%', 'flops%': '0.03%', 'macs%': '0.00%', 'duration%': '4.86%'}, functionals = {'instance_norm': {'flops': '16.78 M', 'macs': '0 MACs', 'duration': '152.58 us', 'FLOPS': '109.96 GFLOPS', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%/allfuncs': '1.76%', 'duration%/e2e': '2.13%'}, 'newFunc': {'flops': '16.78 M', 'macs': '0 MACs', 'duration': '229.38 us', 'FLOPS': '73.14 GFLOPS', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%/allfuncs': '2.64%', 'duration%/e2e': '3.20%'}}, functionals_duration = 381.95 us, 256, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
    (7): LeakyReLU(module = {'param': '0', 'flops': '8.39 M', 'macs': '0 MACs', 'duration': '226.97 us', 'FLOPS': '36.96 GFLOPS', 'params%': '0.00%', 'flops%': '0.01%', 'macs%': '0.00%', 'duration%': '3.17%'}, functionals = {'leaky_relu': {'flops': '4.19 M', 'macs': '0 MACs', 'duration': '61.44 us', 'FLOPS': '68.27 GFLOPS', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%/allfuncs': '0.71%', 'duration%/e2e': '0.86%'}, 'newFunc': {'flops': '4.19 M', 'macs': '0 MACs', 'duration': '134.14 us', 'FLOPS': '31.27 GFLOPS', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%/allfuncs': '1.55%', 'duration%/e2e': '1.87%'}}, functionals_duration = 195.58 us, negative_slope=0.2, inplace=True)
    (8): Conv2d(module = {'param': '2.1 M', 'flops': '34.36 G', 'macs': '17.18 GMACs', 'duration': '762.46 us', 'FLOPS': '45.07 TFLOPS', 'params%': '75.79%', 'flops%': '31.27%', 'macs%': '31.33%', 'duration%': '10.65%'}, functionals = {'conv2d': {'flops': '17.18 G', 'macs': '8.59 GMACs', 'duration': '519.17 us', 'FLOPS': '33.1 TFLOPS', 'flops%': '15.63%', 'macs%': '15.67%', 'duration%/allfuncs': '5.98%', 'duration%/e2e': '7.25%'}, 'newFunc': {'flops': '17.18 G', 'macs': '8.59 GMACs', 'duration': '612.35 us', 'FLOPS': '28.06 TFLOPS', 'flops%': '15.63%', 'macs%': '15.67%', 'duration%/allfuncs': '7.06%', 'duration%/e2e': '8.56%'}}, functionals_duration = 1.13 ms, 256, 512, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
    (9): InstanceNorm2d(module = {'param': '0', 'flops': '16.78 M', 'macs': '0 MACs', 'duration': '341.18 us', 'FLOPS': '49.17 GFLOPS', 'params%': '0.00%', 'flops%': '0.02%', 'macs%': '0.00%', 'duration%': '4.77%'}, functionals = {'instance_norm': {'flops': '8.39 M', 'macs': '0 MACs', 'duration': '149.5 us', 'FLOPS': '56.11 GFLOPS', 'flops%': '0.01%', 'macs%': '0.00%', 'duration%/allfuncs': '1.72%', 'duration%/e2e': '2.09%'}, 'newFunc': {'flops': '8.39 M', 'macs': '0 MACs', 'duration': '223.23 us', 'FLOPS': '37.58 GFLOPS', 'flops%': '0.01%', 'macs%': '0.00%', 'duration%/allfuncs': '2.57%', 'duration%/e2e': '3.12%'}}, functionals_duration = 372.74 us, 512, eps=1e-05, momentum=0.1, affine=False, track_running_stats=False)
    (10): LeakyReLU(module = {'param': '0', 'flops': '4.19 M', 'macs': '0 MACs', 'duration': '234.37 us', 'FLOPS': '17.9 GFLOPS', 'params%': '0.00%', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%': '3.27%'}, functionals = {'leaky_relu': {'flops': '2.1 M', 'macs': '0 MACs', 'duration': '64.51 us', 'FLOPS': '32.51 GFLOPS', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%/allfuncs': '0.74%', 'duration%/e2e': '0.90%'}, 'newFunc': {'flops': '2.1 M', 'macs': '0 MACs', 'duration': '138.24 us', 'FLOPS': '15.17 GFLOPS', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%/allfuncs': '1.59%', 'duration%/e2e': '1.93%'}}, functionals_duration = 202.75 us, negative_slope=0.2, inplace=True)
    (11): ZeroPad2d(module = {'param': '0', 'flops': '0', 'macs': '0 MACs', 'duration': '180.72 us', 'FLOPS': '0.0 FLOPS', 'params%': '0.00%', 'flops%': '0.00%', 'macs%': '0.00%', 'duration%': '2.52%'}, functionals = {}, functionals_duration = 0.0, (1, 0, 1, 0))
    (12): Conv2d(module = {'param': '8.19 k', 'flops': '134.22 M', 'macs': '67.11 MMACs', 'duration': '1.03 ms', 'FLOPS': '129.95 GFLOPS', 'params%': '0.30%', 'flops%': '0.12%', 'macs%': '0.12%', 'duration%': '14.43%'}, functionals = {'conv2d': {'flops': '67.11 M', 'macs': '33.55 MMACs', 'duration': '802.82 us', 'FLOPS': '83.59 GFLOPS', 'flops%': '0.06%', 'macs%': '0.06%', 'duration%/allfuncs': '9.25%', 'duration%/e2e': '11.22%'}, 'newFunc': {'flops': '67.11 M', 'macs': '33.55 MMACs', 'duration': '896.0 us', 'FLOPS': '74.9 GFLOPS', 'flops%': '0.06%', 'macs%': '0.06%', 'duration%/allfuncs': '10.32%', 'duration%/e2e': '12.52%'}}, functionals_duration = 1.7 ms, 512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1), bias=False)
  )
)
------------------------------------------------------------------------------
Step 5: FLOPS - 109.9 G, MACs - 54.83 GMACs, Params - 2.77 M
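As a sanity check on the Discriminator numbers above, the per-layer conv MACs can be recomputed by hand from the layer shapes printed in the profile. This is only a sketch: the log does not state the input shape, so a single 1024x1024 input (batch 1) is an assumption here (a batch of 4 at 512x512 would give the same totals), bias terms are ignored, and the helper names are hypothetical.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a conv layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def conv2d_macs(c_in, c_out, kernel, out_size, batch=1):
    """MACs for a square conv with a square output map (bias ignored)."""
    return batch * c_out * out_size * out_size * c_in * kernel * kernel


# (c_in, c_out, kernel, stride, padding) as printed for layers 0, 2, 5, 8, 12.
layers = [
    (6, 64, 4, 2, 1),
    (64, 128, 4, 2, 1),
    (128, 256, 4, 2, 1),
    (256, 512, 4, 2, 1),
    (512, 1, 4, 1, 1),
]

size = 1024  # assumed input resolution, not stated in the log
per_layer = []
for i, (c_in, c_out, k, s, p) in enumerate(layers):
    if i == len(layers) - 1:
        size += 1  # ZeroPad2d((1, 0, 1, 0)) adds one row and one column
    size = conv_out_size(size, k, s, p)
    per_layer.append(conv2d_macs(c_in, c_out, k, size))

total = sum(per_layer)
for macs in per_layer:
    print(f"{macs / 1e9:.2f} GMACs")
print(f"total: {total / 1e9:.2f} GMACs")
```

Under that assumption the total comes to 27.41 GMACs, which matches the `conv2d` functional entry. The module-level figure of 54.83 GMACs looks like the sum of the `conv2d` and `newFunc` entries, which carry identical MACs, so the headline totals may be counting each convolution twice.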

-------------------------- Flops Profiler --------------------------
Profile on Device: cuda:0
Profile Summary at step 5:
Notations:
data parallel size (dp_size), model parallel size(mp_size),
number of parameters (params), number of multiply-accumulate operations(MACs),
number of floating-point operations (flops), floating-point operations per second (FLOPS),
fwd latency (forward propagation latency), bwd latency (backward propagation latency),
step (weights update latency), iter latency (sum of fwd, bwd and step latency)

params per device:                                            149.33 M
params of model = params per device * mp_size:                149.33 M
fwd MACs per device:                                          3532.31 GMACs
fwd flops per device:                                         7079.63 G
fwd flops of model = fwd flops per device * mp_size:          7079.63 G
fwd latency:                                                  260.03 ms
fwd FLOPS per device = fwd flops per device / fwd latency:    27.23 TFLOPS

----------------------------- Aggregated Profile per Device -----------------------------
Top 1 modules in terms of params, flops, MACs or duration at different model depths:
depth 0:
    params      - {'Gbase': '149.33 M'}
    flops       - {'Gbase': '7079.63 G'}
    MACs        - {'Gbase': '3532.31 GMACs'}
    fwd latency - {'Gbase': '260.03 ms'}
depth 1:
    params      - {'G3d': '48.56 M'}
    flops       - {'Eapp': '2707.29 G'}
    MACs        - {'Eapp': '1351.3 GMACs'}
    fwd latency - {'Emtn': '171.0 ms'}
depth 2:
    params      - {'Sequential': '99.76 M'}
    flops       - {'Sequential': '2606.6 G'}
    MACs        - {'Sequential': '1300.07 GMACs'}
    fwd latency - {'Sequential': '113.98 ms'}
depth 3:
    params      - {'ResBlock3D': '48.32 M'}
    flops       - {'Sequential': '1521.42 G'}
    MACs        - {'Sequential': '758.2 GMACs'}
    fwd latency - {'Sequential': '117.84 ms'}
depth 4:
    params      - {'Conv3d': '62.76 M'}
    flops       - {'Conv2d': '1449.83 G'}
    MACs        - {'Conv2d': '724.78 GMACs'}
    fwd latency - {'BasicBlock': '93.86 ms'}
depth 5:
    params      - {'Conv2d': '29.97 M'}
    flops       - {'Conv2d': '1530.98 G'}
    MACs        - {'Conv2d': '765.44 GMACs'}
    fwd latency - {'Conv2d': '52.09 ms'}

------------------------------------------------------------------------------
Step 5: FLOPS - 109.9 G, MACs - 54.83 GMACs, Params - 2.77 M
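The "fwd FLOPS per device" lines in both summaries are simply fwd flops divided by fwd latency, as the label says. A minimal sketch that recomputes them from the numbers in the log (the function name is hypothetical):

```python
def achieved_tflops(flops_g, latency_ms):
    """Throughput in TFLOPS from flops (in G) and forward latency (in ms)."""
    return (flops_g * 1e9) / (latency_ms * 1e-3) / 1e12


# Values copied from the two profile summaries above.
disc_tflops = achieved_tflops(109.9, 7.16)       # Discriminator
gbase_tflops = achieved_tflops(7079.63, 260.03)  # Gbase

# flops should be roughly 2x MACs (one multiply plus one add per MAC).
disc_ratio = 109.9 / 54.83
gbase_ratio = 7079.63 / 3532.31

print(f"Discriminator: {disc_tflops:.2f} TFLOPS, flops/MACs = {disc_ratio:.3f}")
print(f"Gbase:         {gbase_tflops:.2f} TFLOPS, flops/MACs = {gbase_ratio:.3f}")
```

Both ratios land slightly above 2, the small excess coming from non-MAC ops such as activations and norms. Note 2 in the profiler output also applies: the flop counts are theoretical estimates, so these throughput figures are not measured hardware utilization.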

out.txt