Closed · zarzen closed this issue 4 years ago
Hi @zarzen!
Thanks for trying Skyline!
I think there might be a problem in your code: the input tensor did not have `requires_grad` set to `True`. The line `inputs.requires_grad_ = True` should actually be `inputs.requires_grad_()` (or, equivalently, `inputs.requires_grad = True`, without the trailing underscore).
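To see why the assignment form silently fails, here is a minimal pure-Python sketch (the `Tensor` class below is a toy stand-in, not the real `torch.Tensor`): assigning to `requires_grad_` just rebinds that attribute name to `True` on the instance, shadowing the method, so the in-place setter never runs.

```python
class Tensor:
    """Toy stand-in illustrating the pitfall (not the real torch.Tensor)."""
    def __init__(self):
        self.requires_grad = False

    def requires_grad_(self):
        # In-place setter, mirroring torch's method of the same name.
        self.requires_grad = True
        return self


t = Tensor()
t.requires_grad_ = True   # rebinds the name to True; the method never runs
print(t.requires_grad)    # False

t2 = Tensor()
t2.requires_grad_()       # actually calls the in-place setter
print(t2.requires_grad)   # True
```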
With this modified script:
```python
from skyline.profiler.operation import OperationProfiler
import torch.nn.functional as F
import torch
import torch.nn as nn
import numpy as np


def main():
    bs = 2048
    in_feature = 1024
    out_feature = 1024

    std_dev = np.sqrt(2 / (in_feature + out_feature))
    weights = np.random.normal(0, std_dev, size=(out_feature, in_feature)).astype(np.float32)
    std_dev = np.sqrt(1 / out_feature)
    bias = np.random.normal(0, std_dev, size=out_feature).astype(np.float32)

    weights = nn.Parameter(torch.tensor(weights, device='cuda'), requires_grad=True)
    bias = nn.Parameter(torch.tensor(bias, device='cuda'), requires_grad=True)

    op_prof = OperationProfiler(warm_up=10, measure_for=20)

    print('PyTorch version:', torch.__version__)
    print('GPU:', torch.cuda.get_device_name())

    print('---')
    print('inputs.requires_grad_()')
    inputs = torch.rand((bs, in_feature)).cuda()
    inputs.requires_grad_()  # <-------------------- This line
    fwdt, bwdt = op_prof.measure_operation_ms(F.linear, (inputs, weights, bias), {})
    print('fwd', fwdt, 'ms', '; bwd', bwdt, 'ms')

    print('---')
    print('inputs = nn.Parameter(inputs, requires_grad=True)')
    inputs = torch.rand((bs, in_feature)).cuda()
    inputs = nn.Parameter(inputs, requires_grad=True)
    fwdt, bwdt = op_prof.measure_operation_ms(F.linear, (inputs, weights, bias), {})
    print('fwd', fwdt, 'ms', '; bwd', bwdt, 'ms')


if __name__ == "__main__":
    main()
```
I get:
```
PyTorch version: 1.6.0
GPU: GeForce RTX 2070
---
inputs.requires_grad_()
fwd 0.8831232070922852 ms ; bwd 1.5012399673461914 ms
---
inputs = nn.Parameter(inputs, requires_grad=True)
fwd 0.8678496360778809 ms ; bwd 1.472152042388916 ms
```
which is what I think you expected to see?
What the `OperationProfiler` does when measuring the backward pass for an output tensor `o` is measure the time it takes to run all* the gradient functions in the backward graph, starting from `o.grad_fn` down to the leaf tensors. Since `inputs.requires_grad_ = True` doesn't actually set `inputs.requires_grad` to `True`, the backward pass does not propagate the gradient to the `inputs` tensor. This means one fewer matrix multiplication is needed, which would explain why you saw similar run times for the forward and backward passes.
*By default it also excludes any `AccumulateGrad`s in the backward graph, but that would not have been the cause of the discrepancy you saw.
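As a quick way to confirm this behavior independent of Skyline, here is a small sketch using plain `F.linear` on the CPU: when the input does not require gradients, autograd never computes the gradient with respect to it, which is exactly the matrix multiplication that goes missing from the backward timing.

```python
import torch
import torch.nn.functional as F

w = torch.rand(4, 3, requires_grad=True)

# Input with requires_grad == False: no gradient flows back to it.
x = torch.rand(2, 3)
F.linear(x, w).sum().backward()
print(x.grad)          # None: the grad-w.r.t.-input matmul is skipped

# Input with requires_grad == True: one extra matmul in the backward pass.
x2 = torch.rand(2, 3).requires_grad_()
F.linear(x2, w).sum().backward()
print(x2.grad.shape)   # torch.Size([2, 3])
```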
I see, thanks for your clarification!
Hi @geoffxy, thanks for this awesome project! I found that the first element of `args` in `measure_operation_ms` here, https://github.com/skylineprof/skyline/blob/master/cli/skyline/profiler/operation.py#L18, is a `torch.Tensor`. Will the backward timing count the computation time for calculating the gradients with respect to this first argument (the input)? I created the following script to test this. If we don't wrap `inputs` as an `nn.Parameter`, the backward time is roughly equal to the forward time, which seems counterintuitive to me. If we wrap `inputs` as an `nn.Parameter`, then the backward pass takes roughly twice the forward cost, which seems correct.