Closed · zarzen closed this issue 4 years ago
Hi @zarzen!
Thanks for trying Skyline!
I think there might be a problem in your code: the input tensor did not have `requires_grad` set to `True`. The line `inputs.requires_grad_ = True` should actually be `inputs.requires_grad_()` (or, equivalently, `inputs.requires_grad = True`, without the trailing underscore).
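To see why the assignment form silently fails, here is a minimal pure-Python sketch (the `Tensor` class below is a toy stand-in, not the real `torch.Tensor`): assigning to `requires_grad_` just rebinds that attribute name to `True` on the instance, shadowing the method, so the in-place setter never runs.

```python
class Tensor:
    """Toy stand-in illustrating the pitfall (not the real torch.Tensor)."""
    def __init__(self):
        self.requires_grad = False

    def requires_grad_(self):
        # In-place setter, mirroring torch's method of the same name.
        self.requires_grad = True
        return self


t = Tensor()
t.requires_grad_ = True   # rebinds the name to True; the method never runs
print(t.requires_grad)    # False

t2 = Tensor()
t2.requires_grad_()       # actually calls the in-place setter
print(t2.requires_grad)   # True
```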
With this modified script:
```python
from skyline.profiler.operation import OperationProfiler
import torch.nn.functional as F
import torch
import torch.nn as nn
import numpy as np


def main():
    bs = 2048
    in_feature = 1024
    out_feature = 1024

    std_dev = np.sqrt(2 / (in_feature + out_feature))
    weights = np.random.normal(0, std_dev, size=(out_feature, in_feature)).astype(np.float32)
    std_dev = np.sqrt(1 / out_feature)
    bias = np.random.normal(0, std_dev, size=out_feature).astype(np.float32)

    weights = nn.Parameter(torch.tensor(weights, device='cuda'), requires_grad=True)
    bias = nn.Parameter(torch.tensor(bias, device='cuda'), requires_grad=True)

    op_prof = OperationProfiler(warm_up=10, measure_for=20)

    print('PyTorch version:', torch.__version__)
    print('GPU:', torch.cuda.get_device_name())

    print('---')
    print('inputs.requires_grad_()')
    inputs = torch.rand((bs, in_feature)).cuda()
    inputs.requires_grad_()  # <-------------------- This line
    fwdt, bwdt = op_prof.measure_operation_ms(F.linear, (inputs, weights, bias), {})
    print('fwd', fwdt, 'ms', '; bwd', bwdt, 'ms')

    print('---')
    print('inputs = nn.Parameter(inputs, requires_grad=True)')
    inputs = torch.rand((bs, in_feature)).cuda()
    inputs = nn.Parameter(inputs, requires_grad=True)
    fwdt, bwdt = op_prof.measure_operation_ms(F.linear, (inputs, weights, bias), {})
    print('fwd', fwdt, 'ms', '; bwd', bwdt, 'ms')


if __name__ == "__main__":
    main()
```
I get:
```
PyTorch version: 1.6.0
GPU: GeForce RTX 2070
---
inputs.requires_grad_()
fwd 0.8831232070922852 ms ; bwd 1.5012399673461914 ms
---
inputs = nn.Parameter(inputs, requires_grad=True)
fwd 0.8678496360778809 ms ; bwd 1.472152042388916 ms
```
which is what I think you expected to see?
What the `OperationProfiler` does when measuring the backward pass for an output tensor `o` is measure the time it takes to run all* the gradient functions in the backward graph, starting from `o.grad_fn` down to the leaf tensors. Since `inputs.requires_grad_ = True` doesn't actually set `inputs.requires_grad` to `True`, the backward pass does not propagate the gradient to the `inputs` tensor. This means one fewer matrix multiplication is needed, which would explain why you saw similar run times for the forward and backward passes.
*By default it also excludes any `AccumulateGrad`s in the backward graph, but that would not have been the cause of the discrepancy you saw.
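As a quick way to confirm this behavior independent of Skyline, here is a small sketch using plain `F.linear` on the CPU: when the input does not require gradients, autograd never computes the gradient with respect to it, which is exactly the matrix multiplication that goes missing from the backward timing.

```python
import torch
import torch.nn.functional as F

w = torch.rand(4, 3, requires_grad=True)

# Input with requires_grad == False: no gradient flows back to it.
x = torch.rand(2, 3)
F.linear(x, w).sum().backward()
print(x.grad)          # None: the grad-w.r.t.-input matmul is skipped

# Input with requires_grad == True: one extra matmul in the backward pass.
x2 = torch.rand(2, 3).requires_grad_()
F.linear(x2, w).sum().backward()
print(x2.grad.shape)   # torch.Size([2, 3])
```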
I see, thanks for your clarification!
Hi @geoffxy, thanks for this awesome project! I found that the first element of `args` in `measure_operation_ms` here, https://github.com/skylineprof/skyline/blob/master/cli/skyline/profiler/operation.py#L18, is a `torch.Tensor`. Will the backward timing count the computation time for calculating the gradients with respect to this first argument (the input)? I created the following script to test this. If we don't wrap `inputs` as an `nn.Parameter`, the backward time is roughly equal to the forward time, which seems counterintuitive to me. If we wrap `inputs` as an `nn.Parameter`, then the backward pass takes roughly twice the forward cost, which seems correct.