
JIT: Tracing faster than scripting #34483

Open mys007 opened 4 years ago

mys007 commented 4 years ago

🐛 Bug

Calling a traced module in a for-loop with a constant number of iterations from a scripted module is slower than calling it from a traced module, at least with CUDA.

To Reproduce

import time
import numpy as np
import torch
import torch.nn as nn
import torchvision.models as tvmodels

class MyModule(nn.Module):

    def __init__(self, trace):
        super().__init__()
        self.worker_module = tvmodels.resnet18().cuda()
        if trace:
            # Trace the inner worker once on a fixed-size dummy input.
            with torch.no_grad():
                self.worker_module = torch.jit.trace(self.worker_module, torch.zeros(1, 3, 256, 256, device='cuda'))

    def forward(self, z):
        x = self.worker_module(z)
        # Constant-trip-count loop: ten more calls to the same worker.
        for i in range(10):
            x += self.worker_module(z)
        return x

if __name__ == "__main__":

    # Baseline: the plain eager-mode module.
    input = torch.zeros(1, 3, 512, 512, device='cuda')
    model = MyModule(trace=False).cuda()

    torch.cuda.synchronize()
    time_spent = []
    for i in range(1000):
        start_time = time.perf_counter()
        with torch.no_grad():
            model(input)
        torch.cuda.synchronize()
        time_spent.append(time.perf_counter() - start_time)
    print('Avg time (ms): {:.3f} +- {:.3f}'.format(np.mean(time_spent) * 1000, np.std(time_spent) * 1000))

    # Variant 1: trace the outer module as well; the for-loop in
    # forward() is unrolled into the trace.
    model_J = MyModule(trace=True).cuda()
    with torch.no_grad():
        JIT_model = torch.jit.trace(model_J, torch.zeros(input.shape, device='cuda')).cuda()
    #print(JIT_model.code)

    torch.cuda.synchronize()
    time_spent = []
    for i in range(1000):
        start_time = time.perf_counter()
        with torch.no_grad():
            JIT_model(input)
        torch.cuda.synchronize()
        time_spent.append(time.perf_counter() - start_time)
    print('Avg traced time (ms): {:.3f} +- {:.3f}'.format(np.mean(time_spent) * 1000, np.std(time_spent) * 1000))

    # Variant 2: script the outer module instead; the for-loop is
    # preserved as a real loop in the graph.
    model_J = MyModule(trace=True).cuda()
    with torch.no_grad():
        JIT_model = torch.jit.script(model_J).cuda()
    #print(JIT_model.code)

    torch.cuda.synchronize()
    time_spent = []
    for i in range(1000):
        start_time = time.perf_counter()
        with torch.no_grad():
            JIT_model(input)
        torch.cuda.synchronize()
        time_spent.append(time.perf_counter() - start_time)
    print('Avg scripted time (ms): {:.3f} +- {:.3f}'.format(np.mean(time_spent) * 1000, np.std(time_spent) * 1000))

Running it on an RTX 2080 Ti gives me:

Avg time (ms): 33.303 +- 9.448
Avg traced time (ms): 27.652 +- 0.468
Avg scripted time (ms): 29.582 +- 1.361

The scripted model is slower and has a less uniform running time.

Expected behavior

Tracing and scripting should produce comparable running times.

Environment

PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: version 3.10.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: GeForce RTX 2080 Ti
GPU 1: GeForce RTX 2080 Ti
GPU 2: GeForce RTX 2080 Ti
GPU 3: GeForce RTX 2080 Ti

Nvidia driver version: 440.33.01
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] inferno-pytorch==0.3.1
[pip] numpy==1.16.2
[pip] pytorch-memlab==0.0.4
[pip] robust-loss-pytorch==0.0.2
[pip] torch==1.4.0
[pip] torch-dct==0.1.5
[pip] torchfile==0.1.0
[pip] torchvision==0.5.0
[conda] blas 1.0 mkl
[conda] cuda100 1.0 0 pytorch
[conda] inferno-pytorch 0.3.1 dev_0
[conda] mkl 2019.1 144
[conda] mkl_fft 1.0.10 py37ha843d7b_0
[conda] mkl_random 1.0.2 py37hd81dba3_0
[conda] pytorch 1.4.0 py3.7_cuda10.1.243_cudnn7.6.3_0 pytorch
[conda] pytorch-memlab 0.0.4 pypi_0 pypi
[conda] robust-loss-pytorch 0.0.2 pypi_0 pypi
[conda] torch-dct 0.1.5 pypi_0 pypi
[conda] torchfile 0.1.0 pypi_0 pypi
[conda] torchvision 0.5.0 py37_cu101 pytorch

cc @suo

suo commented 4 years ago

This is not entirely unexpected; scripting preserves actual loop semantics, while tracing unrolls all loops (since tracing merely records what happened when you ran your model on the example inputs). Generally, the overhead from scripting is negligible, since tensor operations dominate wall time.
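For illustration (a toy sketch, not taken from this issue; TinyLoop is a made-up module), comparing the generated code shows the difference:

import torch
import torch.nn as nn

class TinyLoop(nn.Module):
    def forward(self, x):
        for i in range(3):  # constant trip count
            x = x + 1
        return x

# Tracing records the ops that actually ran: the generated code
# contains the addition three times and no loop at all.
traced = torch.jit.trace(TinyLoop(), torch.zeros(1))
print(traced.code)

# Scripting compiles the Python source: the generated code keeps a
# real loop, which the interpreter executes at runtime.
scripted = torch.jit.script(TinyLoop())
print(scripted.code)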

We do have work in progress to close this gap even in cases where overhead is important (models with lots of small tensor ops, models with lots of scalar math, etc.), but it's not fully landed yet.

mys007 commented 4 years ago

Thanks for the reply! Indeed, the scripted module doesn't have the loop unrolled. But I'm wondering where the 7% running-time overhead comes from, as the loop condition is deterministic and doesn't involve any Tensor operations one has to synchronize on... I'm not sure the operation fusion in the upcoming work you've suggested will address this?
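One way I might try to isolate that overhead (my own sketch, assuming the cost sits in the interpreter executing the loop rather than in the tensor ops): time a scripted loop over a trivial op against its traced, i.e. unrolled, equivalent, so tensor work can't dominate.

import time
import torch

def body(x):
    for i in range(10):
        x = x + 1
    return x

x = torch.zeros(1, device='cuda')
unrolled = torch.jit.trace(body, x)  # loop unrolled into 10 adds
looped = torch.jit.script(body)      # loop kept in the graph

for name, fn in [('unrolled', unrolled), ('looped', looped)]:
    fn(x)  # warm-up
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(1000):
        fn(x)
    torch.cuda.synchronize()
    print(name, '{:.3f} ms total'.format((time.perf_counter() - t0) * 1000))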