pytorch / TensorRT

PyTorch/TorchScript/FX compiler for NVIDIA GPUs using TensorRT
https://pytorch.org/TensorRT
BSD 3-Clause "New" or "Revised" License

🐛 [Bug] Multi-GPU model moved to single GPU #2269

Closed austinapatel closed 1 year ago

austinapatel commented 1 year ago

Bug Description

I'm experimenting with TorchTRT on a model partitioned across two GPUs using pipeline parallelism. The first half of the network is on GPU0 and the second half is on GPU1. In PyTorch eager mode, the kernels for each layer execute on their assigned GPU, as expected. When I compile the network with TorchTRT, the whole network is moved to one of the GPUs and executed there, rather than staying split across GPUs. This prevents using TorchTRT with very large models that do not fit in the memory of a single GPU.
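
As a quick sanity check of the eager placement described above, parameter names can be grouped by the device they live on. This is a minimal sketch, not part of the original report: group_by_device is a hypothetical helper, and it only reflects where the eager module's parameters sit. It does not show where compiled TRT engines execute, for which the NSYS trace remains the reference.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_device(named_devices: Iterable[Tuple[str, object]]) -> Dict[str, List[str]]:
    """Group names by device (hypothetical helper, not a TorchTRT API).

    Each item is a (name, device) pair; the device is converted with str(),
    so both plain strings and torch.device objects are accepted.
    """
    groups = defaultdict(list)
    for name, device in named_devices:
        groups[str(device)].append(name)
    return dict(groups)


# With a real model (requires PyTorch), the pairs could come from:
#   group_by_device((n, p.device) for n, p in model.named_parameters())
if __name__ == "__main__":
    # Illustrative pairs mirroring the layer placement in the script below.
    example = [
        ("fc1.weight", "cuda:0"),
        ("fc2.weight", "cuda:0"),
        ("mha.in_proj_weight", "cuda:1"),
    ]
    for device, names in sorted(group_by_device(example).items()):
        print(device, names)
```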

To Reproduce

Steps to reproduce the behavior:

  1. Run the profiling command: nsys profile --trace cuda,nvtx --sample cpu --force-overwrite true --output profiling_results/tmp --gpu-metrics-device=all --gpu-metrics-frequency=20000 python torchtrt_multigpu_issue.py

torchtrt_multigpu_issue.py:

import torch
from torch import nn
from torch.nn import functional as F
import torch_tensorrt

"""NETWORK"""
class SimpleNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(10, 100).to('cuda:0')
        self.fc2 = nn.Linear(100, 5).to('cuda:0')
        self.mha = nn.MultiheadAttention(embed_dim=5, num_heads=1).to('cuda:1')

    def forward(self, x):
        x = x.to('cuda:0')
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        x = F.relu(x)
        x = x.to('cuda:1')
        x, _ = self.mha(x, x, x)
        x = F.relu(x)
        return x

"""PROFILING UTILITIES"""
def timed(fn, msg):
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.nvtx.mark(f'timed start: {msg}')
    start.record()
    result = fn()
    end.record()
    torch.cuda.nvtx.mark(f'timed end: {msg}')
    torch.cuda.synchronize()
    return result, start.elapsed_time(end)

def profile(method, nvtx_msg=None):
    # profiling
    print('Profiling with torch NVTX markers enabled')
    torch.cuda.cudart().cudaProfilerStart()
    with torch.autograd.profiler.emit_nvtx(record_shapes=True):
        if nvtx_msg:
            torch.cuda.nvtx.mark(f'profile start: {nvtx_msg}')
        result = method()
        if nvtx_msg:
            torch.cuda.nvtx.mark(f'profile end: {nvtx_msg}')
    torch.cuda.cudart().cudaProfilerStop()

    return result

def run_two_models(model1, model2, inp):
    # Time one forward pass of each model; the outputs themselves are not needed.
    torch.cuda.synchronize()

    _, time1 = timed(lambda: model1(inp), 'profiling model1')
    _, time2 = timed(lambda: model2(inp), 'profiling model2')

    return time1, time2

"""MAIN"""
if __name__ == '__main__':
    inp = torch.ones((10,10), device='cuda:0')

    # Eager
    eager_model = SimpleNetwork()
    eager_model.eval()
    print(f'Calling eager model')
    eager_model(inp)

    # TorchTRT
    torch._dynamo.reset()
    trt_model = torch_tensorrt.compile(eager_model, ir="torch_compile", inputs=inp, use_python_runtime=False)
    print(f'Calling trt_model')
    trt_model(inp)

    # Profiling
    time_eager, time_trt = profile(lambda: run_two_models(eager_model, trt_model, inp))
    print(f'Eager time: {time_eager}ms\nTime TRT: {time_trt}ms')

Output:

INFO:torch_tensorrt.dynamo.utils:Using Default Torch-TRT Runtime (as requested by user)
INFO:torch_tensorrt.dynamo.utils:Compilation Settings: CompilationSettings(precision=torch.float32, debug=False, workspace_size=0, min_block_size=5, torch_executed_ops=set(), pass_through_build_failures=False, max_aux_streams=None, version_compatible=False, optimization_level=None, use_python_runtime=False, truncate_long_and_double=False, use_fast_partitioner=True, enable_experimental_decompositions=False)

INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.036864
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:12.406040
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 10752 bytes of Memory
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT INetwork construction elapsed time: 0:00:00.002983
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:Build TRT engine elapsed time: 0:00:00.182259
INFO:torch_tensorrt.dynamo.conversion._TRTInterpreter:TRT Engine uses: 1024 bytes of Memory
WARNING: [Torch-TensorRT] - Input 5 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 6 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 7 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 8 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 9 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 10 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 11 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 12 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 5 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 6 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 7 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 8 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 9 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 10 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 11 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
WARNING: [Torch-TensorRT] - Input 12 of engine _run_on_acc_1_engine was found to be on cuda:1 but should be on cuda:0. This tensor is being moved by the runtime but for performance considerations, ensure your inputs are all on GPU and open an issue here (https://github.com/pytorch/TensorRT/issues) if this warning persists.
Calling eager model
Calling trt_model
Profiling with torch NVTX markers enabled
Eager time: 1.8095359802246094ms
Time TRT: 3.959264039993286ms
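
The repeated warnings above are easier to read when tallied per engine and device pair. The following is a small sketch that assumes the warning wording stays exactly as shown in this output; summarize_device_moves is a hypothetical helper, not something provided by Torch-TensorRT.

```python
import re
from collections import Counter

# Matches the Torch-TensorRT warning text as it appears in the output above.
WARNING_RE = re.compile(
    r"Input (?P<index>\d+) of engine (?P<engine>\S+) was found to be on "
    r"(?P<found>cuda:\d+) but should be on (?P<expected>cuda:\d+)"
)


def summarize_device_moves(log_text: str) -> Counter:
    """Count warnings per (engine, device found, device expected)."""
    counts = Counter()
    for match in WARNING_RE.finditer(log_text):
        key = (match["engine"], match["found"], match["expected"])
        counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "WARNING: [Torch-TensorRT] - Input 5 of engine _run_on_acc_1_engine "
        "was found to be on cuda:1 but should be on cuda:0. This tensor is "
        "being moved by the runtime.\n"
    )
    for (engine, found, expected), n in summarize_device_moves(sample).items():
        print(f"{engine}: {n} input(s) moved {found} -> {expected}")
```

Applied to the output above, every warning falls into a single group: inputs of _run_on_acc_1_engine found on cuda:1 and moved to cuda:0, which matches the observation that the second half of the network ends up on GPU0.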

Note: if you are using a TorchTRT debug build, you might get an error about profiling already being enabled. In that case, you can either:

  1. Replace time_eager, time_trt = profile(lambda: run_two_models(eager_model, trt_model, inp)) with time_eager, time_trt = run_two_models(eager_model, trt_model, inp)
  2. Change https://github.com/pytorch/TensorRT/blob/b774440657f3f998b785a65fe7504e8efc8c67b6/setup.py#L141 from dbg to opt and re-build TorchTRT.

Expected behavior

I expected separate TRT engines to be created and executed on each GPU, rather than everything being moved to a single GPU and all engines executing there. Is there a workaround for this, or is there a more fundamental limitation that prevents it? With the PyTorch Inductor backend, I am able to create compiled components that use both GPUs, which is the behavior I expected here.

Environment

Additional context

NSYS trace for TorchTRT. Note that all layers are copied to one GPU and all engines execute on that GPU. (Image: multigpu_trt_annotated)

NSYS trace for Torch Inductor. Note that the model parallelism is followed and both GPUs are utilized. (Image: multigpu_inductor_annotated)

gs-olive commented 1 year ago

Hi - in a very recent PR, we added the capability to specify device as a compilation argument. Could you try specifying the device either via string or torch.device object to each compilation, and see if that fixes it?
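
One possible reading of the suggestion above is to compile each pipeline stage separately and pass its device explicitly. The sketch below is an assumption-heavy illustration, not a confirmed fix: it assumes the installed torch_tensorrt accepts device as a string (or torch.device) for ir="torch_compile", as the comment describes, and compile_stage and TwoStagePipeline are hypothetical names introduced only for this example.

```python
from typing import Any, Callable


def compile_stage(stage: Any, example_input: Any, device: str) -> Any:
    """Compile one pipeline stage with an explicit target device.

    Assumption: torch_tensorrt.compile accepts `device` as a string or
    torch.device for ir="torch_compile", per the comment above.
    """
    import torch_tensorrt  # lazy import: only needed when actually compiling

    return torch_tensorrt.compile(
        stage,
        ir="torch_compile",
        inputs=[example_input],
        device=device,
    )


class TwoStagePipeline:
    """Run two callables in sequence, moving the tensor to each stage's device."""

    def __init__(
        self,
        stage0: Callable[[Any], Any],
        stage1: Callable[[Any], Any],
        device0: str = "cuda:0",
        device1: str = "cuda:1",
    ) -> None:
        self.stage0 = stage0
        self.stage1 = stage1
        self.device0 = device0
        self.device1 = device1

    def __call__(self, x: Any) -> Any:
        x = self.stage0(x.to(self.device0))
        return self.stage1(x.to(self.device1))


# Usage sketch (needs two GPUs plus torch and torch_tensorrt). `front` and
# `back` are hypothetical nn.Modules holding the cuda:0 and cuda:1 layers:
#   trt_front = compile_stage(front, inp, "cuda:0")
#   mid = trt_front(inp).to("cuda:1")
#   trt_back = compile_stage(back, mid, "cuda:1")
#   pipeline = TwoStagePipeline(trt_front, trt_back)
#   out = pipeline(inp)
```

Splitting the model into one module per device keeps each compilation scoped to a single GPU, so the device argument is unambiguous for every engine. Whether the runtime then keeps each engine on its own GPU would still need to be checked against a profiler trace.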