Closed: austinapatel closed this issue 1 year ago
Hi - in a very recent PR, we added the capability to specify device
as a compilation argument. Could you try specifying the device either via string or torch.device
object to each compilation, and see if that fixes it?
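For reference, a minimal sketch of what that might look like, assuming the new device keyword accepts either a string or a torch.device as described above (the stage modules, layer sizes, and input shapes below are placeholders, not taken from the original script):

    import torch
    import torch.nn as nn
    import torch_tensorrt

    # Placeholder two-stage split; layer sizes and shapes are illustrative only.
    stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).eval().to("cuda:0")
    stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).eval().to("cuda:1")

    trt_stage0 = torch_tensorrt.compile(
        stage0,
        inputs=[torch_tensorrt.Input((8, 1024))],
        device="cuda:0",                 # device given as a string
    )
    trt_stage1 = torch_tensorrt.compile(
        stage1,
        inputs=[torch_tensorrt.Input((8, 1024))],
        device=torch.device("cuda:1"),   # device given as a torch.device
    )

    # Intermediate activations still have to be moved between GPUs explicitly.
    x = torch.randn(8, 1024, device="cuda:0")
    out = trt_stage1(trt_stage0(x).to("cuda:1"))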
Bug Description
I'm experimenting with using TorchTRT on a model partitioned across two GPUs using pipeline-parallelism techniques. Half of my network is on GPU0 and the second half is on GPU1. When executing the model in PyTorch eager mode, I see kernels for each layer executing on their assigned GPU as expected. When I compile my network with TorchTRT, the whole network is moved to only one of the GPUs and executed on that device, rather than being split across GPUs. This limits the ability to use TorchTRT with very large models that don't fit within the memory of a single GPU.
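For context, the eager-mode setup looks roughly like the following (a minimal sketch with made-up layer sizes, not the actual model from the attached script):

    import torch
    import torch.nn as nn

    # Minimal pipeline-parallel module: first half on GPU0, second half on GPU1.
    class TwoGPUModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.first_half = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
            self.second_half = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:1")

        def forward(self, x):
            x = self.first_half(x.to("cuda:0"))    # kernels run on GPU0
            x = self.second_half(x.to("cuda:1"))   # kernels run on GPU1
            return x

    eager_model = TwoGPUModel().eval()
    inp = torch.randn(8, 1024, device="cuda:0")
    with torch.no_grad():
        out = eager_model(inp)   # in eager mode, work is split across both GPUs as expected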
To Reproduce
Steps to reproduce the behavior:
nsys profile --trace cuda,nvtx --sample cpu --force-overwrite true --output profiling_results/tmp --gpu-metrics-device=all --gpu-metrics-frequency=20000 python torchtrt_multigpu_issue.py
torchtrt_multigpu_issue.py
Output:
Note: if you are using a TorchTRT debug build you might get an error about profiling already being enabled. In this case you can either:
- replace time_eager, time_trt = profile(lambda: run_two_models(eager_model, trt_model, inp)) with time_eager, time_trt = run_two_models(eager_model, trt_model, inp), or
- switch the build configuration from dbg to opt and re-build TorchTRT.
Expected behavior
I was hoping that separate TRT engines would be created and executed on each GPU, rather than having everything moved to a single GPU with all engines executed there. I'm curious whether there is a workaround for this, or whether there is a more fundamental limitation that prevents this from being possible. When using the PyTorch Inductor backend, I am able to create compiled components that leverage both GPUs, which is the behavior I expected.
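For comparison, the Inductor path that does preserve the split looks roughly like this (again a sketch with placeholder modules, not the exact code from the attached script):

    import torch
    import torch.nn as nn

    # Each stage lives on its own GPU and is compiled separately with Inductor.
    stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).eval().to("cuda:0")
    stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).eval().to("cuda:1")

    compiled_stage0 = torch.compile(stage0, backend="inductor")
    compiled_stage1 = torch.compile(stage1, backend="inductor")

    x = torch.randn(8, 1024, device="cuda:0")
    with torch.no_grad():
        out = compiled_stage1(compiled_stage0(x).to("cuda:1"))  # both GPUs stay utilized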
Environment
Torch-TensorRT Version: 65c6494ce3107d33b27fdb1630ac7982f8649382 (built from source)
PyTorch Version: 2.1.0a0+4136153 (commit 7682252cac9ed31055d2ae950cf6942dd311da73)
How you installed PyTorch (conda, pip, libtorch, source): source
Additional context
NSYS trace for using TorchTRT. Note that all layers are copied to one GPU and all engines execute on that GPU.
NSYS trace for using Torch Inductor. Note that the model parallelism is followed and both GPUs are utilized.