Closed · HolyWu closed this 2 months ago
The same happens for aten.leaky_relu:
import torch
import torch.nn as nn
import torch_tensorrt


class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.m = nn.LeakyReLU()

    def forward(self, x):
        return self.m(x)


model = MyModule().eval().cuda().half()
inputs = [torch.randn((1, 3, 4, 4), dtype=torch.half, device="cuda")]

optimized_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    enabled_precisions={torch.half},
    debug=True,
    min_block_size=1,
)
DEBUG:torch_tensorrt.dynamo.partitioning._global_partitioner:
Supported Nodes:
- torch.ops.aten._to_copy.default + Operator Count: 2
- torch.ops.aten.gt.Scalar + Operator Count: 1
- torch.ops.aten.mul.Tensor + Operator Count: 1
- torch.ops.aten.where.self + Operator Count: 1
DEBUG:torch_tensorrt.dynamo.partitioning._global_partitioner:
All Nodes Supported
DEBUG:torch_tensorrt.dynamo._compiler:Detected support for 5 operators out of 5 in subgraph.
DEBUG:torch_tensorrt.dynamo.partitioning._adjacency_partitioner:
Number of TensorRT-Accelerated Engines Generated: 1
DEBUG:torch_tensorrt.dynamo.partitioning._adjacency_partitioner:
Supported Nodes:
- torch.ops.aten._to_copy.default + Operator Count: 2
- torch.ops.aten.gt.Scalar + Operator Count: 1
- torch.ops.aten.mul.Tensor + Operator Count: 1
- torch.ops.aten.where.self + Operator Count: 1
DEBUG:torch_tensorrt.dynamo.partitioning._adjacency_partitioner:
All Nodes Supported
++++++++++++++++++++++++++++++++++++++++++++++++++ Dry-Run Results for Graph ++++++++++++++++++++++++++++++++++++++++++++++++++
The graph consists of 5 Total Operators, of which 5 operators are supported, 100.0% coverage
Compiled with: CompilationSettings(precision=torch.float16, debug=True, workspace_size=0, min_block_size=1, torch_executed_ops=set(), pass_through_build_failures=False, max_aux_streams=None, version_compatible=False, optimization_level=None, use_python_runtime=False, truncate_long_and_double=False, use_fast_partitioner=True, enable_experimental_decompositions=False, device=Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation=False, disable_tf32=False, sparse_weights=False, refit=False, engine_capability=<EngineCapability.DEFAULT: 0>, num_avg_timing_iters=1, dla_sram_size=1048576, dla_local_dram_size=1073741824, dla_global_dram_size=536870912, dryrun=False, hardware_compatible=False, output_format='exported_program')
Graph Structure:
Inputs: List[Tensor: (1, 3, 4, 4)@float16]
...
TRT Engine #1 - Submodule name: _run_on_acc_0
Engine Inputs: List[Tensor: (1, 3, 4, 4)@float16]
Number of Operators in Engine: 5
Engine Outputs: Tensor: (1, 3, 4, 4)@float16
...
Outputs: List[Tensor: (1, 3, 4, 4)@float16]
------------------------- Aggregate Stats -------------------------
Average Number of Operators per TRT Engine: 5.0
Most Operators in a TRT Engine: 5
********** Recommendations **********
- For minimal graph segmentation, select min_block_size=5 which would generate 1 TRT engine(s)
- The current level of graph segmentation is equivalent to selecting min_block_size=5 which generates 1 TRT engine(s)
WARNING: [Torch-TensorRT] - Using default stream in enqueue()/enqueueV2()/enqueueV3() may lead to performance issues due to additional cudaDeviceSynchronize() calls by TensorRT to ensure correct synchronizations. Please use non-default stream instead.
C:\Python311\Lib\site-packages\torch\export\exported_program.py:740: UserWarning: Unable to execute the generated python source code from the graph. The graph module will no longer be directly callable, but you can still run the ExportedProgram, and if needed, you can run the graph module eagerly using torch.fx.Interpreter.
warnings.warn(
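For reference, the nodes in the log match the elementwise definition of LeakyReLU, i.e. where(x > 0, x, negative_slope * x); the two _to_copy nodes are presumably just the fp16 casts. A minimal eager-mode sketch of that decomposed form (plain PyTorch on CPU/fp32 for simplicity):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 4, 4)
m = nn.LeakyReLU()  # default negative_slope=0.01

# Same ops as in the log above: gt.Scalar -> mul.Tensor -> where.self
decomposed = torch.where(x > 0, x, x * 0.01)

print(torch.allclose(m(x), decomposed))  # should print True

A similar decomposition shows up for bilinear interpolation (F.interpolate with align_corners=True):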
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_tensorrt


class MyModule(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)


model = MyModule().eval().cuda().half()
inputs = [
    torch.randn((1, 3, 128, 128), dtype=torch.half, device="cuda"),
]

optimized_model = torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    enabled_precisions={torch.half},
    debug=True,
    min_block_size=1,
)
DEBUG:torch_tensorrt.dynamo.partitioning._global_partitioner:
Supported Nodes:
- torch.ops.aten._to_copy.default + Operator Count: 2
- torch.ops.aten.index.Tensor + Operator Count: 4
- torch.ops.aten.sub.Tensor + Operator Count: 3
- torch.ops.aten.mul.Tensor + Operator Count: 3
- torch.ops.aten.add.Tensor + Operator Count: 3
DEBUG:torch_tensorrt.dynamo.partitioning._global_partitioner:
All Nodes Supported
DEBUG:torch_tensorrt.dynamo._compiler:Detected support for 15 operators out of 15 in subgraph.
DEBUG:torch_tensorrt.dynamo.partitioning._adjacency_partitioner:
Number of TensorRT-Accelerated Engines Generated: 1
DEBUG:torch_tensorrt.dynamo.partitioning._adjacency_partitioner:
Supported Nodes:
- torch.ops.aten._to_copy.default + Operator Count: 2
- torch.ops.aten.index.Tensor + Operator Count: 4
- torch.ops.aten.sub.Tensor + Operator Count: 3
- torch.ops.aten.mul.Tensor + Operator Count: 3
- torch.ops.aten.add.Tensor + Operator Count: 3
DEBUG:torch_tensorrt.dynamo.partitioning._adjacency_partitioner:
All Nodes Supported
++++++++++++++++++++++++++++++++++++++++++++++++++ Dry-Run Results for Graph ++++++++++++++++++++++++++++++++++++++++++++++++++
The graph consists of 15 Total Operators, of which 15 operators are supported, 100.0% coverage
Compiled with: CompilationSettings(precision=torch.float16, debug=True, workspace_size=0, min_block_size=1, torch_executed_ops=set(), pass_through_build_failures=False, max_aux_streams=None, version_compatible=False, optimization_level=None, use_python_runtime=False, truncate_long_and_double=True, use_fast_partitioner=True, enable_experimental_decompositions=False, device=Device(type=DeviceType.GPU, gpu_id=0), require_full_compilation=False, disable_tf32=False, sparse_weights=False, refit=False, engine_capability=<EngineCapability.DEFAULT: 0>, num_avg_timing_iters=1, dla_sram_size=1048576, dla_local_dram_size=1073741824, dla_global_dram_size=536870912, dryrun=False, hardware_compatible=False, output_format='exported_program')
Graph Structure:
Inputs: List[Tensor: (1, 3, 128, 128)@float16]
...
TRT Engine #1 - Submodule name: _run_on_acc_0
Engine Inputs: List[Tensor: (1, 3, 128, 128)@float16]
Number of Operators in Engine: 15
Engine Outputs: Tensor: (1, 3, 256, 256)@float16
...
Outputs: List[Tensor: (1, 3, 256, 256)@float16]
------------------------- Aggregate Stats -------------------------
Average Number of Operators per TRT Engine: 15.0
Most Operators in a TRT Engine: 15
********** Recommendations **********
- For minimal graph segmentation, select min_block_size=15 which would generate 1 TRT engine(s)
- The current level of graph segmentation is equivalent to selecting min_block_size=15 which generates 1 TRT engine(s)
WARNING: [Torch-TensorRT] - Using default stream in enqueue()/enqueueV2()/enqueueV3() may lead to performance issues due to additional cudaDeviceSynchronize() calls by TensorRT to ensure correct synchronizations. Please use non-default stream instead.
C:\Python311\Lib\site-packages\torch\export\exported_program.py:740: UserWarning: Unable to execute the generated python source code from the graph. The graph module will no longer be directly callable, but you can still run the ExportedProgram, and if needed, you can run the graph module eagerly using torch.fx.Interpreter.
warnings.warn(
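For what it's worth, the decomposed graph can also be inspected outside Torch-TensorRT via torch.export. Below is only a sketch, assuming a PyTorch version where ExportedProgram.run_decompositions() is available; the exact set of aten ops depends on which decomposition table is applied, and Torch-TensorRT registers its own table, so the printout may not match the lowering above exactly:

import torch
import torch.nn as nn
import torch.nn.functional as F


class MyModule(nn.Module):
    def forward(self, x):
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=True)


x = torch.randn(1, 3, 128, 128)
ep = torch.export.export(MyModule().eval(), (x,))
# Apply the default decomposition table (not necessarily the one Torch-TensorRT uses)
# and print which aten ops remain in the lowered graph.
ep = ep.run_decompositions()
print(ep.graph_module.graph)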
❓ Question
From the debug log, it seems that the aten.grid_sampler_2d operator gets decomposed into several lower-level operators. But isn't there a corresponding converter that should be used instead?

What you have already tried
Environment

- How you installed PyTorch (conda, pip, libtorch, source): pip
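As a side note, the per-operator support summary above can apparently be produced without building an engine by passing dryrun=True (the setting shows up as dryrun=False in the CompilationSettings printout). A sketch reusing the LeakyReLU module from the first snippet; I'm assuming dryrun is accepted as a keyword argument by torch_tensorrt.compile in this version:

import torch
import torch.nn as nn
import torch_tensorrt


class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        self.m = nn.LeakyReLU()

    def forward(self, x):
        return self.m(x)


model = MyModule().eval().cuda().half()
inputs = [torch.randn((1, 3, 4, 4), dtype=torch.half, device="cuda")]

# dryrun=True (assumed to be supported here) should only run the lowering and
# partitioning analysis and print the Dry-Run Results, without building a
# TensorRT engine.
torch_tensorrt.compile(
    model,
    ir="dynamo",
    inputs=inputs,
    enabled_precisions={torch.half},
    min_block_size=1,
    dryrun=True,
)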