thinvy / DepthAnythingTensorrtDeploy

NVIDIA TensorRT deployment of Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data.
Apache License 2.0

TensorRT acceleration is not significant #3

Open zzzzzyh111 opened 3 months ago

zzzzzyh111 commented 3 months ago

Thank you for your excellent work! Recently I have been trying to accelerate Depth Anything with TensorRT on a Jetson Orin NX, but I found that inference with the converted .trt file is not noticeably faster than with the .onnx file, and is in fact slightly slower:

ONNX Inference Time: 2.7s per image
TRT Inference Time: 3.0s per image

The library versions are as follows:

- JetPack: 5.1
- CUDA: 11.4.315
- cuDNN: 8.6.0.166
- TensorRT: 8.5.2.2
- VPI: 2.2.4
- Vulkan: 1.3.204
- OpenCV: 4.5.4 - with CUDA: NO
- torch: 2.1.0
- torchvision: 0.16.0
- onnx: 1.16.1
- onnxruntime: 1.8.0

The code that converts the .pth file to an ONNX file is as follows:

import torch
from zoedepth.utils.config import get_config    # from the ZoeDepth codebase
from zoedepth.models.builder import build_model  # from the ZoeDepth codebase

model_name = "zoedepth"
pretrained_resource = "local::./checkpoints/ZoeDepthIndoor_05-Jun_15-11-ebbebc6c1002_best.pt"
dataset = None
overwrite = {"pretrained_resource": pretrained_resource}
config = get_config(model_name, "eval", dataset, **overwrite)
model = build_model(config)
model.eval()

dummy_input = torch.randn(1, 3, 392, 518)
_ = model(dummy_input)  # sanity-check forward pass before export

torch.onnx.export(model, dummy_input, "ZoeDepth_indoor.onnx", verbose=True)
torch.onnx.export(
    model,
    dummy_input,
    "./checkpoints/ZoeDepth_indoor_jetson.onnx",
    opset_version=11,
    input_names=["input"],
    output_names=["output"],
)
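
For context, below is a minimal sketch of how the ONNX latency can be measured with the first (warm-up) run excluded, so that session creation and graph optimization are not counted; this is only illustrative and not necessarily the exact script behind the 2.7 s figure above.

import time
import numpy as np
import onnxruntime as ort

# Assumes the exported file from above and a fixed 1x3x392x518 input
sess = ort.InferenceSession("./checkpoints/ZoeDepth_indoor_jetson.onnx")
dummy = np.random.randn(1, 3, 392, 518).astype(np.float32)

# Warm-up: the first call pays for graph optimization and CUDA initialization
sess.run(None, {"input": dummy})

n_runs = 10
start = time.perf_counter()
for _ in range(n_runs):
    sess.run(None, {"input": dummy})
print("mean ONNX Runtime latency: %.3f s" % ((time.perf_counter() - start) / n_runs))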

The function that converts the ONNX file to a TensorRT engine is as follows:

import tensorrt as trt
from pathlib import Path

def build_engine(onnx_file_path):
    onnx_file_path = Path(onnx_file_path)
    # ONNX to TensorRT
    logger = trt.Logger(trt.Logger.VERBOSE)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open(onnx_file_path, "rb") as model:
        if not parser.parse(model.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            raise ValueError('Failed to parse the ONNX model.')

    # Set up the builder config
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # FP16
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)  # 2 GB

    serialized_engine = builder.build_serialized_network(network, config)

    with open(onnx_file_path.with_suffix(".trt"), "wb") as f:
        f.write(serialized_engine)
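
For completeness, a minimal sketch of loading the serialized engine back before running inference; the .trt path mirrors the with_suffix call above, and this is illustrative rather than the exact loading code used.

import tensorrt as trt

def load_engine(trt_file_path):
    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(trt_file_path, "rb") as f:
        # Deserialize the engine that build_engine() wrote to disk
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine("./checkpoints/ZoeDepth_indoor_jetson.trt")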

The function that runs inference with the TRT engine is as follows:

import numpy as np
import torch
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # creates the CUDA context used by the allocations below

def infer_trt(engine, input_image):
    input_image = input_image.cpu().numpy().astype(np.float32)
    context = engine.create_execution_context()
    height, width = input_image.shape[2], input_image.shape[3]
    output_shape = (1, 1, height, width)

    # Allocate pagelocked host memory
    h_input = cuda.pagelocked_empty(trt.volume((1, 3, height, width)), dtype=np.float32)
    h_output = cuda.pagelocked_empty(trt.volume((1, 1, height, width)), dtype=np.float32)

    # Allocate device memory
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)

    bindings = [int(d_input), int(d_output)]
    stream = cuda.Stream()

    # Function to perform inference
    def perform_inference(images_np):
        np.copyto(h_input, images_np.ravel())
        cuda.memcpy_htod_async(d_input, h_input, stream)
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        stream.synchronize()
        return torch.tensor(h_output).view(output_shape)

    # Run inference on original images
    pred1 = perform_inference(input_image)

    # Run inference on flipped images
    flipped_images_np = np.flip(input_image, axis=3)
    pred2 = perform_inference(flipped_images_np)
    pred2 = torch.flip(pred2, [3])
    mean_pred = 0.5 * (pred1 + pred2)
    return mean_pred
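
One caveat about the numbers above: the execution context, page-locked host buffers, and device memory are allocated inside every call to infer_trt, so that setup is counted in the 3.0 s per image. Below is a rough, illustrative sketch that times only the execute_async_v2 call once the context and buffers already exist; the context, bindings, and stream arguments follow the names used above.

import time

def time_trt_execution(context, bindings, stream, n_runs=10):
    # Warm-up run to exclude one-time CUDA/cuDNN initialization
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    stream.synchronize()

    start = time.perf_counter()
    for _ in range(n_runs):
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    stream.synchronize()
    return (time.perf_counter() - start) / n_runs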

Apart from a few warnings during the ONNX export, everything runs normally, but the final result is still unsatisfactory. Looking forward to your reply!

thinvy commented 3 months ago

On Linux arm64, the onnxruntime-gpu package installed by default via pip is itself accelerated through TensorRT (see https://onnxruntime.ai/getting-started). If it was installed that way, its performance is essentially the same as simply exporting a model and running it directly with TensorRT, especially for Python inference.
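
A quick way to check whether that is the case is to print the execution providers that the installed onnxruntime actually resolves to; a small sketch, assuming the ONNX file exported in the issue above:

import onnxruntime as ort

print("available providers:", ort.get_available_providers())

sess = ort.InferenceSession("./checkpoints/ZoeDepth_indoor_jetson.onnx")
print("providers used by this session:", sess.get_providers())
# If 'TensorrtExecutionProvider' is listed first, the ONNX numbers were
# effectively going through TensorRT already, which would explain the
# similar latency.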

In addition, compared with desktop GPUs, TensorRT's INT8 inference on Orin gives a much more noticeable speedup over FP16. For real deployment, it is best to apply INT8 quantization or mixed INT8/FP16 quantization.
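
As a rough illustration of that suggestion (not code from this repo), INT8 can be enabled on the same builder config used in build_engine above; the calibrator is the part that needs real work, so it is left as an optional argument here.

import tensorrt as trt

def configure_int8(builder, config, calibrator=None):
    # A real deployment would pass an IInt8EntropyCalibrator2 subclass fed with
    # representative images; without calibration the INT8 accuracy is not meaningful.
    if not builder.platform_has_fast_int8:
        print("INT8 is not supported on this platform, keeping FP16 only")
        return config
    config.set_flag(trt.BuilderFlag.INT8)
    config.set_flag(trt.BuilderFlag.FP16)  # allow FP16 fallback for layers that resist INT8
    if calibrator is not None:
        config.int8_calibrator = calibrator
    return config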

zzzzzyh111 commented 3 months ago

Thank you for your answer, but I found that the TensorRT and .pth inference speeds are also basically the same, so:

1. I suspect that the data loading and preprocessing parts of the code take up most of the time. I will print the time of each step to find out which part takes the longest (a rough per-stage timing sketch is included below).
2. The suggestion to use INT8 quantization is a good one, but my task requires high accuracy, so I will only consider it if the current situation cannot be improved.
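
A minimal per-stage timing sketch of the kind described in point 1; load_image and preprocess are hypothetical stand-ins for the actual data pipeline, while infer_trt refers to the function shown earlier.

import time

def timed(label, fn, *args, **kwargs):
    # Wrap a pipeline stage and print how long it took
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result

# Example usage inside the evaluation loop:
# image = timed("data loading", load_image, path)
# tensor = timed("preprocessing", preprocess, image)
# depth = timed("TRT inference", infer_trt, engine, tensor)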

Thanks for your quick reply!