spacewalk01 / depth-anything-tensorrt

TensorRT implementation of Depth-Anything V1, V2
https://depth-anything.github.io/
MIT License

Slow TensorRT Inference Speed on Jetson Orin NX #35

Open zzzzzyh111 opened 3 months ago

zzzzzyh111 commented 3 months ago

Thank you for your excellent work! :satisfied: :satisfied: :satisfied:

Recently, I have been trying to use TensorRT to accelerate Depth Anything on a Jetson Orin NX. However, I found that the inference speed of the converted TRT engine is not significantly better than that of the ONNX model; in fact, it is slightly slower. Specifically:

ONNX Inference Time: 2.7s per image
TRT Inference Time: 3.0s per image

The library versions are as follows:

- JetPack: 5.1
- CUDA: 11.4.315
- cuDNN: 8.6.0.166
- TensorRT: 8.5.2.2
- VPI: 2.2.4
- Vulkan: 1.3.204
- OpenCV: 4.5.4 - with CUDA: NO
- torch: 2.1.0
- torchvision: 0.16.0
- onnx: 1.16.1
- onnxruntime: 1.8.0

The code to convert the .pth file to an ONNX file is as follows:

model_name = "zoedepth"
pretrained_resource = "local::./checkpoints/ZoeDepthIndoor_05-Jun_15-11-ebbebc6c1002_best.pt"
dataset = None
overwrite = {"pretrained_resource": pretrained_resource}
config = get_config(model_name, "eval", dataset, **overwrite)
model = build_model(config)
model.eval() 
dummy_input = torch.randn(1, 3, 392, 518)
 _ = model(dummy_input)
torch.onnx.export(model, dummy_input, "ZoeDepth_indoor.onnx", verbose=True)
torch.onnx.export(
            model,
             dummy_input, 
             "./checkpoints/ZoeDepth_indoor_jetson.onnx", 
             opset_version=11, 
             input_names=["input"], 
             output_names=["output"], 
)
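As a sanity check (a minimal sketch, assuming the onnx and onnxruntime packages listed above), the exported file can be validated and run once before building the engine:

import numpy as np
import onnx
import onnxruntime as ort

# Structural validation of the exported graph
onnx_model = onnx.load("./checkpoints/ZoeDepth_indoor_jetson.onnx")
onnx.checker.check_model(onnx_model)

# One forward pass with onnxruntime to confirm the input/output names and shapes
session = ort.InferenceSession("./checkpoints/ZoeDepth_indoor_jetson.onnx")
dummy = np.random.randn(1, 3, 392, 518).astype(np.float32)
outputs = session.run(None, {"input": dummy})
print([o.shape for o in outputs])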

The function to convert the ONNX file to a TRT file is as follows:

import tensorrt as trt
from pathlib import Path

def build_engine(onnx_file_path):
    onnx_file_path = Path(onnx_file_path)
    # Parse the ONNX model into a TensorRT network definition
    logger = trt.Logger(trt.Logger.VERBOSE)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open(onnx_file_path, "rb") as model:
        if not parser.parse(model.read()):
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            raise ValueError("Failed to parse the ONNX model.")

    # Set up the builder config
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # Build with FP16 precision
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 2 << 30)  # 2 GB workspace

    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("Engine build failed.")

    # Save the serialized engine next to the ONNX file
    with open(onnx_file_path.with_suffix(".trt"), "wb") as f:
        f.write(serialized_engine)
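For completeness, a minimal sketch (not part of the original script) of loading the serialized engine back; infer_trt below expects such a deserialized engine object:

import tensorrt as trt

def load_engine(engine_file_path):
    # Deserialize a previously built TensorRT engine from disk
    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(engine_file_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())

engine = load_engine("./checkpoints/ZoeDepth_indoor_jetson.trt")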

The function to perform inference using the TRT file is as follows:

import numpy as np
import torch
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # Initializes the CUDA context

def infer_trt(engine, input_image):
    input_image = input_image.cpu().numpy().astype(np.float32)
    context = engine.create_execution_context()
    height, width = input_image.shape[2], input_image.shape[3]
    output_shape = (1, 1, height, width)

    # Allocate page-locked host memory for input and output
    h_input = cuda.pagelocked_empty(trt.volume((1, 3, height, width)), dtype=np.float32)
    h_output = cuda.pagelocked_empty(trt.volume(output_shape), dtype=np.float32)

    # Allocate device memory
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)

    bindings = [int(d_input), int(d_output)]
    stream = cuda.Stream()

    def perform_inference(images_np):
        # Copy to pinned host memory, run the engine, and copy the result back
        np.copyto(h_input, images_np.ravel())
        cuda.memcpy_htod_async(d_input, h_input, stream)
        context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        stream.synchronize()
        return torch.tensor(h_output).view(output_shape)

    # Run inference on the original images
    pred1 = perform_inference(input_image)

    # Run inference on horizontally flipped images and average the two predictions
    flipped_images_np = np.flip(input_image, axis=3)
    pred2 = perform_inference(flipped_images_np)
    pred2 = torch.flip(pred2, [3])
    mean_pred = 0.5 * (pred1 + pred2)
    return mean_pred
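To see whether the 3.0 s per image is dominated by the engine itself or by Python-side preprocessing and host/device copies, one could warm the engine up and time only repeated calls to infer_trt. A minimal sketch, assuming the engine has been deserialized as above and the 392x518 export resolution:

import time
import torch

dummy = torch.randn(1, 3, 392, 518)

# Warm-up: the first TensorRT executions include one-time initialization costs
for _ in range(3):
    infer_trt(engine, dummy)

# Time the steady-state inference path only
n_runs = 20
start = time.perf_counter()
for _ in range(n_runs):
    infer_trt(engine, dummy)
elapsed = (time.perf_counter() - start) / n_runs
print(f"Average TensorRT inference time: {elapsed * 1000:.1f} ms per image")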

The code runs without any issues, apart from some warnings during the ONNX conversion, but the final results are still unsatisfactory. Looking forward to your response! :heart: :heart: :heart:

spacewalk01 commented 3 months ago

I have never seen TensorRT run slower than ONNX before.

zzzzzyh111 commented 3 months ago

Thanks for your prompt reply! Am I correct in understanding that, as long as nothing goes wrong during the conversion from ONNX to TRT, the engine should in theory be faster?

spacewalk01 commented 3 months ago

Yes. Will you try the TensorRT C++ version?

zzzzzyh111 commented 3 months ago

Since I'm unfamiliar with C++, I'm currently focusing on the Python version and using your work as a reference. If our previous discussion is correct, then data loading and preprocessing in my script may be consuming most of the time. I will keep investigating the cause. If everything looks correct but the speed still does not improve, I will try the C++ version and let you know.

Thank you again for your prompt reply!

akashkrishnapm commented 2 months ago

Was it solved?