rockchip-linux / rknn-toolkit2


RKNN NPU Depth Anything Conversion #296

Closed Nik057 closed 2 months ago

Nik057 commented 3 months ago

Hi!

I tried a conversion of this model starting from torch to ONNX to RKNN to use that on OrangePi 5 NPUs (RK3588s).

I noticed that going from ONNX to RKNN is only possible with opset <= 16; newer opsets use LayerNormalization layers, which are not supported:

E RKNN: [00:22:08.456] Op type:LayerNormalization, name: LayerNormalization:/blocks.0/norm1/LayerNormalization, fallback cpu failed. please try updating to the latest version of the toolkit2 and runtime from: https://console.zbox.filez.com/l/I00fc3 (PWD: rknn)
E RKNN: [00:22:08.545] Unsupport LayerNormalization! Please lower the OPSET version of the onnx model to below 16.

Using a model exported with opset <= 16 instead gives lower performance on the NPU than running the ONNX model on the CPU. Is there a way to convert this model to RKNN and get optimal performance?
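For reference, a quick way to check which opset an exported ONNX file actually declares (a minimal sketch, assuming the onnx package is installed and the output path used below):

import onnx

# Print the opset domain(s) and version(s) recorded in the model header.
m = onnx.load("weights/depth_anything_vits14_420x640_op19.onnx")
print([(imp.domain or "ai.onnx", imp.version) for imp in m.opset_import])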

Conversion code used:

PTH TO ONNX:

import argparse

import torch
print(torch.__version__)
from onnx import load_model, save_model
from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference
from depth_anything.dpt import DPT_DINOv2
from onnxsim import simplify
from torch.nn.utils import prune


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", type=str, choices=["s", "b", "l"], default="s",
                        help="Model size variant. Available options: 's', 'b', 'l'.")
    parser.add_argument("--output", type=str, default=None, required=False,
                        help="Path to save the ONNX model.")
    return parser.parse_args()


def export_onnx(model: str, output: str = None):
    if output is None:
        output = f"weights/depth_anything_vit{model}14_420x640_op19.onnx"

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    image = torch.randn(1, 3, 420, 640).to(device)

    if model == "s":
        depth_anything = DPT_DINOv2(encoder="vits", features=64, out_channels=[48, 96, 192, 384])
    elif model == "b":
        depth_anything = DPT_DINOv2(encoder="vitb", features=128, out_channels=[96, 192, 384, 768])
    else:  # model == "l"
        depth_anything = DPT_DINOv2(encoder="vitl", features=256, out_channels=[256, 512, 1024, 1024])

    depth_anything.to(device).load_state_dict(
        torch.hub.load_state_dict_from_url(
            f"https://huggingface.co/spaces/LiheYoung/Depth-Anything/resolve/main/checkpoints/depth_anything_vit{model}14.pth",
            map_location=device,
        ),
        strict=True,
    )
    depth_anything.eval()

    torch.onnx.export(depth_anything, image, output,
                      input_names=["image"], output_names=["depth"], opset_version=19)

    # Simplify the exported graph in place.
    model_onnx = load_model(output)
    model_simplified, check = simplify(model_onnx)
    assert check, "Simplification failed"
    save_model(model_simplified, output)


if __name__ == "__main__":
    args = parse_args()
    export_onnx(**vars(args))
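As a side note, a minimal sanity check of the exported model (assuming onnxruntime is installed and the default output path above) could look like this before moving on to RKNN:

import numpy as np
import onnxruntime as ort

# Run the simplified ONNX model once on random input to confirm it loads and infers on CPU.
sess = ort.InferenceSession("weights/depth_anything_vits14_420x640_op19.onnx",
                            providers=["CPUExecutionProvider"])
depth = sess.run(["depth"], {"image": np.random.randn(1, 3, 420, 640).astype(np.float32)})[0]
print(depth.shape)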

ONNX TO RKNN:

import numpy as np
from rknn.api import RKNN
import cv2

QUANTIZE_ON = False
yolo_name = "depth_anything_vits14_420x640_op19"
RKNN_MODEL = f'{yolo_name}.rknn'

rknn = RKNN(verbose=True)

rknn.config(mean_values=[[123.675, 116.28, 103.53]], std_values=[[58.395, 57.12, 57.375]], target_platform='rk3588')

print('--> Loading model')
ret = rknn.load_onnx(model=f'{yolo_name}.onnx')
if ret != 0:
    print('Load model failed!')
    exit(ret)
print('done')

# Build model
print('--> Building model')
ret = rknn.build(do_quantization=QUANTIZE_ON)
if ret != 0:
    print('Build model failed!')
    exit(ret)
print('done')

# Export RKNN model
print('--> Export rknn model')
ret = rknn.export_rknn(RKNN_MODEL)
if ret != 0:
    print('Export rknn model failed!')
    exit(ret)
print('done')

# Init runtime environment
print('--> Init runtime environment')
ret = rknn.init_runtime()
if ret != 0:
    print('Init runtime environment failed!')
    exit(ret)
print('done')
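The script above stops after initializing the runtime. A rough sketch of how a single frame could then be run through it (continuing the script, so rknn and cv2 are already available; the image path, preprocessing, and resize are placeholders, not part of the original code):

# Hypothetical follow-up: run one inference on the initialized runtime.
img = cv2.imread('test.jpg')             # placeholder input image
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (640, 420))        # width x height matching the 1x3x420x640 export
outputs = rknn.inference(inputs=[img])
print(outputs[0].shape)
rknn.release()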

Thank you in advance for any advice

happyme531 commented 3 months ago

Using a model exported with opset <= 16 instead gives lower performance on the NPU than running the ONNX model on the CPU. Is there a way to convert this model to RKNN and get optimal performance?

It is actually very common to get an "RKNN NPU model" that is slower than the CPU ¯\_(ツ)_/¯ because of the limitations of the NPU and its driver. The opset version is not the problem.

zen-xingle commented 3 months ago

Hello, thanks for your issue report.

RKNN-Toolkit2 (2.0.0beta) was released recently, and I tested depth_anything_vits with a 1x3x420x644 input. The inference time is almost 1 s per frame at int8. I expect it can be improved to about 0.5 s at int8 in the next version.

May I ask what performance you expect, and what the CPU performance is?
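For a like-for-like comparison, per-frame latency can be measured with a simple loop against the runtime initialized in the script above; a minimal sketch (the warm-up and iteration counts are arbitrary choices, not from this thread):

import time
import numpy as np

# Time repeated inferences on one synthetic frame, skipping a few warm-up runs.
frame = np.random.randint(0, 255, (420, 640, 3), dtype=np.uint8)
for _ in range(3):                           # warm-up
    rknn.inference(inputs=[frame])
n = 20
t0 = time.time()
for _ in range(n):
    rknn.inference(inputs=[frame])
print(f"avg per-frame latency: {(time.time() - t0) / n * 1000:.1f} ms")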

Nik057 commented 3 months ago

Hello @zen-xingle, thanks for the answer. I was hoping RKNN would be faster than ONNX, and it's great to see that the latest version has doubled the inference speed compared to before. That's a big win!

Right now, CPU performance seems pretty much the same, maybe a tad faster, around 600ms for a 320x420 resolution. But RKNN was taking about 1700ms with version 1.6.0, and now with the 2.0 release, it's down to around 750ms. It's still not quite as fast as the CPU, but it's a massive improvement.

Excited to see what comes next! Thanks a bunch for the update.