k2-fsa / sherpa-onnx

Speech-to-text, text-to-speech, speaker recognition, and VAD using next-gen Kaldi with onnxruntime without Internet connection. Support embedded systems, Android, iOS, Raspberry Pi, RISC-V, x86_64 servers, websocket server/client, C/C++, Python, Kotlin, C#, Go, NodeJS, Java, Swift, Dart, JavaScript, Flutter, Object Pascal, Lazarus, Rust
https://k2-fsa.github.io/sherpa/onnx/index.html
Apache License 2.0
3.16k stars 369 forks

whisper onnx convert to rknn #979

Closed HduHestin closed 3 months ago

HduHestin commented 3 months ago

Hello there! I have looked at the export script (https://github.com/k2-fsa/sherpa-onnx/blob/master/scripts/whisper/export-onnx.py). It is a nice way to use the whisper model on different platforms through the ONNX format. I am trying to convert the model to the rknn format (the model format used on RK devices) so that it can run on the rknpu, but I have run into some obstacles.

1. I exported the encoder and decoder successfully with export-onnx.py.

2. I am trying to write a script that converts onnx to rknn. In Netron I can see the structure of the onnx model, and the input appears to have a dynamic shape, [n_audio, 80, T], so I used dynamic_input when exporting to rknn (code below). The export succeeded and produced encoder.rknn, but something goes wrong when I run it on an RK3568 device. Are you interested in converting to rknn? I hope you can give me some advice.

image

image
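The dynamic input shape seen in Netron can also be read programmatically. The following is only a minimal sketch: `describe_dims`, `dynamic_axes`, and `input_shapes` are hypothetical helper names, and the stand-in dimensions below mimic the `[n_audio, 80, T]` input described above rather than being read from the real encoder. It relies on the fact that each ONNX dimension carries either a symbolic `dim_param` or a fixed `dim_value`.

```python
from types import SimpleNamespace


def describe_dims(dims):
    """Turn ONNX dimension entries into a list such as ['n_audio', 80, 'T'].

    Each entry is expected to look like onnx's TensorShapeProto.Dimension:
    dim_param holds a symbolic (dynamic) name, dim_value a fixed size.
    """
    shape = []
    for d in dims:
        name = getattr(d, "dim_param", "")
        value = int(getattr(d, "dim_value", 0))
        if name:
            shape.append(name)   # symbolic, i.e. dynamic
        elif value > 0:
            shape.append(value)  # fixed size
        else:
            shape.append("?")    # unknown, treated as dynamic
    return shape


def dynamic_axes(shape):
    """Indices of the axes that are not fixed integers."""
    return [i for i, s in enumerate(shape) if isinstance(s, str)]


def input_shapes(model_path):
    """Map each graph input name to its described shape (requires onnx)."""
    import onnx  # imported lazily so the helpers above work without onnx

    model = onnx.load(model_path)
    return {
        inp.name: describe_dims(inp.type.tensor_type.shape.dim)
        for inp in model.graph.input
    }


# Stand-in for the encoder input reported in the post: [n_audio, 80, T].
fake_dims = [
    SimpleNamespace(dim_param="n_audio", dim_value=0),
    SimpleNamespace(dim_param="", dim_value=80),
    SimpleNamespace(dim_param="T", dim_value=0),
]
shape = describe_dims(fake_dims)
print(shape, "dynamic axes:", dynamic_axes(shape))
```

On a real model, `input_shapes('base-models/base-encoder.onnx')` would report which axes need a concrete value in `dynamic_input`.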

from rknn.api import RKNN
import os
import onnx
import sys

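# Input shapes the converted model must accept. Only one shape is listed:
# [n_audio, n_mels, T] = [1, 80, 3000], i.e. a full 30-second whisper window.
dynamic_input=[
    [[1,80,3000]]
]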

model_path='base-models/base-encoder.onnx'
model = onnx.load(model_path)
onnx.checker.check_model(model)
print("The model is checked!")

# Create RKNN object
rknn = RKNN(verbose=False)

# Pre-process config
print('--> Config model')
rknn.config(target_platform='rk3568',
            dynamic_input=dynamic_input,
            )
print('done')

# Load model
print('--> Loading model')
ret = rknn.load_onnx(model=model_path)
if ret != 0:
    print('Load model failed!')
    exit(ret)
print('done')

# Build model
print('--> Building model')
ret = rknn.build(do_quantization=False)
if ret != 0:
    print('Build model failed!')
    exit(ret)
print('done')

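# Export rknn model
# Note: the model is built with do_quantization=False, so it is not int8
# despite the file name. The inference script below loads
# 'base-encoder-rk3568.rknn', so the two names need to be made to match.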
print('--> Export rknn model')
ret = rknn.export_rknn('base-encoder-3568-int8.rknn',gen_cpp_demo=True)
if ret != 0:
    print('Export rknn model failed!')
    exit(ret)
print('done')

# Release the RKNN object
rknn.release()

rknn.py:

import cv2
import numpy as np
import platform
from rknnlite.api import RKNNLite
import argparse
import base64
from typing import Tuple

import kaldi_native_fbank as knf
import onnxruntime as ort
import torch
import torchaudio

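# compute_features and OnnxModel come from test.py
# (presumably scripts/whisper/test.py in sherpa-onnx)
from test import *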

# get current platform Structure
DEVICE_COMPATIBLE_NODE = '/proc/device-tree/compatible'

def get_host():
    # get platform and device type
    system = platform.system()
    machine = platform.machine()
    os_machine = system + '-' + machine
    if os_machine == 'Linux-aarch64':
        try:
            with open(DEVICE_COMPATIBLE_NODE) as f:
                device_compatible_str = f.read()
                # print(device_compatible_str)
                # SAMPLES : embedfire,lubancat-2-v2rockchip,rk3568
                if 'rk3562' in device_compatible_str:
                    host = 'RK3562'
                elif 'rk3576' in device_compatible_str:
                    host = 'RK3576'
                elif 'rk3588' in device_compatible_str:
                    host = 'RK3588'
                else:
                    host = 'RK3566_RK3568'
        except IOError:
            print('Read device node {} failed.'.format(DEVICE_COMPATIBLE_NODE))
            exit(-1)
    else:
        host = os_machine
    return host

Model_path='base-encoder-rk3568.rknn'

sound_file='1s.wav'
Decoder_path='tiny-decoder.onnx'
Encoder_path='tiny-encoder.onnx'
Tokens='tiny-tokens.txt'

mel = compute_features(sound_file)

print(mel.shape)

np_mel=mel.numpy()
print(type(np_mel))
print(np_mel.shape)

#model = OnnxModel(Encoder_path,Decoder_path)
#n_layer_cross_k,n_layer_cross_v=model.run_encoder(mel)

#print(type(n_layer_cross_k),type(n_layer_cross_v))

# print(n_layer_cross_k.shape,n_layer_cross_v.shape)

#host_name=get_host()
#print(host_name)
rknn_lite = RKNNLite()
# load rknn model
ret = rknn_lite.load_rknn(Model_path)
if ret != 0:
    print('Load rknn model failed')
    exit(ret)
print('Done!')

# Init runtime environment
print('--> Init runtime environment')
# Run on RK356x / RK3576 / RK3588 with Debian OS, do not need specify target.
ret = rknn_lite.init_runtime()
if ret != 0:
    print('Init runtime environment failed')
    exit(ret)
print('done')

# Inference
print('--> Running model')
outputs=rknn_lite.inference(inputs=[np_mel])

print(type(outputs))

csukuangfj commented 3 months ago

Please remove the following lines from export-onnx.py and retry:

https://github.com/k2-fsa/sherpa-onnx/blob/69347ffc8f299060a11e43cf6e194822b6356581/scripts/whisper/export-onnx.py#L104

https://github.com/k2-fsa/sherpa-onnx/blob/69347ffc8f299060a11e43cf6e194822b6356581/scripts/whisper/export-onnx.py#L416-L420

https://github.com/k2-fsa/sherpa-onnx/blob/69347ffc8f299060a11e43cf6e194822b6356581/scripts/whisper/export-onnx.py#L540-L546

HduHestin commented 3 months ago

Thanks for your attention. I found one difference. I checked my yolov8-pose model, which runs successfully with rknn, with verbose enabled when initializing the rknn runtime. There, all the information for the first layer is complete, but for whisper the DataFormat of the first layer is missing. I don't know whether this is what causes the "Aborted" error.

yolov8-pose verbose output: image

yolov8-pose in Netron: image

whisper verbose output: image

whisper in Netron: image

csukuangfj commented 3 months ago

Whisper uses a 3-D input, not a 4-D one.

HduHestin commented 3 months ago

Yes, I used a random np.ndarray with shape (1, 80, 3000) to test the rknn model, and it still aborted. I am wondering whether the "DataFormat" in the first line causes the abort.
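Before suspecting the runtime, it may help to rule out the input itself. The sketch below is not part of the original scripts: `make_dummy_mel` and `input_problems` are hypothetical helpers, and the float32 and C-contiguous requirements are assumptions about what the converted encoder expects. The expected shape is the single shape listed in `dynamic_input` in the conversion script.

```python
import numpy as np

# The only shape listed in dynamic_input in the conversion script.
EXPECTED_SHAPE = (1, 80, 3000)


def make_dummy_mel(seed=0):
    """Random float32 array with the shape the converted encoder was built for."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(EXPECTED_SHAPE).astype(np.float32)


def input_problems(x, expected_shape=EXPECTED_SHAPE):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    if x.ndim != len(expected_shape):
        problems.append(f"expected {len(expected_shape)}-D input, got {x.ndim}-D")
    elif tuple(x.shape) != tuple(expected_shape):
        problems.append(f"shape {tuple(x.shape)} is not {tuple(expected_shape)}")
    if x.dtype != np.float32:  # assumption: the encoder takes float32
        problems.append(f"dtype is {x.dtype}, expected float32")
    if not x.flags["C_CONTIGUOUS"]:  # assumption: a contiguous buffer is safest
        problems.append("array is not C-contiguous")
    return problems


dummy = make_dummy_mel()
print(input_problems(dummy))

# Features from a short clip that were not padded to 3000 frames.
short = np.zeros((1, 80, 100), dtype=np.float32)
print(input_problems(short))
```

The same check could be applied to `np_mel` in the inference script, since features computed from a 1-second file may not have 3000 frames unless they are padded.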

csukuangfj commented 3 months ago

Please try to convert the model layer by layer and also debug it layer by layer.

This issue is out of the scope of sherpa-onnx.
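One way to organize the layer-by-layer debugging suggested above is a bisection over prefixes of the model. This is a rough sketch under stated assumptions: `first_failing_layer` is a hypothetical helper, the layer names are made up for illustration, and `prefix_ok` stands for whatever check is used in practice (for example, cutting a sub-model with `onnx.utils.extract_model`, converting it to rknn, and running it on the device). It also assumes that once a prefix of the model fails, every longer prefix fails too.

```python
def first_failing_layer(layer_names, prefix_ok):
    """Return the index of the first layer whose prefix sub-model fails, or None.

    prefix_ok(k) must return True when the sub-model made of layers 0..k
    converts and runs, and False otherwise.
    """
    if not layer_names:
        return None
    lo, hi = 0, len(layer_names) - 1
    if prefix_ok(hi):
        return None  # the whole model works
    # Invariant: the prefix ending at hi fails; every prefix ending before lo passes.
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_ok(mid):
            lo = mid + 1
        else:
            hi = mid
    return lo


# Hypothetical layer names, for illustration only.
layers = [f"layer_{i}" for i in range(10)]
calls = []


def fake_prefix_ok(k):
    """Stand-in check: pretend the layer at index 6 is the one that breaks."""
    calls.append(k)
    return k < 6


idx = first_failing_layer(layers, fake_prefix_ok)
print("first failing layer:", layers[idx], "after", len(calls), "checks")
```

With a check that takes minutes per sub-model, bisection needs roughly log2(N) conversions instead of N.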

jinchao123 commented 3 months ago

Is the problem solved?