使用 sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12 模型，热词异常

jianking123 commented 6 months ago

使用命令行生成中文字热词文件后，在tokens.txt能找到对应的byte，但是使用时有异常微信图片_20240508112931 微信图片_20240508113007 微信图片_20240508113014

jianking123 commented 6 months ago

any update?

csukuangfj commented 6 months ago

但是使用时有异常

请贴 error log

jianking123 commented 6 months ago

但是使用时有异常

请贴 error log [Uploading 日志.txt…]()

jianking123 commented 6 months ago

15:13:55.836 SuggestManager E openApp name = com.k2fsa.sherpa.onnx 15:13:55.984 Perf I Connecting to perf service. 15:13:55.997 FeatureParser I can't find dipper.xml in assets/device_features/,it may be in /system/etc/device_features 15:13:56.008 libc E Access denied finding property "ro.vendor.df.effect.conflict" 15:13:56.013 Perf E Fail to get file list com.k2fsa.sherpa.onnx 15:13:56.013 Perf E getFolderSize() : Exception_1 = java.lang.NullPointerException: Attempt to get length of null array 15:13:56.055 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed 15:13:56.057 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed 15:13:56.062 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/Token;->(Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;I)V (greylist, linking, allowed) 15:13:56.062 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/InterceptorProxy;->getWorkThread()Landroid/os/HandlerThread; (greylist, linking, allowed) 15:13:56.062 ViewCo...actory D initViewContentFetcherClass 15:13:56.062 ViewCo...actory D getInterceptorPackageInfo 15:13:56.062 fsa.sherpa.onn W Accessing hidden method Landroid/app/AppGlobals;->getInitialApplication()Landroid/app/Application; (greylist, linking, allowed) 15:13:56.063 ViewCo...actory D getInitialApplication took 0ms 15:13:56.063 ViewCo...actory D packageInfo.packageName: com.miui.catcherpatch 15:13:56.070 ViewCo...actory D initViewContentFetcherClass took 7ms 15:13:56.070 ContentCatcher I ViewContentFetcher : ViewContentFetcher 15:13:56.070 ViewCo...actory D createInterceptor took 7ms 15:13:56.070 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/ContentCatcherManager;->getInstance()Lmiui/contentcatcher/sdk/ContentCatcherManager; (greylist, linking, allowed) 15:13:56.070 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/ContentCatcherManager;->registerContentInjector(Lmiui/contentcatcher/sdk/Token;Lmiui/contentcatcher/sdk/injector/IContentDecorateCallback;)V (greylist, linking, allowed) 15:13:56.072 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/ContentCatcherManager;->getPageConfig(Lmiui/contentcatcher/sdk/Token;)Lmiui/contentcatcher/sdk/data/PageConfig; (greylist, linking, allowed) 15:13:56.072 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/data/PageConfig;->getFeatures()Ljava/util/ArrayList; (greylist, linking, allowed) 15:13:56.072 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/data/PageConfig;->getCatchers()Ljava/util/ArrayList; (greylist, linking, allowed) 15:13:56.079 fsa.sherpa.onn W Accessing hidden method Landroid/view/View;->computeFitSystemWindows(Landroid/graphics/Rect;Landroid/graphics/Rect;)Z (greylist, reflection, allowed) 15:13:56.080 fsa.sherpa.onn W Accessing hidden method Landroid/view/ViewGroup;->makeOptionalFitsSystemWindows()V (greylist, reflection, allowed) 15:13:56.089 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed 15:13:56.092 chatty I uid=10200(com.k2fsa.sherpa.onnx) identical 1 line 15:13:56.093 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed 15:13:56.101 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed 15:13:56.122 sherpa-onnx I Start to initialize model 15:13:56.122 sherpa-onnx I Select model type 11 15:13:56.139 Online...Config I OnlineRecognizerConfig: OnlineRecognizerConfig(featConfig=FeatureConfig(sampleRate=16000, featureDim=80), modelConfig=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx, decoder=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx, joiner=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.int8.onnx), paraformer=OnlineParaformerModelConfig(encoder=, decoder=), zipformer2Ctc=OnlineZipformer2CtcModelConfig(model=), tokens=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt, numThreads=1, debug=false, provider=cpu, modelType=zipformer2), lmConfig=OnlineLMConfig(model=, scale=0.5), ctcFstDecoderConfig=OnlineCtcFstDecoderConfig(graph=, maxActive=3000), endpointConfig=EndpointConfig(rule1=EndpointRule(mustContainNonSilence=false, minTrailingSilence=2.4, minUtteranceLength=0.0), rule2=EndpointRule(mustContainNonSilence=true, minTrailingSilence=1.4, minUtteranceLength=0.0), rule3=EndpointRule(mustContainNonSilence=false, minTrailingSilence=0.0, minUtteranceLength=20.0)), enableEndpoint=true, decodingMethod=modified_beam_search, maxActivePaths=4, hotwordsFile=sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/hotwords_mix_b.txt, hotwordsScore=4.5) 15:13:56.139 sherpa-onnx W config: OnlineRecognizerConfig(feat_config=FeatureExtractorConfig(sampling_rate=16000, feature_dim=80, low_freq=20, high_freq=-400, dither=0), model_config=OnlineModelConfig(transducer=OnlineTransducerModelConfig(encoder="sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/encoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx", decoder="sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/decoder-epoch-20-avg-1-chunk-16-left-128.int8.onnx", joiner="sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/joiner-epoch-20-avg-1-chunk-16-left-128.int8.onnx"), paraformer=OnlineParaformerModelConfig(encoder="", decoder=""), wenet_ctc=OnlineWenetCtcModelConfig(model="", chunk_size=16, num_left_chunks=4), zipformer2_ctc=OnlineZipformer2CtcModelConfig(model=""), tokens="sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12/tokens.txt", num_threads=1, warm_up=0, debug=False, provider="cpu", model_type="zipformer2"), lm_config=OnlineLMConfig(model="", scale=0.5), endpoint_config=EndpointConfig(rule1=EndpointR 15:13:56.671 SuggestManager E openApp name = com.k2fsa.sherpa.onnx 15:13:57.049 libc E Access denied finding property "ro.hardware.chipname" 15:14:02.692 sherpa-onnx W Cannot find ID for token <0xE7> at line: ▁ 频 <0xE7> <0xB9> <0x81>. (Hint: words on the same line are separated by spaces) 15:14:02.692 sherpa-onnx W Encode hotwords failed.

pkufool commented 6 months ago

This fix is in https://github.com/k2-fsa/sherpa-onnx/pull/828 , will merge it as soon as possible.

lxp3 commented 2 weeks ago

It appears that simple-sentencepiece is unable to tokenize UTF-8 strings (BBPE CJK) correctly.

Python Example 1: Google's sentencepiece works fine. This code successfully produces the expected BPE tokens: ['▁ƋţŅ', '▁ƌŋţ', '▁ƌĭĺ', '▁ƋŠŒ'] with token IDs [6, 24, 433, 693].

  from byte_utils import byte_encode
  import sentencepiece as spm
  sp = spm.SentencePieceProcessor()
  rc = sp.load("onnx/bbpe.model")
  s = "你 好 北 京"
  s_utf8 = byte_encode(s) #  ƋţŅ ƌŋţ ƌĭĺ ƋŠŒ
  pieces = sp.encode(s_utf8 , out_type=str) # ['▁ƋţŅ', '▁ƌŋţ', '▁ƌĭĺ', '▁ƋŠŒ']

Python Example 2: simple-sentencepiece causes a segmentation fault (core dumped).

  from byte_utils import byte_encode
  from ssentencepiece import Ssentencepiece # pip install simple-sentencepiece
  ssp = Ssentencepiece("onnx/tokens.txt")
  s = "你 好 北 京"
  s_utf8 = byte_encode(s) #  ƋţŅ ƌŋţ ƌĭĺ ƋŠŒ
  pieces = ssp.encode(s_utf8, out_type=str) # raise  segmentation fault (core dumped).

On the C++ side, in sherpa-onnx, a core dump error also occurs when executing bpe_encode(). Our simple solution is use the google library "sentencepiece::SentencePieceProcessor bpe_encoder" as encoder, instead of "ssentencepiece::Ssentencepiece bpe_encoder".

csukuangfj commented 2 weeks ago

@pkufool please have a look.

k2-fsa / sherpa-onnx

使用 sherpa-onnx-streaming-zipformer-multi-zh-hans-2023-12-12 模型，热词异常 #842