jianking123 opened 6 months ago
any update?
But an exception occurs when it is used.
Please paste the error log.
> But an exception occurs when it is used.
> Please paste the error log.

Log file (日志.txt):
```
15:13:55.836 SuggestManager E openApp name = com.k2fsa.sherpa.onnx
15:13:55.984 Perf I Connecting to perf service.
15:13:55.997 FeatureParser I can't find dipper.xml in assets/device_features/,it may be in /system/etc/device_features
15:13:56.008 libc E Access denied finding property "ro.vendor.df.effect.conflict"
15:13:56.013 Perf E Fail to get file list com.k2fsa.sherpa.onnx
15:13:56.013 Perf E getFolderSize() : Exception_1 = java.lang.NullPointerException: Attempt to get length of null array
15:13:56.055 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed
15:13:56.057 ForceDarkHelper D updateByCheckExcludeList: pkg: com.k2fsa.sherpa.onnx activity: com.k2fsa.sherpa.onnx.MainActivity@a4ffbed
15:13:56.062 fsa.sherpa.onn W Accessing hidden method Lmiui/contentcatcher/sdk/Token;->
```
The fix is in https://github.com/k2-fsa/sherpa-onnx/pull/828; it will be merged as soon as possible.
It appears that simple-sentencepiece is unable to correctly tokenize byte-level BPE (BBPE) encoded UTF-8 strings for CJK text.
Python example 1: Google's sentencepiece works fine. This code produces the expected BPE pieces ['▁ƋţŅ', '▁ƌŋţ', '▁ƌĭĺ', '▁ƋŠŒ'] with token IDs [6, 24, 433, 693]:

```python
from byte_utils import byte_encode
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("onnx/bbpe.model")

s = "你 好 北 京"
s_utf8 = byte_encode(s)  # 'ƋţŅ ƌŋţ ƌĭĺ ƋŠŒ'
pieces = sp.encode(s_utf8, out_type=str)  # ['▁ƋţŅ', '▁ƌŋţ', '▁ƌĭĺ', '▁ƋŠŒ']
ids = sp.encode(s_utf8, out_type=int)  # [6, 24, 433, 693]
```
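For context, `byte_encode` maps each UTF-8 byte of the input to a single printable character, so a byte-level BPE model only ever sees a small, space-free alphabet. Below is a minimal sketch of the idea, modeled on the GPT-2-style byte-to-unicode table; this is an assumption for illustration, and the actual `byte_utils.byte_encode` may use a different mapping, so the output characters need not match 'ƋţŅ' exactly:

```python
def bytes_to_unicode():
    # Printable byte values map to themselves; the rest are shifted into a
    # range of printable code points starting at 256 (GPT-2-style table).
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)  # shift unprintable bytes to a safe code point
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

BYTE2CHAR = bytes_to_unicode()

def byte_encode_sketch(s: str) -> str:
    # Encode every UTF-8 byte of each whitespace-separated token as one
    # printable character; tokens stay separated by spaces.
    return " ".join(
        "".join(BYTE2CHAR[b] for b in tok.encode("utf-8"))
        for tok in s.split()
    )
```

Each CJK character is three UTF-8 bytes, so it becomes exactly three symbols, which is why the pieces above look like three-character groups.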
Python example 2: simple-sentencepiece crashes with a segmentation fault (core dumped) on the same input:

```python
from byte_utils import byte_encode
from ssentencepiece import Ssentencepiece  # pip install simple-sentencepiece

ssp = Ssentencepiece("onnx/tokens.txt")

s = "你 好 北 京"
s_utf8 = byte_encode(s)  # 'ƋţŅ ƌŋţ ƌĭĺ ƋŠŒ'
pieces = ssp.encode(s_utf8, out_type=str)  # segmentation fault (core dumped)
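A crash like this is consistent with an indexing bug while matching pieces that contain multi-byte characters. For reference, the core of such a tokenizer is a longest-match segmentation over the piece vocabulary; here is a minimal, hypothetical Python sketch (a simplified stand-in, not the actual ssentencepiece implementation) that handles unmatched characters safely instead of crashing:

```python
def greedy_encode(text, vocab):
    # Greedy longest-match segmentation over a piece vocabulary.
    # Characters no piece covers fall back to <unk> rather than crashing.
    pieces = []
    for word in text.split():
        word = "▁" + word  # sentencepiece word-boundary marker
        i = 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:  # longest piece starting at i
                    pieces.append(word[i:j])
                    i = j
                    break
            else:
                pieces.append("<unk>")  # no piece matched this character
                i += 1
    return pieces
```

Because Python strings are sequences of code points, the slicing above is safe for multi-byte characters by construction; a C++ implementation iterating over raw bytes has to take the same care explicitly.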
On the C++ side in sherpa-onnx, a core dump also occurs when bpe_encode() is executed. Our simple workaround is to use Google's sentencepiece::SentencePieceProcessor as the encoder instead of ssentencepiece::Ssentencepiece.
@pkufool please have a look.
After generating a Chinese hotword file with the command-line tool, the corresponding bytes can be found in tokens.txt, but an exception occurs when it is used.
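One quick way to narrow this down is to confirm that every character of the byte-encoded hotword is actually covered by some piece in tokens.txt. A small, hypothetical helper sketch (assuming the "&lt;piece&gt; &lt;id&gt;" one-pair-per-line format used by sherpa-onnx token files; the hotword must be byte-encoded first, e.g. with byte_encode):

```python
def load_symbols(path: str) -> set:
    # Read the piece column from a tokens.txt-style file
    # (assumed format: one "<piece> <id>" pair per line).
    symbols = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split()
            if fields:
                symbols.add(fields[0])
    return symbols

def missing_chars(encoded_hotword: str, symbols: set) -> list:
    # Report byte-characters of the encoded hotword that no piece covers;
    # an empty list means the vocabulary can represent the hotword.
    charset = set("".join(symbols))
    return [ch for ch in encoded_hotword.replace(" ", "") if ch not in charset]
```

If missing_chars returns a non-empty list, the exception is a vocabulary-coverage problem rather than the tokenizer bug discussed above.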