Open w11wo opened 5 months ago
@pkufool Could you have a look?
Emm, maybe we can add an option like `do-not-tokenize`; I think it should fix your issue.
For now, I think you can use the older version, v1.9.24.
@pkufool Yes, the `do-not-tokenize` option sounds good.
I can stick to older versions for now, but I wanted to try the customizable per-word hotword scores, which come only in the latest releases, hence the need for this new feature.
@w11wo OK, will make a PR.
Hi @pkufool, is there an update on the PR?
There is an ongoing PR: https://github.com/k2-fsa/sherpa-onnx/pull/1039
Thank you so much @pkufool. Looking forward to it getting merged.
Hi. I have a phoneme-based Zipformer model.
Before this PR, I was able to apply hotwords encoding for phoneme sequences, e.g. `ɪ z/dʒ ʌ s t`, following the older implementation of e.g. Chinese character hotwords encoding. But now, I noticed that the Chinese character hotwords encoding has changed from `深 度 学 习` (whitespace between chars) to `深度学习` (no whitespace), and I assume the string parser simply iterates through the non-whitespace characters in the string.

This, however, breaks my use case, since a phoneme sequence with digraphs, e.g. `dʒ ʌ s t`, will be incorrectly split into `d ʒ ʌ s t`. The issue is that my model's vocab supports digraphs and requires the old implementation.

Is it possible to add another modeling unit, other than the currently supported ones (cjk, BPE, cjk+BPE)? Maybe instead of iterating over every non-whitespace character, split by whitespace first? This new modeling unit can hopefully support other use cases similar to mine.
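To illustrate the difference between the two parsing strategies, here is a minimal Python sketch. The function names are illustrative only, not from sherpa-onnx; it just contrasts per-character iteration with whitespace splitting on a digraph phoneme sequence:

```python
# Hypothetical sketch of the two tokenization strategies discussed above.
# Neither function is part of sherpa-onnx; they only model the behavior.

def split_per_char(hotword: str) -> list[str]:
    # Current behavior: iterate over every non-whitespace character,
    # so the digraph "dʒ" (two codepoints) is broken into "d" and "ʒ".
    return [ch for ch in hotword if not ch.isspace()]

def split_on_whitespace(hotword: str) -> list[str]:
    # Proposed behavior for whitespace-delimited modeling units:
    # split on whitespace first, keeping multi-character units intact.
    return hotword.split()

phoneme_hotword = "dʒ ʌ s t"
print(split_per_char(phoneme_hotword))       # digraph is broken apart
print(split_on_whitespace(phoneme_hotword))  # digraph preserved
```

With whitespace-first splitting, each token can then be looked up directly in a vocab that contains digraph units like `dʒ`, matching the older behavior.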
Massive thanks for all the work and help thus far!