synesthesiam / voice2json

Command-line tools for speech and intent recognition on Linux
MIT License
1.08k stars · 63 forks

Possibility of improving Chinese speech recognition (speech to text) #76

Open SwimmingTiger opened 2 years ago

SwimmingTiger commented 2 years ago

I am using voice2json as a voice command recognition backend in my voice interaction mod for a video game. As a native Chinese speaker, I find voice2json's Chinese support rather limited.

So here are some thoughts on the possibility of improving Chinese speech recognition.

Intelligent Tokenizer (Word Segmenter)

Here is a simple project for it: fxsjy/Jieba. I use it in my application and it works well (I used the .NET port of it).

A demo:

pip3 install jieba

test.py

# encoding=utf-8
import jieba

sentences = [
    "我来到北京清华大学",
    "乒乓球拍卖完了",
    "中国科学技术大学",
    "他来到了网易杭研大厦",
    "小明硕士毕业于中国科学院计算所,后在日本京都大学深造"
]

# jieba.cut() returns a generator of segmented words
for sentence in sentences:
    seg_list = jieba.cut(sentence)
    print(' '.join(seg_list))

Result:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.458 seconds.
Prefix dict has been built successfully.
我 来到 北京 清华大学
乒乓球 拍卖 完 了
中国 科学技术 大学
他 来到 了 网易 杭研 大厦
小明 硕士 毕业 于 中国科学院 计算所 , 后 在 日本京都大学 深造

Jieba also uses an HMM model to predict new words that are not in its dictionary.

Pronunciation Prediction

Chinese pronunciation is character-based. The pronunciation of Chinese words is the concatenation of the pronunciation of each character.

So, split an unknown word into individual characters, look up the pronunciation of each character, and concatenate them: the result is the pronunciation of the unknown word. This doesn't even require training a neural network.

I use this method in my program and it works well. If the word returned by jieba.cut() is not in base_dictionary.txt, I split it into a sequence of single Chinese characters.

日本京都大学 -> 日 本 京 都 大 学 -> r iz4 b en3 j ing1 d u1 d a4 x ve2

Completely correct.

The only caveat is that some characters have multiple pronunciations, and you need to account for every possible reading when combining them. A trained neural network has an advantage here, but even without one you can still generate all candidate pronunciations and assume each reading is equally probable.

虎绿林 -> 虎 绿 林 -> (h u3 l v4 l in2 | h u3 l u4 l in2)
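The character-splitting approach above can be sketched as follows. The per-character dictionary here is a small hypothetical stand-in (a real one would be derived from base_dictionary.txt), and a Cartesian product enumerates the readings of polyphonic characters:

```python
from itertools import product

# Hypothetical per-character pronunciation dictionary (character -> list of
# possible readings). A real table would come from base_dictionary.txt.
CHAR_PRONUNCIATIONS = {
    "日": ["r iz4"],
    "本": ["b en3"],
    "京": ["j ing1"],
    "都": ["d u1"],
    "大": ["d a4"],
    "学": ["x ve2"],
    "虎": ["h u3"],
    "绿": ["l v4", "l u4"],  # polyphonic character: two readings
    "林": ["l in2"],
}

def predict_pronunciations(word):
    """Return every candidate pronunciation of an unknown word by
    concatenating the per-character readings; polyphonic characters
    yield one candidate per combination, assumed equally probable."""
    per_char = [CHAR_PRONUNCIATIONS[ch] for ch in word]
    return [' '.join(combo) for combo in product(*per_char)]

print(predict_pronunciations("日本京都大学"))
# ['r iz4 b en3 j ing1 d u1 d a4 x ve2']
print(predict_pronunciations("虎绿林"))
# ['h u3 l v4 l in2', 'h u3 l u4 l in2']
```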

IPA pronunciation dictionary

I have one: https://github.com/SwimmingTiger/BigCiDian

Chao tone letters (IPA) are used to mark pitch.

This dictionary contains pronunciations of Chinese words and common English words.

Foreign language support

English words sometimes appear in spoken and written Chinese, and these words retain their English written form.

e.g. 我买了一台Mac笔记本,用的是macOS,我用起来还是不习惯,等哪天给它装个Windows系统。 ("I bought a Mac laptop that runs macOS; I'm still not used to it, so one of these days I'll install Windows on it.")

Therefore, Chinese speech recognition engines usually need to handle both languages at the same time. If an English word is encountered, it is processed according to English rules (including pronunciation prediction).

If it is a Chinese word or a compound word (such as "U盘", meaning USB flash drive), it is processed according to Chinese rules.

For example, during word segmentation, English words must not be split into individual characters.
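As a rough illustration, mixed-script text can first be split into English and Chinese runs with a simple regular expression; the English words stay whole, while the Chinese runs would then be passed on to a proper tokenizer such as jieba. This is only a sketch, and unlike a real segmenter it also splits compounds like "U盘" at the script boundary:

```python
import re

def split_mixed(text):
    """Split mixed Chinese/English text into runs: English words and
    digit runs are kept whole, while each CJK run is returned as one
    chunk to be further segmented by a real tokenizer (e.g. jieba)."""
    return re.findall(r'[A-Za-z]+|[0-9]+|[\u4e00-\u9fff]+', text)

print(split_mixed("我买了一台Mac笔记本"))
# ['我买了一台', 'Mac', '笔记本']
```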

It seems possible to train a model that includes both Chinese and English. It might also be convenient if voice2json supported model mixing (combining a pure Chinese model and a pure English model into a single model), but I don't know if that's technically possible.

Number to Words

Here is a complete C# implementation.

Finding or writing a well-rounded Python implementation doesn't seem that hard.
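As a sketch of what such a Python implementation could look like, here is a minimal number-to-Chinese-words converter that only handles integers from 0 to 9999; a complete implementation would also need to cover larger numbers, decimals, and context-dependent readings:

```python
DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]

def number_to_chinese(n):
    """Convert an integer 0-9999 into spoken Chinese words."""
    if n == 0:
        return "零"
    s = str(n)
    length = len(s)
    out = ""
    for i, ch in enumerate(s):
        d = int(ch)
        pos = length - i - 1
        if d == 0:
            # Emit 零 once per run of interior zeros, but only when a
            # non-zero digit still follows (e.g. 105 -> 一百零五).
            if out and not out.endswith("零") and int(s[i:]) != 0:
                out += "零"
        else:
            out += DIGITS[d] + UNITS[pos]
    # 10-19 are spoken as 十X, not 一十X
    if out.startswith("一十"):
        out = out[1:]
    return out

print(number_to_chinese(15))    # 十五
print(number_to_chinese(105))   # 一百零五
print(number_to_chinese(2024))  # 二千零二十四
```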

Audio Corpora

Mozilla Common Voice already has a large enough Chinese audio corpus.

Convert between Simplified Chinese and Traditional Chinese

Traditional Chinese and Simplified Chinese are just different written forms of the same characters; the spoken language is the same.

https://github.com/SwimmingTiger/BigCiDian is a Simplified Chinese pronunciation dictionary (it does not contain Traditional characters), so it may be easiest to convert all text into Simplified Chinese first.

https://github.com/yichen0831/opencc-python can do this very well.

pip3 install opencc-python-reimplemented

test.py

from opencc import OpenCC
cc = OpenCC('t2s')  # convert from Traditional Chinese to Simplified Chinese
to_convert = '開放中文轉換'
converted = cc.convert(to_convert)
print(converted)

Result: 开放中文转换

Convert it before tokenization (word segmentation).

Running t2s conversion on text that is already Simplified Chinese leaves it unchanged, so there is no need to detect the script before converting.

Complete preprocessing pipeline for text

Convert Traditional to Simplified -> Number to Words -> Tokenizer (Word Segmentation) -> Convert to Pronunciation -> Unknown Word Pronunciation Prediction (Chinese and English may have different modes, handwritten code or neural network)

Why does number-to-words come before the tokenizer?

Because the output of number-to-words is itself a Chinese sentence with no spaces between words, so it still needs to be segmented.

Model Training

I want to train a Chinese Kaldi model for voice2json. Maybe I can use the steps and tools from Rhasspy.

To train a Chinese model using https://github.com/rhasspy/ipa2kaldi, it looks like I need to add Chinese support to https://github.com/rhasspy/gruut.

If there is any progress, I will update here. Any suggestions are also welcome.

synesthesiam commented 1 year ago

Thank you for the excellent information! I will look more into this, and see what I can do.

It seems possible to train a model that includes both Chinese and English. It might also be convenient if voice2json supported model mixing (combining a pure Chinese model and a pure English model into a single model), but I don't know if that's technically possible.

This would only be possible if I had some way of splitting the audio up into Chinese and English portions. I do think a mixed model could be trained to handle both, however.

If there is any progress, I will update here. Any suggestions are also welcome.

I wonder if we could use these same techniques to train a text-to-speech voice for Chinese using mimic3.