synesthesiam / voice2json

Command-line tools for speech and intent recognition on Linux
MIT License
1.08k stars · 63 forks

Possibility of improving Chinese speech recognition (speech to text) #76

Open SwimmingTiger opened 2 years ago

SwimmingTiger commented 2 years ago

I am using voice2json as a voice command recognition backend in my voice interaction mod for a video game. As a native Chinese speaker, I find voice2json's Chinese support rather limited.

So here are some thoughts on the possibility of improving Chinese speech recognition.

Intelligent Tokenizer (Word Segmenter)

Here is a simple project for it: fxsjy/Jieba. I use it in my application and it works well (I used the .NET port of it).

A demo:

pip3 install jieba

test.py

# encoding=utf-8
import jieba

sentences = [
    "我来到北京清华大学",
    "乒乓球拍卖完了",
    "中国科学技术大学",
    "他来到了网易杭研大厦",
    "小明硕士毕业于中国科学院计算所,后在日本京都大学深造"
]

# jieba.cut() returns a generator of segmented words
for sentence in sentences:
    seg_list = jieba.cut(sentence)
    print(' '.join(seg_list))

Result:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.458 seconds.
Prefix dict has been built successfully.
我 来到 北京 清华大学
乒乓球 拍卖 完 了
中国 科学技术 大学
他 来到 了 网易 杭研 大厦
小明 硕士 毕业 于 中国科学院 计算所 , 后 在 日本京都大学 深造

Jieba also uses an HMM model to predict new words that are not in its dictionary.

Pronunciation Prediction

Chinese pronunciation is character-based. The pronunciation of Chinese words is the concatenation of the pronunciation of each character.

So, split an unknown word into individual characters, look up the pronunciation of each character, and concatenate them: the result is the pronunciation of the unknown word. This doesn't even require training a neural network.

I use this method in my program and it works well. If the word returned by jieba.cut() is not in base_dictionary.txt, I split it into a sequence of single Chinese characters.

日本京都大学 -> 日 本 京 都 大 学 -> r iz4 b en3 j ing1 d u1 d a4 x ve2

Completely correct.

The only caveat is that some characters have multiple pronunciations, and you need to account for every possible reading when combining them. A trained neural network has an advantage here, but even without one you can still generate all candidate pronunciations and assume each reading is equally probable.

虎绿林 -> 虎 绿 林 -> (h u3 l v4 l in2 | h u3 l u4 l in2)
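The character-splitting approach above can be sketched as follows. The per-character dictionary here is a small hypothetical stand-in (a real one would be derived from base_dictionary.txt), and a Cartesian product enumerates the readings of polyphonic characters:

```python
from itertools import product

# Hypothetical per-character pronunciation dictionary (character -> list of
# possible readings). A real table would come from base_dictionary.txt.
CHAR_PRONUNCIATIONS = {
    "日": ["r iz4"],
    "本": ["b en3"],
    "京": ["j ing1"],
    "都": ["d u1"],
    "大": ["d a4"],
    "学": ["x ve2"],
    "虎": ["h u3"],
    "绿": ["l v4", "l u4"],  # polyphonic character: two readings
    "林": ["l in2"],
}

def predict_pronunciations(word):
    """Return every candidate pronunciation of an unknown word by
    concatenating the per-character readings; polyphonic characters
    yield one candidate per combination, assumed equally probable."""
    per_char = [CHAR_PRONUNCIATIONS[ch] for ch in word]
    return [' '.join(combo) for combo in product(*per_char)]

print(predict_pronunciations("日本京都大学"))
# ['r iz4 b en3 j ing1 d u1 d a4 x ve2']
print(predict_pronunciations("虎绿林"))
# ['h u3 l v4 l in2', 'h u3 l u4 l in2']
```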

IPA pronunciation dictionary

I have one: https://github.com/SwimmingTiger/BigCiDian

Chao tone letters (IPA) are used to mark pitch.

This dictionary contains pronunciations of Chinese words and common English words.

Foreign language support

English words sometimes appear in spoken and written Chinese, and these words retain their English written form.

e.g. 我买了一台Mac笔记本,用的是macOS,我用起来还是不习惯,等哪天给它装个Windows系统。 ("I bought a Mac laptop that runs macOS; I'm still not used to it, so one of these days I'll install Windows on it.")

Therefore, Chinese speech recognition engines usually need to handle both languages at the same time. If an English word is encountered, it is processed according to English rules (including pronunciation prediction).

If it is a Chinese word or a compound word (such as "U盘", meaning USB flash drive), it is processed according to Chinese rules.

For example, during word segmentation, English words must not be split into individual characters.
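As a rough illustration, mixed-script text can first be split into English and Chinese runs with a simple regular expression; the English words stay whole, while the Chinese runs would then be passed on to a proper tokenizer such as jieba. This is only a sketch, and unlike a real segmenter it also splits compounds like "U盘" at the script boundary:

```python
import re

def split_mixed(text):
    """Split mixed Chinese/English text into runs: English words and
    digit runs are kept whole, while each CJK run is returned as one
    chunk to be further segmented by a real tokenizer (e.g. jieba)."""
    return re.findall(r'[A-Za-z]+|[0-9]+|[\u4e00-\u9fff]+', text)

print(split_mixed("我买了一台Mac笔记本"))
# ['我买了一台', 'Mac', '笔记本']
```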

It seems possible to train a model that includes both Chinese and English. It might also be convenient if voice2json supported model mixing (combining a pure Chinese model and a pure English model into a single model), but I don't know if that's technically possible.

Number to Words

Here is a complete C# implementation.

Finding or writing a well-rounded Python implementation doesn't seem that hard.
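As a sketch of what such a Python implementation could look like, here is a minimal number-to-Chinese-words converter that only handles integers from 0 to 9999; a complete implementation would also need to cover larger numbers, decimals, and context-dependent readings:

```python
DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]

def number_to_chinese(n):
    """Convert an integer 0-9999 into spoken Chinese words."""
    if n == 0:
        return "零"
    s = str(n)
    length = len(s)
    out = ""
    for i, ch in enumerate(s):
        d = int(ch)
        pos = length - i - 1
        if d == 0:
            # Emit 零 once per run of interior zeros, but only when a
            # non-zero digit still follows (e.g. 105 -> 一百零五).
            if out and not out.endswith("零") and int(s[i:]) != 0:
                out += "零"
        else:
            out += DIGITS[d] + UNITS[pos]
    # 10-19 are spoken as 十X, not 一十X
    if out.startswith("一十"):
        out = out[1:]
    return out

print(number_to_chinese(15))    # 十五
print(number_to_chinese(105))   # 一百零五
print(number_to_chinese(2024))  # 二千零二十四
```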

Audio Corpora

Mozilla Common Voice already has a large enough Chinese audio corpus.

Convert between Simplified Chinese and Traditional Chinese

Traditional Chinese and Simplified Chinese are just different written forms of the same characters; the spoken language is the same.

https://github.com/SwimmingTiger/BigCiDian is a Simplified Chinese pronunciation dictionary (it does not contain Traditional characters), so it may be easiest to convert all text into Simplified Chinese first.

https://github.com/yichen0831/opencc-python can do this very well.

pip3 install opencc-python-reimplemented

test.py

from opencc import OpenCC
cc = OpenCC('t2s')  # convert from Traditional Chinese to Simplified Chinese
to_convert = '開放中文轉換'
converted = cc.convert(to_convert)
print(converted)

Result: 开放中文转换

Convert it before tokenization (word segmentation).

Running t2s conversion on text that is already Simplified Chinese leaves it unchanged, so there is no need to detect the script before converting.

Complete preprocessing pipeline for text

Convert Traditional to Simplified -> Number to Words -> Tokenizer (Word Segmentation) -> Convert to Pronunciation -> Unknown Word Pronunciation Prediction (Chinese and English may have different modes, handwritten code or neural network)

Why does number-to-words come before the tokenizer?

Because the output of number-to-words is itself a Chinese sentence with no spaces between words, so it still needs to be segmented.

Model Training

I want to train a Chinese Kaldi model for voice2json. Maybe I can use the steps and tools from Rhasspy.

To train a Chinese model using https://github.com/rhasspy/ipa2kaldi, it looks like I need to add Chinese support to https://github.com/rhasspy/gruut.

If there is any progress, I will update here. Any suggestions are also welcome.

synesthesiam commented 1 year ago

Thank you for the excellent information! I will look more into this, and see what I can do.

It seems possible to train a model that includes both Chinese and English. It might also be convenient if voice2json supported model mixing (combining a pure Chinese model and a pure English model into a single model), but I don't know if that's technically possible.

This would only be possible if I had some way of splitting the audio up into Chinese and English portions. I do think a mixed model could be trained to handle both, however.

If there is any progress, I will update here. Any suggestions are also welcome.

I wonder if we could use these same techniques to train a text-to-speech voice for Chinese using mimic3.