[Feature request / Idea] Syllable level VAD for mandarin voice

diyism commented 11 months ago

Feature

Syllable-level VAD for Mandarin, to segment every syllable of Mandarin speech.

Motivation

Typically, every Mandarin syllable takes roughly the same 0.3 seconds, and every Mandarin syllable consists of only "a consonant + a vowel" or "a vowel", with no trailing consonant cluster (compare the 'sk' in the English syllable "desk"). So I think that if there were a syllable-level VAD to segment Mandarin speech in real time, we could build a very precise and very fast (by cutting off the overlong trailing vowel) Mandarin syllable recognition engine, much like a 1300-key mechanical keyboard (1300 Mandarin syllables in total).

That way we could build a general hotword detection engine (without fixed, pretrained hotwords), or send the Mandarin syllable sequence to GPT-3.5 or Claude for the best available semantic understanding (ref: https://github.com/k2-fsa/sherpa-ncnn/issues/177).
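To make the two ideas above concrete, here is a minimal sketch in plain Python. It is not an existing silero-vad feature. It assumes a speech segment is already known (start and end in seconds), takes the 0.3-second figure from the motivation at face value, and uses toneless pinyin strings as syllable labels. The names `split_into_syllable_slots` and `find_hotword` are hypothetical, and the syllable sequence is hard-coded rather than recognized from audio.

```python
def split_into_syllable_slots(start_s, end_s, syllable_s=0.3):
    """Split one speech segment into equal slots of roughly `syllable_s` seconds.

    This is the naive fixed-duration assumption only; real syllables vary in length.
    """
    duration = end_s - start_s
    if duration <= 0:
        return []
    # Use the nearest whole number of syllables, but always at least one slot.
    n_slots = max(1, round(duration / syllable_s))
    slot = duration / n_slots
    return [(start_s + i * slot, start_s + (i + 1) * slot) for i in range(n_slots)]


def find_hotword(syllables, hotword):
    """Return the index where the hotword's syllable sequence first occurs, or -1."""
    n, m = len(syllables), len(hotword)
    if m == 0:
        return -1
    for i in range(n - m + 1):
        if syllables[i:i + m] == hotword:
            return i
    return -1


# Hypothetical speech segment from 1.00 s to 1.95 s.
for begin, end in split_into_syllable_slots(1.0, 1.95):
    print(f"{begin:.2f}s - {end:.2f}s")

# Hypothetical recognized syllable stream and a hotword defined at run time.
stream = ["ni", "hao", "xiao", "ai", "tong", "xue"]
print(find_hotword(stream, ["xiao", "ai"]))
```

Because the hotword is just a list of syllables compared against the stream, it can be changed at run time without retraining anything, which is the point of the "no fixed pretrained hotwords" idea.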

Some projects already target syllable/pinyin segmentation, for example whisper-timestamped (https://github.com/linto-ai/whisper-timestamped) and a very concise tool, aeneas (https://github.com/readbeyond/aeneas).

Attached is a spectrogram of Mandarin syllables:

snakers4 commented 11 months ago

Hi,

Many thanks for your idea. It is quite lucky, because we just finished our HSK 1 and we have a basic grasp of Chinese phonetics now.

The issue here is that Chinese phonetics is quite unique. To a certain extent it is much simpler than that of European languages (including Russian), i.e. it has a very limited number of initials and finals (even if we take all tones into account).

On the other hand, making a full-blown phonetic aligner model for a particular language (one that is supposed to work for any domain) is a bit too complicated for a VAD model, i.e. it would effectively be a language-specific STT model. I believe there is no way to make it compatible with all languages and have it work with decent quality without increasing the model size.

zxl777 commented 9 months ago

I actually tested Mandarin. If you speak continuously, silero-vad detects it accurately.

Errors only occur when speaking word by word with pauses. But this situation is rare in real-world environments.
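One way to check this kind of behaviour is to look at how long each detected segment is and how long the pause before it was. A minimal sketch, assuming the segments come as a list of dicts with `start` and `end` in samples (the form returned by silero-vad's `get_speech_timestamps`) and a 16 kHz sample rate. The helper `summarize_segments` is hypothetical and the timestamps below are made up for illustration, not taken from a real test.

```python
SAMPLING_RATE = 16000  # assumed sample rate of the audio


def summarize_segments(timestamps, sampling_rate=SAMPLING_RATE):
    """Convert sample-based segments to seconds and compute the pause before each one."""
    rows = []
    prev_end_s = None
    for seg in timestamps:
        start_s = seg["start"] / sampling_rate
        end_s = seg["end"] / sampling_rate
        # The first segment has no preceding segment, so its pause is None.
        pause_s = None if prev_end_s is None else start_s - prev_end_s
        rows.append({
            "start_s": start_s,
            "end_s": end_s,
            "duration_s": end_s - start_s,
            "pause_before_s": pause_s,
        })
        prev_end_s = end_s
    return rows


# Made-up segments: two short isolated words, then a longer continuous phrase.
example = [
    {"start": 1600, "end": 6400},
    {"start": 16000, "end": 20800},
    {"start": 22400, "end": 48000},
]

for row in summarize_segments(example):
    pause = "n/a" if row["pause_before_s"] is None else f"{row['pause_before_s']:.2f}s"
    print(f"{row['start_s']:.2f}s - {row['end_s']:.2f}s  "
          f"duration {row['duration_s']:.2f}s  pause before {pause}")
```

Comparing these numbers for word-by-word speech against continuous speech would show whether the errors line up with very short segments or with long pauses.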

diyism commented 9 months ago

> I actually tested Mandarin. If you speak continuously, silero-vad detects it accurately.
>
> Errors only occur when speaking word by word with pauses. But this situation is rare in real-world environments.

Great, is there a Colab script you could share?

snakers4 / silero-vad

[Feature request / Idea] Syllable level VAD for mandarin voice #402
