Tortoise17 opened 5 years ago
I have never used Mozilla Common Voice, btw, but the easiest way is to format your data into the LJSpeech format, which consists of:
ID: the name of the corresponding .wav file
Transcription: words spoken by the reader (UTF-8)
Normalized Transcription: transcription with numbers, ordinals, and monetary units expanded into full words (UTF-8)
It worked for me when building TTS for another language. CMIIW.
I have to train for another language, not English, and for speech to text.
The LJSpeech format has nothing to do with language. It is just a format which, in the context of this repository, is used to build the dataset. When I created a dataset for the Indonesian language, I followed the format as follows:
ID,Transcription,Normalized Transcription
0001,1 2 3 sayang semuanya,satu dua tiga sayang semuanya
0002,saya suka baju yang berwarna merah muda,saya suka baju yang berwarna merah muda
0003,bali adalah pulau tujuan wisata no 1 di dunia,bali adalah pulau tujuan wisata nomor satu di dunia
and so forth
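For reference, a minimal sketch of writing rows like these to a metadata file. The original LJSpeech `metadata.csv` separates the three fields with `|` rather than a comma, so the sketch defaults to that; the function name `format_metadata` and the sample rows are only illustrative, so adjust the delimiter to whatever your preprocessing script expects.

```python
def format_metadata(rows, delimiter="|"):
    """Return metadata text with one 'id|transcription|normalized' line per utterance."""
    lines = []
    for utt_id, text, normalized in rows:
        for field in (utt_id, text, normalized):
            # A delimiter or newline inside a field would corrupt the file layout.
            if delimiter in field or "\n" in field:
                raise ValueError(f"field contains delimiter or newline: {field!r}")
        lines.append(delimiter.join((utt_id, text, normalized)))
    return "\n".join(lines) + "\n"


# Sample rows taken from the Indonesian example above.
rows = [
    ("0001", "1 2 3 sayang semuanya", "satu dua tiga sayang semuanya"),
    ("0002", "saya suka baju yang berwarna merah muda",
     "saya suka baju yang berwarna merah muda"),
    ("0003", "bali adalah pulau tujuan wisata no 1 di dunia",
     "bali adalah pulau tujuan wisata nomor satu di dunia"),
]
metadata_text = format_metadata(rows)
```

To save it, write `metadata_text` to a UTF-8 file, for example with `open("metadata.csv", "w", encoding="utf-8")`.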
But LJSpeech is meant to be a TTS (text-to-speech) dataset, and I don't think it is suitable for STT (speech-to-text), since recognizing speech requires varied voices from many people. You're right to use Mozilla Common Voice as the dataset, but I think this is not the right repository to ask in. Please head to an STT repo like Kaldi or CMUSphinx. This may answer your question: https://github.com/kaldi-asr/kaldi/tree/master/egs/commonvoice/s5
It would be great if Common Voice were supported out of the box. It looks like it is slowly becoming the biggest speech dataset for most languages, and for many languages it is the only one available under a free license. One can easily select subsets from the dataset, e.g. only male voices with a certain accent. This makes the dataset very interesting for TTS.
Converting the data format is one thing. But the Common Voice dataset uses mp3. Would it be necessary to convert everything into .wav files to use the data in tacotron? Common Voice also has no normalized transcription of the text; is that a problem?
Hi stefanrotz. Did you solve your problem? I would also like to train using the Mozilla mp3 files. Should I convert them to .wav?
@japita-se no, I haven't solved this problem yet. I first need more information.
Hi. I succeeded in part. I converted all the Mozilla mp3 files into wav using sox. Now I'm training. However, after 3000 steps the synthesis is still pure noise. I think I need more steps. I am just wondering whether the dataset is suitable: remember that the Mozilla dataset is multi-speaker, while LibVox is single-speaker. I suspect the results will not be good. Maybe @keithito can say something on this.
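A minimal sketch of the mp3-to-wav conversion with sox, driven from Python. It assumes `sox` is installed with mp3 support. The 22050 Hz mono output is an assumption chosen to match LJSpeech, and the helper names `sox_command` and `convert_all` are only illustrative.

```python
import subprocess
from pathlib import Path


def sox_command(mp3_path, wav_dir, sample_rate=22050):
    """Build the sox argument list that converts one mp3 into a mono wav."""
    mp3_path = Path(mp3_path)
    wav_path = Path(wav_dir) / (mp3_path.stem + ".wav")
    # Options placed before the output file apply to the output:
    # -r sets the sample rate, -c sets the number of channels.
    return ["sox", str(mp3_path), "-r", str(sample_rate), "-c", "1", str(wav_path)]


def convert_all(mp3_dir, wav_dir, sample_rate=22050):
    """Convert every .mp3 in mp3_dir, stopping on the first sox failure."""
    Path(wav_dir).mkdir(parents=True, exist_ok=True)
    for mp3_path in sorted(Path(mp3_dir).glob("*.mp3")):
        subprocess.run(sox_command(mp3_path, wav_dir, sample_rate), check=True)
```

Calling `convert_all("clips", "wavs")` would then convert a folder of clips; both directory names here are placeholders.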
@japita-se did you use the complete Common Voice dataset, or did you select a single speaker? Most TTS engines can only be trained on a single voice. It is best to use the speaker with the largest number of recordings.
Common Voice recordings tend to be of lower quality than what you would ideally use for building a TTS model. Background noise, static, bumps, hisses, etc., are present in many of them, and those seem to get magnified by tacotron. A single speaker using a studio recording setup would be the kind of source you'd ideally use.
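A minimal sketch of picking the speaker with the most recordings. It assumes a tab-separated Common Voice metadata file (such as `validated.tsv`) with `client_id`, `path`, and `sentence` columns; the function name `busiest_speaker` and the sample data are only illustrative.

```python
import csv
import io
from collections import Counter


def busiest_speaker(tsv_file):
    """Return (client_id, rows) for the speaker with the most clips."""
    # QUOTE_NONE keeps quote characters inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = list(reader)
    if not rows:
        raise ValueError("no clips found in the TSV file")
    counts = Counter(row["client_id"] for row in rows)
    speaker, _ = counts.most_common(1)[0]
    return speaker, [row for row in rows if row["client_id"] == speaker]


# Tiny in-memory example; real files are opened with encoding="utf-8", newline="".
sample = (
    "client_id\tpath\tsentence\n"
    "spk_a\t1.mp3\thalo\n"
    "spk_b\t2.mp3\tsatu\n"
    "spk_b\t3.mp3\tdua\n"
)
speaker, clips = busiest_speaker(io.StringIO(sample))
```

The returned rows can then be used to copy only that speaker's clips and transcripts into the training set.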
I have tried to use the Common Voice dataset with this code, but I am still unable to use it for speech to text. Can you guide me on what changes need to be made? Especially for any data other than English. If anyone can help.
Bro, tacotron is used for text to speech, not speech to text. But if we could get an Indonesian speech model running with tacotron, combined with a module that reads Indonesian text and saves the output as wav files, we could easily create a dataset for building a speech-to-text model. Indonesian Common Voice has only 3 hours of speech; to create a speech-to-text model, you need at least 500 hours of speech.
Once you have a large enough dataset, you can start training to produce an Indonesian speech recognizer model.
Let's hope Capsule Network can reduce the need of 'big dataset'...