keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License
2.95k stars 959 forks

Mozilla Common Voice Data #299

Open Tortoise17 opened 5 years ago

Tortoise17 commented 5 years ago

I have tried to use the Common Voice dataset with this code, but I am still unable to use it for speech-to-text. Can you guide me on what changes need to be made, especially for data in languages other than English? Any help would be appreciated.

liberocks commented 5 years ago

I have never used Mozilla Common Voice, but the easiest way is to format your data into the LJSpeech format, which consists of a folder of audio clips plus a metadata file listing, for each clip: an ID, a transcription, and a normalized transcription.

It worked for me when building TTS for another language. CMIIW.

Tortoise17 commented 5 years ago

I have to train for another language, not English, and for speech-to-text.

liberocks commented 5 years ago

The LJSpeech format has nothing to do with the language. It is just a format which, in the context of this repository, is used to structure the dataset. When I created a dataset for the Indonesian language, I followed this format:

```
ID,Transcription,Normalized Transcription
0001,1 2 3 sayang semuanya,satu dua tiga sayang semuanya
0002,saya suka baju yang berwarna merah muda,saya suka baju yang berwarna merah muda
0003,bali adalah pulau tujuan wisata no 1 di dunia,bali adalah pulau tujuan wisata nomor satu di dunia
```

and so forth.

But LJSpeech is meant to be a TTS (text-to-speech) dataset, and I don't think it is suitable for STT (speech-to-text), since recognizing speech requires voices from many different people. You're right to use Mozilla Common Voice as the dataset, but I think this is not the correct repository to ask in. Please head to an STT repo like Kaldi or CMUSphinx. This may answer your question: https://github.com/kaldi-asr/kaldi/tree/master/egs/commonvoice/s5
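For anyone attempting the conversion described above, here is a minimal sketch of turning a Common Voice TSV export into an LJSpeech-style metadata file. It assumes the TSV has `path` and `sentence` columns (true for recent Common Voice releases; check your download), and it writes pipe-delimited lines as keithito's loader expects. Real text normalization (expanding numbers, etc.) is left as a placeholder: the third field simply repeats the raw transcription.

```python
import csv
from pathlib import Path

def common_voice_to_ljspeech(tsv_path, out_path):
    """Convert a Common Voice style TSV into LJSpeech-style metadata.

    Assumes columns named `path` and `sentence` (an assumption; verify
    against your Common Voice release). Output lines look like:
        <clip id>|<transcription>|<normalized transcription>
    Here the normalized field just repeats the raw text; a real
    pipeline would plug a language-specific normalizer in instead.
    """
    with open(tsv_path, newline="", encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for row in csv.DictReader(fin, delimiter="\t"):
            # e.g. clips/common_voice_id_1.mp3 -> common_voice_id_1
            clip_id = Path(row["path"]).stem
            text = row["sentence"].strip()
            fout.write(f"{clip_id}|{text}|{text}\n")
```

This only handles the metadata side; the audio clips still need to be converted from mp3 to wav separately.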

stefangrotz commented 5 years ago

It would be great if Common Voice were supported out of the box. It looks like it is slowly becoming the biggest dataset for most languages, and for many languages it is the only one available under a free license. One can easily select subsets from the dataset, e.g. only male voices with a certain accent. This makes the dataset very interesting for TTS.

Converting the data format is one thing, but the Common Voice dataset uses mp3. Would it be necessary to convert everything into .wav files to use the data in Tacotron? Common Voice also has no normalized transcription of the text; is that a problem?

japita-se commented 5 years ago

Hi stefangrotz. Did you solve your problem? I would like to train too, using the Mozilla mp3 files. Should I convert them to .wav?

stefangrotz commented 5 years ago

@japita-se no, I haven't solved this problem yet; I first need more information.

japita-se commented 5 years ago

Hi. I succeeded in part. I converted all the Mozilla mp3 files into wav using sox, and now I'm training. However, after 3000 steps the synthesis is still pure noise. I think I need more steps. I am also wondering if the dataset is suitable: remember that the Mozilla dataset is multi-speaker, while LJSpeech is single-speaker, so I worry that the results will not be good. Maybe @keithito can say something on this.
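For anyone scripting the same conversion, a small sketch of building the sox commands from Python rather than a shell loop. The target sample rate here is only an example; it should match the `sample_rate` value in this repo's hparams.py, and sox itself must be installed for the commands to run.

```python
from pathlib import Path

def build_sox_commands(mp3_dir, wav_dir, sample_rate=22050):
    """Build one sox command per clip, converting mp3 -> mono wav.

    The resampling rate is an example value (adjust it to match
    `sample_rate` in hparams.py). Returns argument lists suitable
    for subprocess.run(cmd, check=True).
    """
    wav_dir = Path(wav_dir)
    cmds = []
    for mp3 in sorted(Path(mp3_dir).glob("*.mp3")):
        wav = wav_dir / (mp3.stem + ".wav")
        # sox <in.mp3> -r <rate> -c 1 <out.wav>: resample and downmix to mono
        cmds.append(["sox", str(mp3), "-r", str(sample_rate), "-c", "1", str(wav)])
    return cmds
```

Create `wav_dir` first, then execute each returned command with `subprocess.run(cmd, check=True)`.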

stefangrotz commented 5 years ago

@japita-se did you use the complete Common Voice dataset, or did you select a single speaker? Most TTS engines can only be trained on one voice. It is best to use the speaker with the largest number of recordings.
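Finding that speaker can be done directly from the Common Voice TSV. A sketch, assuming (as with the format conversion above) that a `client_id` column identifies speakers; all rows from other speakers would then be filtered out before building the metadata file.

```python
import csv
from collections import Counter

def biggest_speaker(tsv_path):
    """Return (client_id, clip_count) for the speaker with the most clips.

    Assumes a Common Voice style TSV with a `client_id` column
    (an assumption; verify against your release). Use the returned
    id to keep only that speaker's rows when building the dataset.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        counts = Counter(row["client_id"] for row in csv.DictReader(f, delimiter="\t"))
    return counts.most_common(1)[0]
```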

el-tocino commented 5 years ago

Common Voice recordings tend to be of lower quality than what you would ideally use for building a TTS model. Background noise, static, bumps, hisses, etc. are present in many of them, and those seem to get magnified by Tacotron. A single speaker using a studio recording setup is the kind of source you'd ideally use.

wahyubram82 commented 4 years ago

> I have tried to use common voice dataset with this code. Still unable to use. for speech to text Can you guide me what changes need to be marked ? Specially any data other than English. If anyone can help.

Bro, Tacotron is used for text-to-speech, not speech-to-text. But we could build an Indonesian speech model to run with Tacotron, combine it with a module that reads Indonesian text and saves it as wav files, and in that way easily generate a dataset for a speech-to-text model. The Indonesian Common Voice has only about 3 hours of speech; to train a speech-to-text model you need at least 500 hours.

Once you have a large enough dataset, you can start training to produce an Indonesian speech recognizer model.

Let's hope Capsule Networks can reduce the need for big datasets...