keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Arabic words (non-ASCII characters) training data #195

Open yoosif0 opened 6 years ago

yoosif0 commented 6 years ago

Currently, the training data looks like this:

nawar-spec-00001.npy|nawar-mel-00001.npy|1223| وَرَجَّحَ التَّقْرِيرُ الَّذِي أَعَدَّهُ مَعْهَدُ أَبْحَاثِ هَضَبَةِ التِّبِتِ فِي الْأَكَادِيمِيَّةِ الصِّينِيَّةِ لِلْعُلُومِ - أَنْ تَسْتَمِرَّ دَرَجَاتُ الْحَرَارَةِ وَمُسْتَوَيَاتُ الرُّطُوبَةِ فِي الْإِرْتِفَاعِ طَوَالَ هَذَا الْقَرْنْ
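
For readers unfamiliar with this layout, here is a minimal sketch of how such a line could be split into its fields. The field names (`spec_file`, `mel_file`, `n_frames`, `text`) and the helper `parse_metadata_line` are hypothetical, chosen for illustration. Reading the third field as a frame count is an assumption based on the example line above.

```python
from typing import NamedTuple


class MetadataRow(NamedTuple):
    # Hypothetical field names; the third field is assumed to be a frame count.
    spec_file: str
    mel_file: str
    n_frames: int
    text: str


def parse_metadata_line(line: str) -> MetadataRow:
    # Split on the first three pipes only, so a "|" inside the text survives.
    spec_file, mel_file, n_frames, text = line.rstrip("\n").split("|", 3)
    # The example line has a space after the last pipe, so strip the text.
    return MetadataRow(spec_file, mel_file, int(n_frames), text.strip())


row = parse_metadata_line(
    "nawar-spec-00001.npy|nawar-mel-00001.npy|1223| وَرَجَّحَ التَّقْرِيرُ"
)
print(row.spec_file, row.mel_file, row.n_frames)
```

The same parser would work unchanged for a phonetised transcript, since only the last field differs.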

Do you think that using phonetised words instead of Arabic words would make training easier?

An example of training data with phonetised words is shown below. It is similar to the phonetised words in CMUdict:

nawar-spec-00001.npy|nawar-mel-00001.npy|1223| W R A0 J A E A L T AE Q R E2 ..........

Thank you

keithito commented 6 years ago

I'm not sure. It would depend on how closely the Arabic alphabet maps to phonemes.
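
One rough way to start looking at this is to list which symbols the transcripts actually contain, split into base letters and combining diacritics. The sketch below is only an illustration under stated assumptions: `symbol_inventory` is a hypothetical helper, and treating Unicode category `Mn` (nonspacing mark) as "diacritic" is a simplification. It says nothing about pronunciation by itself.

```python
import unicodedata
from typing import Set, Tuple


def symbol_inventory(text: str) -> Tuple[Set[str], Set[str]]:
    """Return (base_characters, combining_marks) found in the text."""
    base: Set[str] = set()
    marks: Set[str] = set()
    for ch in text:
        if ch.isspace():
            continue
        # Arabic short-vowel signs, shadda and sukun are nonspacing marks (Mn).
        if unicodedata.category(ch) == "Mn":
            marks.add(ch)
        else:
            base.add(ch)
    return base, marks


base, marks = symbol_inventory("وَرَجَّحَ التَّقْرِيرُ")
print(len(base), "base characters,", len(marks), "combining marks")
for m in sorted(marks):
    print(f"U+{ord(m):04X}", unicodedata.name(m))
```

Running this over a whole transcript file would show the full character set that a character-level model would have to handle, which could then be compared against a phoneme set.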