keithito / tacotron

A TensorFlow implementation of Google's Tacotron speech synthesis with pre-trained model (unofficial)
MIT License

Succeeded in training THCHS30 in Mandarin Chinese! #118

Closed begeekmyfriend closed 5 years ago

begeekmyfriend commented 6 years ago

Hi all, good news: I have succeeded in training on THCHS30 in Mandarin Chinese. The code is open on my repo with only minor modifications against the master branch. I use pinyin phonemes as symbols, and the evaluation inputs are also sentences in Chinese pinyin. Since the number of Chinese characters is enormous, we can use pypinyin to transliterate them into a finite set of Latin symbols plus the digits 1-5 for tone. It produces good results at 90K steps. The evaluation audio is here: eval_audio_92k.zip, and the pre-trained model is here. The alignment graph looks very good: step-92000-align. However, THCHS30 is a multi-speaker dataset, so some evaluations occasionally have problems such as silence. I still need to find a better Mandarin dataset: single speaker, and long enough for training. All in all, this tacotron project has been very useful to me, thanks a lot.
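The pinyin-plus-tone tokenization described above could be sketched as follows. `pinyin_to_symbols` is a hypothetical helper for illustration, not code from the repo:

```python
def pinyin_to_symbols(transcript):
    """Split a space-separated pinyin transcript (e.g. pypinyin output with
    style=TONE3) into character-level symbol tokens: Latin letters plus a
    trailing tone digit 1-5. The whole symbol inventory stays tiny
    (26 letters + 5 tone digits) compared with thousands of Chinese characters."""
    symbols = []
    for syllable in transcript.split():
        symbols.extend(syllable)  # each letter / tone digit becomes one token
    return symbols
```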

lkfo415579 commented 6 years ago

@begeekmyfriend thanks for your contribution. But I discovered a problem: pypinyin's output will not exactly match THCHS30's pinyin because of heteronyms. Do you have any idea?

For instance: 可谁知纹完后她一照镜子只见左下眼睑的线又粗又黑与右侧明显不对称 (segmented: 可 谁知 纹 完 后 她 一 照镜子 只见 左下 眼睑 的 线 又 粗 又 黑 与 右侧 明显 不对称)

PYPINYIN (Using style=TONE3)

ke3 shei2 zhi1 wen2 wan2 hou4 ta1 yi1 zhao4 jing4 zi zhi3 jian4 zuo3 xia4 yan3 jian3 de xian4 you4 cu1 you4 hei1 yu3 you4 ce4 ming2 xian3 bu2 dui4 cheng1

THCHS30 SAMPLE

ke3 shui2 zhi1 wen2 wan2 hou4 ta1 yi1 zhao4 jing4 zi5 zhi3 jian4 zuo3 xia4 yan2 jian3 de5 xian4 you4 cu1 you4 hei1 yu3 you4 ce4 ming2 xian3 bu2 dui4 chen4
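One way to quantify the mismatch above is to diff the two syllable sequences position by position. `pinyin_diff` is an illustrative helper, not part of pypinyin or THCHS30's tooling:

```python
def pinyin_diff(predicted, reference):
    """Return (index, predicted, reference) for every position where the
    pypinyin output disagrees with the THCHS30 transcription."""
    pred, ref = predicted.split(), reference.split()
    return [(i, p, r) for i, (p, r) in enumerate(zip(pred, ref)) if p != r]

# On the full sentence above this reports five mismatches, including the
# heteronym shei2/shui2 and the neutral-tone convention zi/zi5.
mismatches = pinyin_diff("ke3 shei2 zhi1 wen2", "ke3 shui2 zhi1 wen2")
```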

begeekmyfriend commented 6 years ago

Small errors within big data do not matter, in my opinion. I think the error belongs to THCHS30.

emmacirl commented 6 years ago

@begeekmyfriend Hello, I have trained this model and found that with long sentences the synthesized .wav file gets truncated, so I often get incomplete sentences. Sometimes it also synthesizes abnormal sounds. Have you encountered this situation?

begeekmyfriend commented 6 years ago

You can try all of the trained checkpoints.

begeekmyfriend commented 6 years ago

BTW, in dataset/thchs30.py I am afraid you have to fetch all of the transcription files for feature preprocessing and then feed them into the tacotron model. Less than 30 hours of audio is not enough for training, since much of the speech content is shared across speakers in this dataset.
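Collecting every transcription file might look like the sketch below. The `data` subdirectory and the `A2_0.wav.trn` naming follow the standard THCHS30 layout, but treat the exact paths as an assumption rather than the repo's actual code:

```python
import glob
import os


def collect_transcriptions(data_dir):
    """Gather all THCHS30 *.trn transcription files (named like
    A2_0.wav.trn) under data_dir/data for feature preprocessing."""
    return sorted(glob.glob(os.path.join(data_dir, "data", "*.trn")))
```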

yyt233 commented 6 years ago

@begeekmyfriend Hello! Did you remove the male-voice portion of the THCHS30 dataset? And did you do any other processing on it? Thanks!

begeekmyfriend commented 6 years ago

The latest version removes the male voices and audio that is too quiet, but that leaves only 26 hours, which is clearly not enough. The full 34 hours including the male voices is sufficient.

begeekmyfriend commented 6 years ago

I plan to use PyDub to augment the wav data up to 50 hours of training data.
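PyDub offers ready-made utilities for this kind of augmentation. As a dependency-free illustration of one simple perturbation (rewriting the framerate header, which speeds playback up and shifts pitch), here is a stdlib sketch; treat it as one possible augmentation, not the author's actual pipeline:

```python
import wave


def perturb_speed(in_path, out_path, rate=1.1):
    """Naive speed perturbation: keep the samples but rewrite the framerate
    header so the clip plays back `rate` times faster (pitch shifts too).
    A real pipeline would resample properly, e.g. with PyDub."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    with wave.open(out_path, "wb") as dst:
        dst.setparams(params._replace(framerate=int(params.framerate * rate)))
        dst.writeframes(frames)
```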

yyt233 commented 6 years ago

@begeekmyfriend But I listened to the male-voice portion: a small part of the data is practically noise, and some even has dialect accents. The quality is poor, mainly in the A5 and A9 parts. Would you suggest removing them? Personally I would suggest downloading audio from the 得到 app instead, where the audio quality is higher.

emmacirl commented 6 years ago

@begeekmyfriend Using the THCHS30 corpus ("Loaded metadata for 13388 examples (34.20 hours)"), after 64K iterations long sentences basically cannot be synthesized, and short sentences sometimes work and sometimes come out as a stretch of noise. Have you run into this problem?

lkfo415579 commented 6 years ago

@emmacirl I have hit this problem too, without changing any code, using @begeekmyfriend's repo. step-140000-align The attention alignment still has not converged well by now (140K steps).

begeekmyfriend commented 6 years ago

The code does still need changes: modify dataset/thchs30.py so the data volume becomes "13388 examples (34.20 hours)".

Also, the alignment should look like the one in the first post; otherwise your training data volume is insufficient. If the latest model is not perfect, try other (even earlier) checkpoints.

You can also continue training on your own corpus from the model provided in the first post, using the checkpoint option.

You could consider downloading other audio data, but then you have to split it into sentences and annotate each with pinyin by hand, which is a huge amount of work. It is easier to use augmentation: perturb the original audio and duplicate it to enlarge the dataset.

begeekmyfriend commented 6 years ago

Although the male voices are not clean, when the data at hand is limited they are still worth including as training data.

As for alignment, in my experience THCHS30 does not produce a good result every time, but there is a fair chance of getting the result in the first post; it depends on the characteristics of the data itself. In general, if no obvious alignment has appeared by 20K steps, consider restarting.

yyt233 commented 6 years ago

@begeekmyfriend Hello! About "modify dataset/thchs30.py so the data volume becomes 13388 examples (34.20 hours)":

I don't quite understand this. thchs30.py doesn't seem to involve the data volume, does it? Could you explain? Thanks!

begeekmyfriend commented 6 years ago

Change the pattern to *.trn (so that all transcription files are matched).

yyt233 commented 6 years ago

@begeekmyfriend Thanks! I just saw your code update: it excludes B6 and C6 from the training set, but both of those are female voices.

emmacirl commented 6 years ago

@begeekmyfriend I think the test sentences used during training are fairly long and sound fine; it is only synthesis through demo_server that gives poor results. Separately, I put together my own Chinese corpus and saved it in THCHS30 format as .trn and .wav files, where each trn file has two lines: the text and the pinyin. Nothing else changed, but training aborts with an error:

[2018-02-28 12:34:20.478] Step 50 [1.985 sec/step, loss=0.94093, avg_loss=1.04232]
[2018-02-28 12:34:21.206] Step 51 [1.961 sec/step, loss=0.89600, avg_loss=1.03946]
[2018-02-28 12:34:22.411] Step 52 [1.946 sec/step, loss=0.93554, avg_loss=1.03746]
[2018-02-28 12:34:23.218] Step 53 [1.925 sec/step, loss=0.86981, avg_loss=1.03429]
[2018-02-28 12:34:24.163] Exiting due to exception: Incompatible shapes: [32,1545,80] vs. [32,1500,80]

The problem is located at this line: self.mel_loss = tf.reduce_mean(tf.abs(self.mel_targets - self.mel_outputs))

It seems the two tensors' shapes do not match. Did you hit this problem when you added THCHS30? I don't understand why the dimensions would differ.

begeekmyfriend commented 6 years ago

During eval and training, audio length is limited to max_iters * outputs_per_step * frame_shift_ms milliseconds. With the defaults (max_iters=200, outputs_per_step=5, frame_shift_ms=12.5), this is 12.5 seconds.

If your training examples are longer, you will see an error like this: Incompatible shapes: [32,1340,80] vs. [32,1000,80]

To fix this, you can set a larger value of max_iters by passing --hparams="max_iters=300" to train.py (replace "300" with a value based on how long your audio is and the formula above).
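The formula above can be turned into a small calculator for picking max_iters; these helpers are illustrative, not code from the repo:

```python
import math


def max_audio_seconds(max_iters=200, outputs_per_step=5, frame_shift_ms=12.5):
    """Longest clip the decoder can emit:
    max_iters * outputs_per_step * frame_shift_ms milliseconds."""
    return max_iters * outputs_per_step * frame_shift_ms / 1000.0


def min_max_iters(audio_seconds, outputs_per_step=5, frame_shift_ms=12.5):
    """Smallest max_iters that covers a clip of the given length."""
    return math.ceil(audio_seconds * 1000.0 / (outputs_per_step * frame_shift_ms))

# With the defaults, max_audio_seconds() is 12.5 s. The failing batch above
# had 1545 mel frames, i.e. 1545 * 12.5 ms = 19.3125 s, needing max_iters=309.
```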

emmacirl commented 6 years ago

@begeekmyfriend Thanks a lot. I spent a whole afternoon trying to track down this problem. Thanks!

yyt233 commented 6 years ago

tim 20180301100257 Has anyone run into this situation (see the attached alignment image)? Any tips for fixing it? Thanks!

yyt233 commented 6 years ago

@begeekmyfriend Hello, have you ever run into my situation? I have trained 40K steps so far and the alignment still looks like the image above.

begeekmyfriend commented 6 years ago

Never seen it; my guess is an error in your data preprocessing.

liyz15 commented 6 years ago

I think the silence problem is caused by the text rather than by duration or multiple speakers. There is a lot of duplicate text read by different speakers in THCHS30. I tried another, smaller multi-speaker dataset; after 73K steps the evaluations still sound robotic, but I never got silent results.

wlxzt commented 6 years ago

Have you tried Tacotron 2 for Mandarin?

begeekmyfriend commented 6 years ago

@liyz15 I tested some sentences whose content matches wav files in the THCHS30 dataset that have silence at the beginning and end, and then tested other sentences whose matching wav files have no silence. I found that it is the silence in the dataset that causes the problem in evaluation.

zuoxiang95 commented 6 years ago

@wlxzt I'm trying to use Tacotron 2 for Mandarin, but I have not gotten good results yet. The model is still training.

HwarLee commented 6 years ago

@begeekmyfriend There are three lines in each *.trn file; how can I use the first line (Chinese characters) to train? I don't really understand the meaning of the Non-English Data section in TRAINING_DATA.md. Looking forward to your reply, thank you!

begeekmyfriend commented 6 years ago

@HwarLee I recommend reading the second line, the pinyin transcription, as the tokens for embedding. The number of Chinese characters is so enormous that brute force over them is too hard, so we had better transform them into a finite set of Latin characters as symbol tokens.
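Reading the pinyin line out of a THCHS30 *.trn file (three lines: characters, pinyin syllables, initials/finals) could be as simple as this sketch; the helper name is hypothetical:

```python
def load_pinyin_transcript(trn_path):
    """Return the second line of a THCHS30 *.trn file: the space-separated
    pinyin-with-tone transcription recommended above as embedding tokens."""
    with open(trn_path, encoding="utf-8") as f:
        lines = [line.strip() for line in f]
    return lines[1]
```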

begeekmyfriend commented 6 years ago

Hi all, a great improvement has been made! I have changed hyperparameters such as num_mels, num_freq, frame_length_ms and frame_shift_ms according to the audio sample rate.

The sample rate of THCHS30 is 16 kHz, versus 22.05 kHz for LJ Speech. That means fewer samples over the same duration, so we can increase the time span of a frame and reduce the number of FFT points. Moreover, we can keep the same max_iters value for a piece of audio of the same duration as at 22.05 kHz.

It has also brought faster convergence: alignment appeared in under 15K steps for THCHS30, feeding only a 25-hour subset of its files (the full set is 34 hours), about the length of LJ Speech. Works like a charm! It seems the lower the sample rate, the more efficiently training runs. step-13000-align step-25000-align @keithito I think you might add this approach to the README; it could be very useful for users.
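The sample-rate-dependent frame sizes behind this change can be computed as follows; the millisecond defaults match the values the thread mentions, and the helper itself is illustrative:

```python
def frame_params(sample_rate, frame_length_ms=50.0, frame_shift_ms=12.5):
    """Window/hop sizes in samples for a given sample rate. The same
    millisecond values need fewer samples at 16 kHz than at LJ Speech's
    22.05 kHz, so the FFT (next power of two above the window) shrinks."""
    win = int(sample_rate * frame_length_ms / 1000)
    hop = int(sample_rate * frame_shift_ms / 1000)
    n_fft = 1
    while n_fft < win:
        n_fft *= 2  # smallest power of two covering the window
    return win, hop, n_fft

# 16 kHz: 800-sample window, 200-sample hop, 1024-point FFT
# 22.05 kHz: 1102-sample window, 275-sample hop, 2048-point FFT
```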

wotulong commented 6 years ago

Did speech quality improve at the eval step? Any samples? @begeekmyfriend

begeekmyfriend commented 6 years ago

The speech quality in evaluation depends largely on the training dataset. As I said, the voice in THCHS30 does not fill the full length of the audio, so there will be silent output in evaluation. I am afraid you still need your own Mandarin dataset for training.

The improvement is in training efficiency. In my opinion, 16 kHz, 16-bit, mono is good enough for speech quality. Because of the lower sample rate, training can use fewer FFT points, fewer mel bands, and a longer frame duration. This reduces both the maximum input and output lengths of the training audio and the maximum iterations needed. The time consumed per step decreased in my training.

keithito commented 6 years ago

@begeekmyfriend: that's an interesting observation! I'll add something to the README.

zfowen commented 6 years ago

@begeekmyfriend Why use THCHS30 at all? It isn't even one person's recordings. It is easy to find audiobooks online with tens of hours of a single speaker.

begeekmyfriend commented 6 years ago

@zfowen I don't dare use copyrighted material.

zfowen commented 6 years ago

@begeekmyfriend But you are not at the stage where copyright matters yet. With THCHS30 you cannot possibly synthesize ideal results, because the data is too poor. And with poor results you cannot tell whether the technique itself is sound, or whether the bad output is merely caused by bad training data. First find some decent-quality data and get good results, that is, solve the technical problems. Then, when it is time to release, swap in data you have the rights to.

ghost commented 6 years ago

@zfowen Hi, are there any usable audiobooks? I found the open-source 希尔贝壳 (AISHELL) corpus, but judging by the corpus criteria @begeekmyfriend gave earlier, it does not seem very suitable :(

begeekmyfriend commented 6 years ago

@zfowen But that requires a lot of up-front work: you have to split long audio into sentences, annotate each one, and turn it all into trainable data, and the volume needed is not small, at least 25 hours in my experience. Where would you get that much high-quality data? If you really want to hear the result, the end of this WeChat official-account article provides a sample, though it was not trained for long, only 133K iterations.

ghost commented 6 years ago

@begeekmyfriend What does 133K iterations mean? Training on thchs30, at step 158000 I hit this error:

Step 158127 [2.044 sec/step, loss=0.09895, avg_loss=0.08429]
Step 158128 [2.044 sec/step, loss=nan, avg_loss=nan]
Loss exploded to nan at step 158128! :(

zfowen commented 6 years ago

@DavidAksnes There should be plenty. I just had a look: this audiobook, 《中华上下五千年》 read by 李红岩, runs more than 30 hours at 32 kbps and has LRC text matching the mp3s. What do you think?

zfowen commented 6 years ago

@begeekmyfriend Take a look at the audio I mentioned above: it has LRC files, which amount to annotation with the text-to-audio alignment already done. The remaining audio splitting is not hard.

begeekmyfriend commented 6 years ago

@zfowen It is good, but the volume is large; it would take a week or two at least. And I expect the background music would need to be removed, keeping only the dry voice. @DavidAksnes Your intermediate model has been saved; just continue training with the --restore_step option.

zfowen commented 6 years ago

@begeekmyfriend I listened to quite a few segments and heard no music. There is probably none, or very little.

zfowen commented 6 years ago

@begeekmyfriend I listened to the sample at the bottom of the WeChat article and the result is quite good. Was it trained purely on THCHS30 data, or on THCHS30 combined with keithito's pre-trained model?

begeekmyfriend commented 6 years ago

@zfowen That is a private dataset with copyrights.

begeekmyfriend commented 6 years ago

Hi all, I am sorry; I may have been wrong. Modifying hyperparameters such as the mel filterbank does not improve but rather impairs the evaluation results, so I have changed it back to 80. Apologies for my mistake. See my latest commit https://github.com/begeekmyfriend/tacotron/commit/cbc2b877e2674a1271ab226d03ba3de64f69a65e

ghost commented 6 years ago

@zfowen 《中华上下五千年》 read by 李红岩 is quite good :) I downloaded the mp3 and lrc files from here:

https://pan.baidu.com/share/link?uk=3423868881&shareid=1622121731#list/path=%2F

However, the sentence timings in the lrc files do not match the actual audio :( Do you have any more accurate segmentation files?

yyt233 commented 6 years ago

I also listened to the recommended 《中华上下五千年》 (李红岩) corpus. My advice: when choosing a corpus, pay close attention to variation in speaking rate, loudness, pauses, and so on. This corpus sometimes has very long pauses and sometimes short ones, sometimes fast speech and sometimes slow; the loudness varies with no regularity, and there is even in-character voice acting. Based on my earlier failure with a 20-hour corpus, I would not recommend it. That said, I don't know whether a longer corpus would work better; if anyone tries, please share the results.

butterl commented 6 years ago

Beijing Shell Shell has open-sourced its Mandarin data (AISHELL); not sure whether it could be used to train Mandarin with Tacotron 2.

zfowen commented 6 years ago

@DavidAksnes I don't have any. @yyt233 Thanks. Let's keep looking for something more consistent; with so many audiobooks out there, something more suitable should exist.

sunnnnnnnny commented 6 years ago

@begeekmyfriend Hello, I have been training on the thchs30 dataset for four or five hours and have only completed 1000 steps with batch_size=16. I am puzzled.