tugstugi / pytorch-dc-tts

Text to Speech with PyTorch (English and Mongolian)
MIT License

Speed Up Inference #5

Closed liberocks closed 5 years ago

liberocks commented 5 years ago

I have successfully trained a TTS model for Indonesian. I've attached a sample result alongside the original utterance. The model was trained on 23,400 audio clips (about 17 hours) using the modified hyperparameters attached below.

The result is impressive and very natural thanks to your great implementation, but I'm still not satisfied with the inference time; I believe it can be faster. Neither running on the GPU nor using half precision helps. Can you give me any hints on speeding this up? Thank you.
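One thing worth checking before touching hardware: DC-TTS generates the mel spectrogram autoregressively, and a naive synthesis loop runs the full `max_T` (here 210) steps regardless of how short the input text is, which would explain why duration grows in coarse steps with character count. A minimal sketch of early stopping based on the attention position (the `step_fn` below is a toy stand-in, not this repo's actual API):

```python
# Illustrative sketch only: stop the autoregressive Text2Mel loop once the
# attention has dwelt on the last input character for a few frames, instead
# of always generating max_T frames.

def generate_mel(step_fn, n_chars, max_T=210, patience=3):
    """step_fn(t) -> index of the most-attended character at decode step t."""
    frames_past_end = 0
    for t in range(max_T):
        attended = step_fn(t)
        if attended >= n_chars - 1:
            frames_past_end += 1
            if frames_past_end >= patience:  # attention has finished the text
                return t + 1                 # frames actually generated
        else:
            frames_past_end = 0
    return max_T

# Toy attention that advances one character every two frames:
steps = generate_mel(lambda t: t // 2, n_chars=20)  # → 41 frames, not 210
```

For a 20-character input this stops after ~41 frames rather than 210, so the saving on short sentences can be substantial; the real loop would additionally need `torch.no_grad()` around the forward passes to avoid autograd bookkeeping.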

Appendixes

Hyperparameters

"""Hyper parameters."""
__author__ = 'Erdene-Ochir Tuguldur'

class HParams:
    """Hyper parameters"""

    disable_progress_bar = False  # set True if you don't want the progress bar in the console

    logdir = "logdir"  # log dir where the checkpoints and tensorboard files are saved

    # audio.py options, these values are from https://github.com/Kyubyong/dc_tts/blob/master/hyperparams.py
    reduction_rate = 4  # melspectrogram reduction rate, don't change because SSRN is using this rate
    n_fft = 2048 # fft points (samples)
    n_mels = 80  # Number of Mel banks to generate
    power = 1.5  # Exponent for amplifying the predicted magnitude
    n_iter = 50  # Number of inversion iterations
    preemphasis = .97
    max_db = 100
    ref_db = 20
    sr = 16000  # Sampling rate
    frame_shift = 0.05  # seconds
    frame_length = 0.75  # seconds
    hop_length = int(sr * frame_shift)  # samples. = 800.
    win_length = int(sr * frame_length)  # samples. = 12000.
    max_N = 180  # Maximum number of characters.
    max_T = 210  # Maximum number of mel frames.

    e = 128  # embedding dimension
    d = 256  # Text2Mel hidden unit dimension
    c = 512+128  # SSRN hidden unit dimension

    dropout_rate = 0.05  # dropout

    # Text2Mel network options
    text2mel_lr = 0.005  # learning rate
    text2mel_max_iteration = 300000  # max train step
    text2mel_weight_init = 'none'  # 'kaiming', 'xavier' or 'none'
    text2mel_normalization = 'layer'  # 'layer', 'weight' or 'none'
    text2mel_basic_block = 'gated_conv'  # 'highway', 'gated_conv' or 'residual'
    text2mel_batchsize = 64

    # SSRN network options
    ssrn_lr = 0.0005  # learning rate
    ssrn_max_iteration = 150000  # max train step
    ssrn_weight_init = 'kaiming'  # 'kaiming', 'xavier' or 'none'
    ssrn_normalization = 'weight'  # 'layer', 'weight' or 'none'
    ssrn_basic_block = 'residual'  # 'highway', 'gated_conv' or 'residual'
    ssrn_batchsize = 24
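As a quick sanity check on the STFT settings above (pure arithmetic, no dependencies), the derived hop and window sizes are:

```python
# Derived STFT sizes from the hyperparameters above.
sr = 16000           # sampling rate, Hz
frame_shift = 0.05   # seconds between frames
frame_length = 0.75  # seconds per analysis window

hop_length = int(sr * frame_shift)   # 800 samples per frame shift
win_length = int(sr * frame_length)  # 12000 samples per window

print(hop_length, win_length)  # → 800 12000
```

These are much larger than the upstream Kyubyong defaults (which use a 12.5 ms shift and 50 ms window at 22050 Hz), so each mel frame covers more audio; that directly affects both `max_T` and how much work Griffin-Lim inversion (`n_iter = 50`) does per utterance.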

Sample audio

result.zip

Inference Time

| Character count | Average duration (s) | CPU utilization (%) |
|---|---|---|
| 15 < c ≤ 20 | 9 | 55.1 |
| 20 < c ≤ 25 | 9 | 38.1 |
| 25 < c ≤ 30 | 12 | 70.9 |
| 30 < c ≤ 35 | 12 | 71.9 |
| 35 < c ≤ 40 | 12 | 72.7 |
| 40 < c ≤ 45 | 12 | 72.7 |
| 45 < c ≤ 50 | 12 | 72.2 |
| 50 < c ≤ 55 | 15 | 72.4 |
| 55 < c ≤ 60 | 15 | 71.6 |
| 60 < c ≤ 65 | 15 | 71.4 |
| 65 < c ≤ 70 | 18 | 71.3 |
| 70 < c ≤ 75 | 21 | 70.7 |
| 75 < c ≤ 80 | 27 | 90.2 |
| 80 < c ≤ 85 | 18 | 70.6 |
| 85 < c ≤ 90 | 24 | 70 |
| 90 < c ≤ 95 | 24 | 70.1 |
| 95 < c ≤ 100 | 24 | 70.2 |
| 100 < c ≤ 105 | 24 | 69.9 |
| 105 < c ≤ 110 | 24 | 69.1 |
| 110 < c ≤ 115 | 27 | 69.2 |
| 115 < c ≤ 120 | 33 | 69.3 |
| 120 < c ≤ 125 | 33 | 77.4 |
| 125 < c ≤ 130 | 42 | 81.7 |
| 130 < c ≤ 135 | 48 | 81.2 |
| 135 < c ≤ 140 | 48 | 80.7 |
| 140 < c ≤ 145 | 63 | 84.1 |
| 145 < c ≤ 150 | 63 | 84 |
| 150 < c ≤ 155 | 81 | 82.7 |
| 155 < c ≤ 160 | 75 | 82.9 |
| 160 < c ≤ 165 | 72 | 81.6 |
| 165 < c ≤ 170 | 81 | 82.3 |
| 170 < c ≤ 175 | 87 | 83.1 |
| 175 < c ≤ 180 | 87 | 82.7 |
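A table like the one above can be collected with a small timing harness; this sketch buckets wall-clock synthesis time by character count in steps of 5, matching the ranges used here (`synthesize` is a hypothetical stand-in for the actual TTS call):

```python
import time

def time_synthesis(synthesize, sentences):
    """Average wall-clock synthesis time, bucketed by character count.

    Bucket key k covers the range k < len(s) <= k + 5, as in the table.
    """
    buckets = {}
    for s in sentences:
        start = time.perf_counter()
        synthesize(s)                    # stand-in for the real TTS call
        elapsed = time.perf_counter() - start
        key = (len(s) - 1) // 5 * 5      # 5-character buckets
        buckets.setdefault(key, []).append(elapsed)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
```

For example, a 17-character sentence lands in the `15 < c ≤ 20` bucket (key 15). Using `time.perf_counter()` rather than `time.time()` gives a monotonic, high-resolution clock for interval timing.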
tugstugi commented 5 years ago

This repo is not actively maintained. Try NVIDIA's Tacotron 2 (https://github.com/NVIDIA/tacotron2), which is much faster.