tugstugi / pytorch-dc-tts

Text to Speech with PyTorch (English and Mongolian)
MIT License

Speed Up Inference #5

Closed liberocks closed 5 years ago

liberocks commented 5 years ago

I have successfully trained a TTS model for Indonesian. I've attached a sample result alongside the original utterance. The model was trained on 23,400 audio clips (about 17 hours) using the modified hyperparameters attached below.

The result is impressive and very natural thanks to your great implementation, but I'm still not satisfied with the inference time; I believe it can be faster. Neither running on the GPU nor using half precision helps. Can you give me any hints on speeding this up? Thank you.
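One thing worth checking before touching hardware: DC-TTS generates the mel spectrogram autoregressively, and a naive synthesis loop runs the full `max_T` (here 210) steps regardless of how short the input text is, which would explain why duration grows in coarse steps with character count. A minimal sketch of early stopping based on the attention position (the `step_fn` below is a toy stand-in, not this repo's actual API):

```python
# Illustrative sketch only: stop the autoregressive Text2Mel loop once the
# attention has dwelt on the last input character for a few frames, instead
# of always generating max_T frames.

def generate_mel(step_fn, n_chars, max_T=210, patience=3):
    """step_fn(t) -> index of the most-attended character at decode step t."""
    frames_past_end = 0
    for t in range(max_T):
        attended = step_fn(t)
        if attended >= n_chars - 1:
            frames_past_end += 1
            if frames_past_end >= patience:  # attention has finished the text
                return t + 1                 # frames actually generated
        else:
            frames_past_end = 0
    return max_T

# Toy attention that advances one character every two frames:
steps = generate_mel(lambda t: t // 2, n_chars=20)  # → 41 frames, not 210
```

For a 20-character input this stops after ~41 frames rather than 210, so the saving on short sentences can be substantial; the real loop would additionally need `torch.no_grad()` around the forward passes to avoid autograd bookkeeping.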

Appendixes

Hyperparameters

"""Hyper parameters."""
__author__ = 'Erdene-Ochir Tuguldur'

class HParams:
    """Hyper parameters"""

    disable_progress_bar = False  # set True if you don't want the progress bar in the console

    logdir = "logdir"  # log dir where the checkpoints and tensorboard files are saved

    # audio.py options, these values are from https://github.com/Kyubyong/dc_tts/blob/master/hyperparams.py
    reduction_rate = 4  # melspectrogram reduction rate, don't change because SSRN is using this rate
    n_fft = 2048 # fft points (samples)
    n_mels = 80  # Number of Mel banks to generate
    power = 1.5  # Exponent for amplifying the predicted magnitude
    n_iter = 50  # Number of inversion iterations
    preemphasis = .97
    max_db = 100
    ref_db = 20
    sr = 16000  # Sampling rate
    frame_shift = 0.05  # seconds
    frame_length = 0.75  # seconds
    hop_length = int(sr * frame_shift)  # samples. = 800.
    win_length = int(sr * frame_length)  # samples. = 12000.
    max_N = 180  # Maximum number of characters.
    max_T = 210  # Maximum number of mel frames.

    e = 128  # embedding dimension
    d = 256  # Text2Mel hidden unit dimension
    c = 512+128  # SSRN hidden unit dimension

    dropout_rate = 0.05  # dropout

    # Text2Mel network options
    text2mel_lr = 0.005  # learning rate
    text2mel_max_iteration = 300000  # max train step
    text2mel_weight_init = 'none'  # 'kaiming', 'xavier' or 'none'
    text2mel_normalization = 'layer'  # 'layer', 'weight' or 'none'
    text2mel_basic_block = 'gated_conv'  # 'highway', 'gated_conv' or 'residual'
    text2mel_batchsize = 64

    # SSRN network options
    ssrn_lr = 0.0005  # learning rate
    ssrn_max_iteration = 150000  # max train step
    ssrn_weight_init = 'kaiming'  # 'kaiming', 'xavier' or 'none'
    ssrn_normalization = 'weight'  # 'layer', 'weight' or 'none'
    ssrn_basic_block = 'residual'  # 'highway', 'gated_conv' or 'residual'
    ssrn_batchsize = 24
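As a quick sanity check on the STFT settings above (pure arithmetic, no dependencies), the derived hop and window sizes are:

```python
# Derived STFT sizes from the hyperparameters above.
sr = 16000           # sampling rate, Hz
frame_shift = 0.05   # seconds between frames
frame_length = 0.75  # seconds per analysis window

hop_length = int(sr * frame_shift)   # 800 samples per frame shift
win_length = int(sr * frame_length)  # 12000 samples per window

print(hop_length, win_length)  # → 800 12000
```

These are much larger than the upstream Kyubyong defaults (which use a 12.5 ms shift and 50 ms window at 22050 Hz), so each mel frame covers more audio; that directly affects both `max_T` and how much work Griffin-Lim inversion (`n_iter = 50`) does per utterance.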

Sample audio

result.zip

Inference Time

| Character count | Average duration (s) | CPU utilization (%) |
|---|---|---|
| 15 < c ≤ 20 | 9 | 55.1 |
| 20 < c ≤ 25 | 9 | 38.1 |
| 25 < c ≤ 30 | 12 | 70.9 |
| 30 < c ≤ 35 | 12 | 71.9 |
| 35 < c ≤ 40 | 12 | 72.7 |
| 40 < c ≤ 45 | 12 | 72.7 |
| 45 < c ≤ 50 | 12 | 72.2 |
| 50 < c ≤ 55 | 15 | 72.4 |
| 55 < c ≤ 60 | 15 | 71.6 |
| 60 < c ≤ 65 | 15 | 71.4 |
| 65 < c ≤ 70 | 18 | 71.3 |
| 70 < c ≤ 75 | 21 | 70.7 |
| 75 < c ≤ 80 | 27 | 90.2 |
| 80 < c ≤ 85 | 18 | 70.6 |
| 85 < c ≤ 90 | 24 | 70 |
| 90 < c ≤ 95 | 24 | 70.1 |
| 95 < c ≤ 100 | 24 | 70.2 |
| 100 < c ≤ 105 | 24 | 69.9 |
| 105 < c ≤ 110 | 24 | 69.1 |
| 110 < c ≤ 115 | 27 | 69.2 |
| 115 < c ≤ 120 | 33 | 69.3 |
| 120 < c ≤ 125 | 33 | 77.4 |
| 125 < c ≤ 130 | 42 | 81.7 |
| 130 < c ≤ 135 | 48 | 81.2 |
| 135 < c ≤ 140 | 48 | 80.7 |
| 140 < c ≤ 145 | 63 | 84.1 |
| 145 < c ≤ 150 | 63 | 84 |
| 150 < c ≤ 155 | 81 | 82.7 |
| 155 < c ≤ 160 | 75 | 82.9 |
| 160 < c ≤ 165 | 72 | 81.6 |
| 165 < c ≤ 170 | 81 | 82.3 |
| 170 < c ≤ 175 | 87 | 83.1 |
| 175 < c ≤ 180 | 87 | 82.7 |
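A table like the one above can be collected with a small timing harness; this sketch buckets wall-clock synthesis time by character count in steps of 5, matching the ranges used here (`synthesize` is a hypothetical stand-in for the actual TTS call):

```python
import time

def time_synthesis(synthesize, sentences):
    """Average wall-clock synthesis time, bucketed by character count.

    Bucket key k covers the range k < len(s) <= k + 5, as in the table.
    """
    buckets = {}
    for s in sentences:
        start = time.perf_counter()
        synthesize(s)                    # stand-in for the real TTS call
        elapsed = time.perf_counter() - start
        key = (len(s) - 1) // 5 * 5      # 5-character buckets
        buckets.setdefault(key, []).append(elapsed)
    return {k: sum(v) / len(v) for k, v in sorted(buckets.items())}
```

For example, a 17-character sentence lands in the `15 < c ≤ 20` bucket (key 15). Using `time.perf_counter()` rather than `time.time()` gives a monotonic, high-resolution clock for interval timing.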
tugstugi commented 5 years ago

This repo is not actively maintained. Try NVIDIA's Tacotron 2 (https://github.com/NVIDIA/tacotron2), which is much faster.