mozilla / TTS

:robot: :speech_balloon: Deep learning for Text to Speech (Discussion forum: https://discourse.mozilla.org/c/tts)
Mozilla Public License 2.0

Note on CPU inference performance #194

Closed geneing closed 5 years ago

geneing commented 5 years ago

I have a solution for slow inference on CPU. You should try setting the environment variable OMP_NUM_THREADS=1 before running the Python script. When PyTorch is allowed to set the thread count equal to the number of CPU cores, it takes 10x longer to synthesize text.

It's really a problem with PyTorch and the BLAS libraries, not TTS. However, it leads to the perception that TTS inference is slow. I would suggest documenting it in the README file.
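For example, a minimal sketch of the two ways to cap the thread count (torch.set_num_threads(1) is an equivalent runtime knob, my own addition rather than something discussed above):

import os
os.environ["OMP_NUM_THREADS"] = "1"   # must be set before torch is imported to reliably affect OpenMP

import torch
torch.set_num_threads(1)              # runtime alternative: cap PyTorch's intra-op thread pool
print(torch.get_num_threads())        # confirm the setting took effect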

erogol commented 5 years ago

It gave almost a 1-second boost for a 6-second run. Nice trick! Do you have any reference to point to about this issue?

mrgloom commented 5 years ago

Does it depend on hardware or sentence length?

I can't see any difference on an 8-core Mac notebook running on CPU:

Benchmark on 824c09120b6e6bf4be99d03cc07e611ecbf140fe on dev-tacotron2:

pretrained_models
├── ljspeech-260k
│   ├── checkpoint_260000.pth.tar
│   ├── config.json
│   └── events.out.tfevents.1554744767.erogol-desktop
└── mold_ljspeech_best_model
    ├── checkpoint_393000.pth.tar
    ├── checkpoint_433000.pth.tar
    └── config.json

CONFIG {'github_branch': 'dev-tacotron2', 'restore_path': '/media/erogol/data_ssd/Data/models/ljspeech_models/4241_best_t2_model/best_model.pth.tar', 'run_name': 'ljspeech', 'run_description': 'finetune 4241 for align with architectural changes', 'audio': {'num_mels': 80, 'num_freq': 1025, 'sample_rate': 22050, 'frame_length_ms': 50, 'frame_shift_ms': 12.5, 'preemphasis': 0.98, 'min_level_db': -100, 'ref_level_db': 20, 'power': 1.5, 'griffin_lim_iters': 60, 'signal_norm': True, 'symmetric_norm': False, 'max_norm': 1, 'clip_norm': True, 'mel_fmin': 0.0, 'mel_fmax': 8000.0, 'do_trim_silence': True}, 'distributed': {'backend': 'nccl', 'url': 'tcp://localhost:54321'}, 'reinit_layers': [], 'model': 'Tacotron2', 'grad_clip': 1, 'epochs': 1000, 'lr': 0.0001, 'lr_decay': False, 'warmup_steps': 4000, 'windowing': True, 'memory_size': 5, 'attention_norm': 'softmax', 'prenet_type': 'bn', 'use_forward_attn': True, 'transition_agent': False, 'loss_masking': False, 'enable_eos_bos_chars': True, 'batch_size': 16, 'eval_batch_size': 16, 'r': 1, 'wd': 1e-06, 'checkpoint': True, 'save_step': 1000, 'print_step': 100, 'tb_model_param_stats': True, 'batch_group_size': 8, 'run_eval': True, 'test_delay_epochs': 2, 'data_path': '/home/erogol/Data/LJSpeech-1.1', 'meta_file_train': 'metadata_train.csv', 'meta_file_val': 'metadata_val.csv', 'dataset': 'ljspeech', 'min_seq_len': 0, 'max_seq_len': 150, 'output_path': '/media/erogol/data_ssd/Data/models/ljspeech_models/', 'num_loader_workers': 8, 'num_val_loader_workers': 4, 'phoneme_cache_path': 'ljspeech_phonemes', 'use_phonemes': True, 'phoneme_language': 'en-us', 'text_cleaner': 'phoneme_cleaners'}

Default:
    Tacotron2 model load time: 0.51 sec
    Tacotron2 inference time warmup: 0.91 sec
    Tacotron2 inference time short sentence: 0.38 sec
    Tacotron2 inference time long sentence: 2.93 sec

Setting os.environ["OMP_NUM_THREADS"] = "1" in the Python script:
    Tacotron2 model load time: 0.51 sec
    Tacotron2 inference time warmup: 0.74 sec
    Tacotron2 inference time short sentence: 0.37 sec
    Tacotron2 inference time long sentence: 2.91 sec

Running as time OMP_NUM_THREADS=1 python demo_tts_cpu.py:
    Tacotron2 model load time: 0.5 sec
    Tacotron2 inference time warmup: 0.69 sec
    Tacotron2 inference time short sentence: 0.37 sec
    Tacotron2 inference time long sentence: 2.91 sec

Running via a bash script with the env variables set:
cat run_tts_demo.sh
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
time python3 demo_tts_cpu.py
    Tacotron2 model load time: 0.51 sec
    Tacotron2 inference time warmup: 0.7 sec
    Tacotron2 inference time short sentence: 0.37 sec
    Tacotron2 inference time long sentence: 2.9 sec

My test code:

import time

import torch

# Assumed context for this snippet: CONFIG, use_cuda, MODEL_PATH and ap (the AudioProcessor)
# are defined earlier in the script; synthesis() is assumed to come from the repo's utils.
from utils.synthesis import synthesis

def load_tocotron_2_model():
    from utils.text.symbols import symbols, phonemes
    from models.tacotron2 import Tacotron2

    n_chars = len(phonemes) if CONFIG.use_phonemes else len(symbols)  # 'use_phonemes': True
    print('n_chars', n_chars)

    model = Tacotron2(num_chars=n_chars, r=CONFIG.r, attn_win=CONFIG.windowing, attn_norm=CONFIG.attention_norm,
                      prenet_type=CONFIG.prenet_type, forward_attn=CONFIG.use_forward_attn,
                      trans_agent=CONFIG.transition_agent)

    # load model state
    if use_cuda:
        cp = torch.load(MODEL_PATH)
    else:
        cp = torch.load(MODEL_PATH, map_location='cpu')

    # load the model
    model.load_state_dict(cp['model'])

    if use_cuda:
        model.cuda()

    model.eval() # Set eval mode

    print("cp['step']", cp['step']) #

    return model

def benchmark_tacotron2(model, text):
    start = time.time()

    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens = synthesis(model, text, CONFIG, use_cuda, ap,
                                                                             truncated=False,
                                                                             enable_eos_bos_chars=CONFIG.enable_eos_bos_chars,
                                                                             trim_silence=False)
    t = round(time.time()-start, 2)

    return t

if __name__ == '__main__':
    start = time.time()
    tacotron_model = load_tocotron_2_model()
    print('Tacotron2 model load time:', round(time.time()-start,2), 'sec')

    # Warmup
    t = benchmark_tacotron2(tacotron_model, "Warmup!")
    print('Tacotron2 inference time warmup:', t, 'sec')

    # Short sentence
    t = benchmark_tacotron2(tacotron_model, "Hello!")
    print('Tacotron2 inference time short sentence:', t, 'sec')

    # Long sentence
    t = benchmark_tacotron2(tacotron_model, "Quick brown fox jumps over the lazy dog.")
    print('Tacotron2 inference time long sentence:', t, 'sec')

Also, as far as I can see, GL (Griffin-Lim) is done on the CPU: https://github.com/mozilla/TTS/blob/618b2808127f6fd00fe643fe6e852ddf1d2986e1/utils/audio.py#L155
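As a rough sketch of how that step can be timed on its own (assuming ap is the utils.audio.AudioProcessor and mel_postnet_spec is the Tacotron2 postnet output, as in the test code above):

import time

start = time.time()
# Griffin-Lim runs on the CPU (numpy/librosa), regardless of where the model ran
wav = ap.inv_mel_spectrogram(mel_postnet_spec.T)
print('ap.inv_mel_spectrogram time:', round(time.time() - start, 2), 'sec')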

Also, I wonder about the time reported here (https://github.com/mozilla/TTS/tree/dev-tacotron2#runtime): is it for GL on GPU only, or for Tacotron2 on GPU + GL on GPU? Is it only inference time, without model loading time? And for which sentence?

Also, I wonder whether some Tacotron2 parameters could be tuned for faster inference on short sentences?

geneing commented 5 years ago

@erogol I don't have a reference. I was trying to understand why inference on my computer was much slower than expected. While profiling I noticed many OpenMP library hits. I have tried using OpenMP in other projects and the performance was always abysmal: too much threading overhead. Lowering the OpenMP thread count improved the performance.

@mrgloom Are the PyTorch and BLAS libraries built with OpenMP support on the Mac? I see an open feature request to add OpenMP support: https://github.com/pytorch/pytorch/issues/6328. You could check by looking at CPU utilization during inference. If it looks like only one core is used, then OpenMP is probably not used in your setup.
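A quick way to check from Python (my own sketch; both calls are standard PyTorch):

import torch

print(torch.backends.openmp.is_available())   # True if this PyTorch build has OpenMP support
print(torch.get_num_threads())                # how many intra-op threads PyTorch will use by default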

erogol commented 5 years ago

I can verify that it improves the performance slightly on Linux.

mrgloom commented 5 years ago

On a 12-core Linux machine with the CPU-only version of PyTorch installed in a virtualenv:

I still can't see any speedup; actually it runs even slower. With the default settings I can see in htop that more than one core is busy. Also, as far as I can tell, ap.inv_mel_spectrogram takes more than half of the time.
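As a rough sketch of how that split can be confirmed (assuming the tacotron_model and benchmark_tacotron2 from the test code above), one long-sentence run can be profiled with cProfile:

import cProfile
import pstats

# Profile one long-sentence synthesis and show the most expensive calls
cProfile.run('benchmark_tacotron2(tacotron_model, "Quick brown fox jumps over the lazy dog.")',
             'tts_profile.out')
pstats.Stats('tts_profile.out').sort_stats('cumulative').print_stats(20)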

python3 -m venv my_env_cpu
source my_env_cpu/bin/activate
pip3 install https://download.pytorch.org/whl/cpu/torch-1.1.0-cp35-cp35m-linux_x86_64.whl
pip3 install torchvision
python -c "import torch; print(torch.__version__)"
1.1.0

Default:
    Tacotron2 model load time: 0.33 sec
    ap.inv_mel_spectrogram time: 0.49 sec
    Tacotron2 inference time warmup: 0.72 sec
    ap.inv_mel_spectrogram time: 0.2 sec
    Tacotron2 inference time short sentence: 0.37 sec
    ap.inv_mel_spectrogram time: 1.9 sec
    Tacotron2 inference time long sentence: 3.15 sec

Setting os.environ["OMP_NUM_THREADS"] = "1" in the Python script:
    Tacotron2 model load time: 0.34 sec
    ap.inv_mel_spectrogram time: 0.47 sec
    Tacotron2 inference time warmup: 0.74 sec
    ap.inv_mel_spectrogram time: 0.19 sec
    Tacotron2 inference time short sentence: 0.4 sec
    ap.inv_mel_spectrogram time: 1.88 sec
    Tacotron2 inference time long sentence: 3.52 sec

Running as time OMP_NUM_THREADS=1 python demo_tts_cpu.py:
    Tacotron2 model load time: 0.35 sec
    ap.inv_mel_spectrogram time: 0.46 sec
    Tacotron2 inference time warmup: 0.74 sec
    ap.inv_mel_spectrogram time: 0.18 sec
    Tacotron2 inference time short sentence: 0.38 sec
    ap.inv_mel_spectrogram time: 1.87 sec
    Tacotron2 inference time long sentence: 3.49 sec

Running via a bash script with the env variables set:
cat run_tts_demo.sh
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
time python3 demo_tts_cpu.py
    Tacotron2 model load time: 0.35 sec
    ap.inv_mel_spectrogram time: 0.54 sec
    Tacotron2 inference time warmup: 0.83 sec
    ap.inv_mel_spectrogram time: 0.18 sec
    Tacotron2 inference time short sentence: 0.41 sec
    ap.inv_mel_spectrogram time: 1.87 sec
    Tacotron2 inference time long sentence: 3.51 sec
mrgloom commented 5 years ago

I have also checked CPU vs GPU speed:

Model load+init and the warmup run are faster on CPU, but for the longer sentence inference is faster on GPU. I'm not sure why the ap.inv_mel_spectrogram time also increases.
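A general benchmarking note for the GPU path (my own sketch, not verified on this setup): CUDA kernels are launched asynchronously, so without an explicit synchronization part of the model's time can be attributed to whatever is timed next (here, ap.inv_mel_spectrogram). A hedged sketch of a safer timing helper:

import time
import torch

def timed(fn, *args, **kwargs):
    """Time a call, waiting for queued CUDA kernels so GPU work is fully counted."""
    start = time.time()
    out = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return out, round(time.time() - start, 2)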

CPU:
Tacotron2 model load time: 0.34 sec
ap.inv_mel_spectrogram time: 0.5 sec
Tacotron2 inference time warmup: 0.73 sec
ap.inv_mel_spectrogram time: 0.2 sec
Tacotron2 inference time short sentence: 0.38 sec
ap.inv_mel_spectrogram time: 1.91 sec
Tacotron2 inference time long sentence: 3.17 sec

GPU:
Tacotron2 model load time: 3.89 sec
ap.inv_mel_spectrogram time: 0.87 sec
Tacotron2 inference time warmup: 1.45 sec
ap.inv_mel_spectrogram time: 0.41 sec
Tacotron2 inference time short sentence: 0.83 sec
ap.inv_mel_spectrogram time: 1.97 sec
Tacotron2 inference time long sentence: 2.88 sec
mrgloom commented 5 years ago

I have also tried adding with torch.no_grad(): (https://discuss.pytorch.org/t/model-eval-vs-with-torch-no-grad/19615/2) here: https://github.com/mozilla/TTS/blob/dev-tacotron2/models/tacotron2.py#L42 and here: https://github.com/mozilla/TTS/blob/dev-tacotron2/models/tacotron2.py#L54
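For reference, a minimal sketch of that change (my own illustration, not the repo's exact code, assuming model.inference is the Tacotron2 inference entry point linked above); model.eval() only switches layer behaviour, while torch.no_grad() skips building the autograd graph:

import torch

def run_inference(model, inputs):
    model.eval()              # inference behaviour for dropout/batch-norm layers
    with torch.no_grad():     # no autograd graph is built, saving memory and a little time
        return model.inference(inputs)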

It seems the times didn't change much:

CPU:
Tacotron2 model load time: 0.34 sec
ap.inv_mel_spectrogram time: 0.5 sec
Tacotron2 inference time warmup: 0.74 sec
ap.inv_mel_spectrogram time: 0.2 sec
Tacotron2 inference time short sentence: 0.37 sec
ap.inv_mel_spectrogram time: 1.93 sec
Tacotron2 inference time long sentence: 3.13 sec

GPU:
Tacotron2 model load time: 4.29 sec
ap.inv_mel_spectrogram time: 0.85 sec
Tacotron2 inference time warmup: 1.38 sec
ap.inv_mel_spectrogram time: 0.43 sec
Tacotron2 inference time short sentence: 0.95 sec
ap.inv_mel_spectrogram time: 2.02 sec
Tacotron2 inference time long sentence: 2.91 sec
zhang-yunke commented 5 years ago

@mrgloom Hi there, I am a little bit confused about your speed test. According to your results, after loading and warmup, Tacotron2 inference takes half as much time on CPU as on GPU for the short sentence. Could you please post your test sample and your device parameters? I am facing the CPU inference problem too.

mrgloom commented 5 years ago
CPU: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
GPU: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
"Hello!" - short sentence
"Quick brown fox jumps over the lazy dog." - long sentence
mrgloom commented 5 years ago

Actually I have 2 GPUs in my system:

NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
NVIDIA Corporation Device 1b06 (rev a1) (it's actually GeForce GTX 1080 Ti)

The GeForce GTX TITAN X is an older generation, and I don't know why it's faster. Inference can be faster on GPU (than on CPU) for long sentences, maybe because of the RNN layers.

With the GL part removed, only Tacotron2 inference:

CPU:
    Tacotron2 model load time: 0.36 sec
    Tacotron2 inference time warmup: 0.34 sec
    Tacotron2 inference time short sentence: 0.18 sec
    Tacotron2 inference time long sentence: 1.19 sec

GPU:
    Tacotron2 model load time: 4.02 sec
    Tacotron2 inference time warmup: 0.64 sec
    Tacotron2 inference time short sentence: 0.58 sec
    Tacotron2 inference time long sentence: 1.1 sec

GPU:
    Setting device explicitly: torch.cuda.set_device(0)
    Tacotron2 model load time: 2.75 sec
    Tacotron2 inference time warmup: 0.59 sec
    Tacotron2 inference time short sentence: 0.56 sec
    Tacotron2 inference time long sentence: 1.03 sec

GPU: 
    Setting device explicitly: torch.cuda.set_device(1)
    Tacotron2 model load time: 4.91 sec
    Tacotron2 inference time warmup: 1.04 sec
    Tacotron2 inference time short sentence: 1.01 sec
    Tacotron2 inference time long sentence: 1.44 sec

Update: actually the device indices map the other way around:

>>> torch.cuda.get_device_name(0)
'GeForce GTX 1080 Ti'
>>> torch.cuda.get_device_name(1)
'GeForce GTX TITAN X'

The GeForce GTX 1080 Ti is faster than the GeForce GTX TITAN X, which makes sense.
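For reference (my own note): the CUDA runtime orders devices "fastest first" by default, which can differ from the PCI order shown by lspci/nvidia-smi unless CUDA_DEVICE_ORDER=PCI_BUS_ID is set. A quick way to map indices to names:

import torch

# List visible CUDA devices and their names to avoid index confusion
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))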

mrgloom commented 5 years ago

Effect of using torch.backends.cudnn.benchmark: it looks like there is not much difference, but it seems better to disable it.
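For context (my own note): cudnn.benchmark = True makes cuDNN autotune the best algorithm per input shape, which only pays off when shapes are fixed; with the variable sequence lengths in Tacotron2 inference it mostly adds re-tuning overhead. A minimal sketch of the toggle used in the runs below:

import torch

torch.backends.cudnn.benchmark = False   # skip per-shape autotuning; safer with variable-length inputs
torch.cuda.set_device(0)                 # GPU 0 (the GTX 1080 Ti above)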

torch.backends.cudnn.benchmark = False
torch.cuda.set_device(0)
    ------------------------------------------------------------
    Tacotron2 model load time: 2.69 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 0.47 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.48 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 0.95 sec

    ------------------------------------------------------------
    Tacotron2 model load time: 2.8 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 0.42 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.43 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 0.75 sec

    ------------------------------------------------------------
    Tacotron2 model load time: 2.7 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 0.35 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.4 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 0.75 sec

torch.backends.cudnn.benchmark = False
torch.cuda.set_device(1)
    ------------------------------------------------------------
    Tacotron2 model load time: 19.48 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 1.03 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.68 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 1.14 sec

    ------------------------------------------------------------
    Tacotron2 model load time: 5.0 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 0.68 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.61 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 1.1 sec

    ------------------------------------------------------------
    Tacotron2 model load time: 5.06 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 0.5 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.44 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 1.01 sec

torch.backends.cudnn.benchmark = True
torch.cuda.set_device(0)
    ------------------------------------------------------------
    Tacotron2 model load time: 2.7 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 0.76 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.43 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 0.93 sec

    ------------------------------------------------------------
    Tacotron2 model load time: 2.87 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 0.75 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.47 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 1.02 sec

    ------------------------------------------------------------
    Tacotron2 model load time: 2.92 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 0.76 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.53 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 1.03 sec

torch.backends.cudnn.benchmark = True
torch.cuda.set_device(1)
    ------------------------------------------------------------
    Tacotron2 model load time: 4.85 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 0.83 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.77 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 1.69 sec

    ------------------------------------------------------------
    Tacotron2 model load time: 5.23 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 1.54 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.71 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 1.42 sec

    ------------------------------------------------------------
    Tacotron2 model load time: 4.91 sec
    ------------------------------------------------------------
    DEBUG: iters_count: 36
    Tacotron2 inference time warmup: 0.94 sec
    DEBUG: iters_count: 20
    Tacotron2 inference time short sentence: 0.73 sec
    DEBUG: iters_count: 223
    Tacotron2 inference time long sentence: 1.57 sec