Closed by geneing 5 years ago
It gave almost a 1 sec boost on a 6 sec run. Nice trick! Do you have a reference that explains this issue?
Does it depend on hardware or sentence length?
I can't see any difference on an 8-core Mac notebook running on CPU:
Benchmark at commit 824c09120b6e6bf4be99d03cc07e611ecbf140fe on dev-tacotron2:
pretrained_models
├── ljspeech-260k
│   ├── checkpoint_260000.pth.tar
│   ├── config.json
│   └── events.out.tfevents.1554744767.erogol-desktop
└── mold_ljspeech_best_model
    ├── checkpoint_393000.pth.tar
    ├── checkpoint_433000.pth.tar
    └── config.json
CONFIG {'github_branch': 'dev-tacotron2', 'restore_path': '/media/erogol/data_ssd/Data/models/ljspeech_models/4241_best_t2_model/best_model.pth.tar', 'run_name': 'ljspeech', 'run_description': 'finetune 4241 for align with architectural changes', 'audio': {'num_mels': 80, 'num_freq': 1025, 'sample_rate': 22050, 'frame_length_ms': 50, 'frame_shift_ms': 12.5, 'preemphasis': 0.98, 'min_level_db': -100, 'ref_level_db': 20, 'power': 1.5, 'griffin_lim_iters': 60, 'signal_norm': True, 'symmetric_norm': False, 'max_norm': 1, 'clip_norm': True, 'mel_fmin': 0.0, 'mel_fmax': 8000.0, 'do_trim_silence': True}, 'distributed': {'backend': 'nccl', 'url': 'tcp://localhost:54321'}, 'reinit_layers': [], 'model': 'Tacotron2', 'grad_clip': 1, 'epochs': 1000, 'lr': 0.0001, 'lr_decay': False, 'warmup_steps': 4000, 'windowing': True, 'memory_size': 5, 'attention_norm': 'softmax', 'prenet_type': 'bn', 'use_forward_attn': True, 'transition_agent': False, 'loss_masking': False, 'enable_eos_bos_chars': True, 'batch_size': 16, 'eval_batch_size': 16, 'r': 1, 'wd': 1e-06, 'checkpoint': True, 'save_step': 1000, 'print_step': 100, 'tb_model_param_stats': True, 'batch_group_size': 8, 'run_eval': True, 'test_delay_epochs': 2, 'data_path': '/home/erogol/Data/LJSpeech-1.1', 'meta_file_train': 'metadata_train.csv', 'meta_file_val': 'metadata_val.csv', 'dataset': 'ljspeech', 'min_seq_len': 0, 'max_seq_len': 150, 'output_path': '/media/erogol/data_ssd/Data/models/ljspeech_models/', 'num_loader_workers': 8, 'num_val_loader_workers': 4, 'phoneme_cache_path': 'ljspeech_phonemes', 'use_phonemes': True, 'phoneme_language': 'en-us', 'text_cleaner': 'phoneme_cleaners'}
Default:
Tacotron2 model load time: 0.51 sec
Tacotron2 inference time warmup: 0.91 sec
Tacotron2 inference time short sentence: 0.38 sec
Tacotron2 inference time long sentence: 2.93 sec
Set os.environ["OMP_NUM_THREADS"] = "1" in the Python script:
Tacotron2 model load time: 0.51 sec
Tacotron2 inference time warmup: 0.74 sec
Tacotron2 inference time short sentence: 0.37 sec
Tacotron2 inference time long sentence: 2.91 sec
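A note on setting the variable from inside the script: the OpenMP and MKL runtimes typically read these variables when they are first initialized, so the assignment is safest placed before `import torch` (or numpy). A minimal sketch, assuming nothing about demo_tts_cpu.py itself; `limit_cpu_threads` is a hypothetical helper name, and the torch call is skipped when PyTorch is not installed:

```python
import os


def limit_cpu_threads(n=1):
    """Cap the thread pools used by OpenMP, MKL and numexpr for this process."""
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
        os.environ[var] = str(n)


# Call this before importing torch or numpy.
limit_cpu_threads(1)

try:
    import torch

    # torch.set_num_threads also caps intra-op parallelism from inside Python.
    torch.set_num_threads(1)
except ImportError:
    pass  # PyTorch not installed; only the environment variables are set
```

If the timings do not change at all, as in the runs here, it may simply mean the installed build is not using OpenMP for the heavy operations.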
Run as: time OMP_NUM_THREADS=1 python demo_tts_cpu.py
Tacotron2 model load time: 0.5 sec
Tacotron2 inference time warmup: 0.69 sec
Tacotron2 inference time short sentence: 0.37 sec
Tacotron2 inference time long sentence: 2.91 sec
Running via a bash script with the environment variables set:
cat run_tts_demo.sh
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
time python3 demo_tts_cpu.py
Tacotron2 model load time: 0.51 sec
Tacotron2 inference time warmup: 0.7 sec
Tacotron2 inference time short sentence: 0.37 sec
Tacotron2 inference time long sentence: 2.9 sec
My test code:
import time
import torch

# CONFIG, MODEL_PATH, use_cuda, ap and synthesis are defined elsewhere in the script (not shown here).

def load_tocotron_2_model():
    from utils.text.symbols import symbols, phonemes
    from models.tacotron2 import Tacotron2
    n_chars = len(phonemes) if CONFIG.use_phonemes else len(symbols)  # 'use_phonemes': True
    print('n_chars', n_chars)
    model = Tacotron2(num_chars=n_chars, r=CONFIG.r, attn_win=CONFIG.windowing, attn_norm=CONFIG.attention_norm,
                      prenet_type=CONFIG.prenet_type, forward_attn=CONFIG.use_forward_attn,
                      trans_agent=CONFIG.transition_agent)
    # load the checkpoint, remapping tensors to CPU when CUDA is not used
    if use_cuda:
        cp = torch.load(MODEL_PATH)
    else:
        cp = torch.load(MODEL_PATH, map_location='cpu')
    # load the model weights
    model.load_state_dict(cp['model'])
    if use_cuda:
        model.cuda()
    model.eval()  # set eval mode
    print("cp['step']", cp['step'])
    return model

def benchmark_tacotron2(model, text):
    # wall-clock time of one full synthesis call, in seconds
    start = time.time()
    waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens = synthesis(
        model, text, CONFIG, use_cuda, ap,
        truncated=False,
        enable_eos_bos_chars=CONFIG.enable_eos_bos_chars,
        trim_silence=False)
    t = round(time.time() - start, 2)
    return t

if __name__ == '__main__':
    start = time.time()
    tacotron_model = load_tocotron_2_model()
    print('Tacotron2 model load time:', round(time.time() - start, 2), 'sec')
    # Warmup
    t = benchmark_tacotron2(tacotron_model, "Warmup!")
    print('Tacotron2 inference time warmup:', t, 'sec')
    # Short sentence
    t = benchmark_tacotron2(tacotron_model, "Hello!")
    print('Tacotron2 inference time short sentence:', t, 'sec')
    # Long sentence
    t = benchmark_tacotron2(tacotron_model, "Quick brown fox jumps over the lazy dog.")
    print('Tacotron2 inference time long sentence:', t, 'sec')
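Since each case in the test code is timed only once, differences of a few hundredths of a second are hard to tell apart from noise. A minimal sketch of a repeated measurement, using only the standard library; `time_repeated` is a hypothetical helper and the lambda is a stand-in workload, not the real synthesis call:

```python
import statistics
import time


def time_repeated(fn, repeats=5):
    """Call fn() `repeats` times and return (min, median) wall time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings), statistics.median(timings)


if __name__ == "__main__":
    # Stand-in workload; in the script above this would be something like
    # lambda: benchmark_tacotron2(tacotron_model, "Hello!")
    best, median = time_repeated(lambda: sum(i * i for i in range(100_000)))
    print(f"min {best:.4f} sec, median {median:.4f} sec")
```

Reporting the minimum or median over several calls makes a comparison between thread settings less sensitive to one slow run.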
Also, as far as I can see, Griffin-Lim (GL) is done on CPU here: https://github.com/mozilla/TTS/blob/618b2808127f6fd00fe643fe6e852ddf1d2986e1/utils/audio.py#L155
I also wonder whether the time reported at https://github.com/mozilla/TTS/tree/dev-tacotron2#runtime is for GL on GPU only, or for Tacotron2 on GPU + GL on GPU. Is it inference time only, without model loading time? And for which sentence?
I also wonder whether any Tacotron2 parameters can be tuned for faster inference on short sentences.
@erogol I don't have a reference. I was trying to understand why inference on my computer was much slower than expected. While profiling, I noticed many hits in the OpenMP library. I have tried using OpenMP in other projects and performance was always abysmal because of too much threading overhead. Lowering the OpenMP thread count improved the performance.
@mrgloom Are PyTorch and the BLAS libraries built with OpenMP support on the Mac? I see an open feature request to add OpenMP support: https://github.com/pytorch/pytorch/issues/6328. You could check by looking at CPU utilization during inference. If it looks like only one core is used, then OpenMP is probably not used in your setup.
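One way to do this check without watching htop is to compare the CPU time consumed by the process with the wall-clock time around a call. A minimal sketch using only the standard library; `cpu_to_wall_ratio` is a hypothetical helper, and the loop is a stand-in for the synthesis call. A ratio close to 1 suggests one busy core, while a ratio well above 1 suggests several threads were working:

```python
import time


def cpu_to_wall_ratio(fn, *args, **kwargs):
    """Return (CPU time of the whole process) / (wall time) for one call of fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # user + system time, summed over all threads
    fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return cpu / wall if wall > 0 else 0.0


if __name__ == "__main__":
    # Stand-in, single-threaded workload
    ratio = cpu_to_wall_ratio(lambda: sum(i * i for i in range(2_000_000)))
    print(f"CPU/wall ratio: {ratio:.2f}")
```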
I can verify that it improves the performance slightly on Linux.
On a 12-core Linux machine with the CPU-only version of PyTorch installed in a virtualenv:
I still can't see any speedup; it actually runs even slower.
With the default settings, htop shows that more than one core is busy.
Also, as far as I can see, ap.inv_mel_spectrogram takes more than half of the time.
python3 -m venv my_env_cpu
source my_env_cpu/bin/activate
pip3 install https://download.pytorch.org/whl/cpu/torch-1.1.0-cp35-cp35m-linux_x86_64.whl
pip3 install torchvision
python -c "import torch; print(torch.__version__)"
1.1.0
Default:
Tacotron2 model load time: 0.33 sec
ap.inv_mel_spectrogram time: 0.49 sec
Tacotron2 inference time warmup: 0.72 sec
ap.inv_mel_spectrogram time: 0.2 sec
Tacotron2 inference time short sentence: 0.37 sec
ap.inv_mel_spectrogram time: 1.9 sec
Tacotron2 inference time long sentence: 3.15 sec
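To put a number on "more than half", the share can be computed directly from the default run just listed. A small sketch, assuming the reported inference time includes the ap.inv_mel_spectrogram call; `gl_share` is a hypothetical helper name:

```python
# (ap.inv_mel_spectrogram time, total inference time) in seconds,
# copied from the default run listed above
runs = {
    "warmup": (0.49, 0.72),
    "short sentence": (0.2, 0.37),
    "long sentence": (1.9, 3.15),
}


def gl_share(gl_time, total_time):
    """Fraction of the total time spent in ap.inv_mel_spectrogram."""
    return gl_time / total_time


for name, (gl, total) in runs.items():
    print(f"{name}: {100 * gl_share(gl, total):.0f}% in ap.inv_mel_spectrogram")
```

That works out to roughly 68%, 54% and 60% for the warmup, short and long sentence.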
Set os.environ["OMP_NUM_THREADS"] = "1" in python script
Tacotron2 model load time: 0.34 sec
ap.inv_mel_spectrogram time: 0.47 sec
Tacotron2 inference time warmup: 0.74 sec
ap.inv_mel_spectrogram time: 0.19 sec
Tacotron2 inference time short sentence: 0.4 sec
ap.inv_mel_spectrogram time: 1.88 sec
Tacotron2 inference time long sentence: 3.52 sec
Run as: time OMP_NUM_THREADS=1 python demo_tts_cpu.py
Tacotron2 model load time: 0.35 sec
ap.inv_mel_spectrogram time: 0.46 sec
Tacotron2 inference time warmup: 0.74 sec
ap.inv_mel_spectrogram time: 0.18 sec
Tacotron2 inference time short sentence: 0.38 sec
ap.inv_mel_spectrogram time: 1.87 sec
Tacotron2 inference time long sentence: 3.49 sec
Running via a bash script with the environment variables set:
cat run_tts_demo.sh
export MKL_NUM_THREADS=1
export NUMEXPR_NUM_THREADS=1
export OMP_NUM_THREADS=1
time python3 demo_tts_cpu.py
Tacotron2 model load time: 0.35 sec
ap.inv_mel_spectrogram time: 0.54 sec
Tacotron2 inference time warmup: 0.83 sec
ap.inv_mel_spectrogram time: 0.18 sec
Tacotron2 inference time short sentence: 0.41 sec
ap.inv_mel_spectrogram time: 1.87 sec
Tacotron2 inference time long sentence: 3.51 sec
I have also compared CPU and GPU speed:
Model load + init and the warmup run are faster on CPU, but inference on the longer sentence is faster on GPU.
I'm not sure why the ap.inv_mel_spectrogram time also increases on GPU.
CPU:
Tacotron2 model load time: 0.34 sec
ap.inv_mel_spectrogram time: 0.5 sec
Tacotron2 inference time warmup: 0.73 sec
ap.inv_mel_spectrogram time: 0.2 sec
Tacotron2 inference time short sentence: 0.38 sec
ap.inv_mel_spectrogram time: 1.91 sec
Tacotron2 inference time long sentence: 3.17 sec
GPU:
Tacotron2 model load time: 3.89 sec
ap.inv_mel_spectrogram time: 0.87 sec
Tacotron2 inference time warmup: 1.45 sec
ap.inv_mel_spectrogram time: 0.41 sec
Tacotron2 inference time short sentence: 0.83 sec
ap.inv_mel_spectrogram time: 1.97 sec
Tacotron2 inference time long sentence: 2.88 sec
I have also tried adding with torch.no_grad():
https://discuss.pytorch.org/t/model-eval-vs-with-torch-no-grad/19615/2
Here:
https://github.com/mozilla/TTS/blob/dev-tacotron2/models/tacotron2.py#L42
and here:
https://github.com/mozilla/TTS/blob/dev-tacotron2/models/tacotron2.py#L54
The times don't seem to change much:
CPU:
Tacotron2 model load time: 0.34 sec
ap.inv_mel_spectrogram time: 0.5 sec
Tacotron2 inference time warmup: 0.74 sec
ap.inv_mel_spectrogram time: 0.2 sec
Tacotron2 inference time short sentence: 0.37 sec
ap.inv_mel_spectrogram time: 1.93 sec
Tacotron2 inference time long sentence: 3.13 sec
GPU:
Tacotron2 model load time: 4.29 sec
ap.inv_mel_spectrogram time: 0.85 sec
Tacotron2 inference time warmup: 1.38 sec
ap.inv_mel_spectrogram time: 0.43 sec
Tacotron2 inference time short sentence: 0.95 sec
ap.inv_mel_spectrogram time: 2.02 sec
Tacotron2 inference time long sentence: 2.91 sec
@mrgloom Hi there, I am a little confused by your speed test. According to your results, after loading and warmup, Tacotron2 inference on the short sentence takes about half as long on CPU as on GPU. Can you please post your test sentences and your hardware specs? I am also having problems with inference on CPU.
CPU: Intel(R) Core(TM) i7-5820K CPU @ 3.30GHz
GPU: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
"Hello!" - short sentence
"Quick brown fox jumps over the lazy dog." - long sentence
Actually, I have 2 GPUs in my system:
NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
NVIDIA Corporation Device 1b06 (rev a1) (it's actually GeForce GTX 1080 Ti)
The GeForce GTX TITAN X is an older generation and I don't know why it's faster. Inference can be faster on GPU than on CPU for long sentences, maybe because of the RNN layers.
With the GL part removed (Tacotron2 inference only):
CPU:
Tacotron2 model load time: 0.36 sec
Tacotron2 inference time warmup: 0.34 sec
Tacotron2 inference time short sentence: 0.18 sec
Tacotron2 inference time long sentence: 1.19 sec
GPU:
Tacotron2 model load time: 4.02 sec
Tacotron2 inference time warmup: 0.64 sec
Tacotron2 inference time short sentence: 0.58 sec
Tacotron2 inference time long sentence: 1.1 sec
GPU:
Setting device explicitly: torch.cuda.set_device(0)
Tacotron2 model load time: 2.75 sec
Tacotron2 inference time warmup: 0.59 sec
Tacotron2 inference time short sentence: 0.56 sec
Tacotron2 inference time long sentence: 1.03 sec
GPU:
Setting device explicitly: torch.cuda.set_device(1)
Tacotron2 model load time: 4.91 sec
Tacotron2 inference time warmup: 1.04 sec
Tacotron2 inference time short sentence: 1.01 sec
Tacotron2 inference time long sentence: 1.44 sec
Update: actually, the device indices map as follows:
>>> torch.cuda.get_device_name(0)
'GeForce GTX 1080 Ti'
>>> torch.cuda.get_device_name(1)
'GeForce GTX TITAN X'
So the GeForce GTX 1080 Ti is faster than the GeForce GTX TITAN X, which makes sense.
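The index mix-up is probably due to CUDA's default device enumeration, which orders GPUs fastest-first rather than by PCI bus position. A minimal sketch, assuming the goal is to make the indices match what nvidia-smi shows; `pin_cuda_device` is a hypothetical helper, the index 0 is only an illustration, and the variables need to be set before CUDA is initialized (safest before importing torch):

```python
import os


def pin_cuda_device(index):
    """Make CUDA enumerate GPUs in PCI bus order and expose only one of them."""
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # the default is FASTEST_FIRST
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)  # the chosen GPU then appears as device 0


# Call this before importing torch.
pin_cuda_device(0)
```

Printing torch.cuda.get_device_name for each index, as in the update here, remains the direct way to confirm which card a benchmark actually ran on.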
Effect of torch.backends.cudnn.benchmark: there does not seem to be much difference, but it looks better to disable it.
torch.backends.cudnn.benchmark = False
torch.cuda.set_device(0)
------------------------------------------------------------
Tacotron2 model load time: 2.69 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 0.47 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.48 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 0.95 sec
------------------------------------------------------------
Tacotron2 model load time: 2.8 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 0.42 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.43 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 0.75 sec
------------------------------------------------------------
Tacotron2 model load time: 2.7 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 0.35 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.4 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 0.75 sec
torch.backends.cudnn.benchmark = False
torch.cuda.set_device(1)
------------------------------------------------------------
Tacotron2 model load time: 19.48 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 1.03 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.68 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 1.14 sec
------------------------------------------------------------
Tacotron2 model load time: 5.0 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 0.68 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.61 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 1.1 sec
------------------------------------------------------------
Tacotron2 model load time: 5.06 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 0.5 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.44 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 1.01 sec
torch.backends.cudnn.benchmark = True
torch.cuda.set_device(0)
------------------------------------------------------------
Tacotron2 model load time: 2.7 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 0.76 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.43 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 0.93 sec
------------------------------------------------------------
Tacotron2 model load time: 2.87 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 0.75 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.47 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 1.02 sec
------------------------------------------------------------
Tacotron2 model load time: 2.92 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 0.76 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.53 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 1.03 sec
torch.backends.cudnn.benchmark = True
torch.cuda.set_device(1)
------------------------------------------------------------
Tacotron2 model load time: 4.85 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 0.83 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.77 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 1.69 sec
------------------------------------------------------------
Tacotron2 model load time: 5.23 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 1.54 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.71 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 1.42 sec
------------------------------------------------------------
Tacotron2 model load time: 4.91 sec
------------------------------------------------------------
DEBUG: iters_count: 36
Tacotron2 inference time warmup: 0.94 sec
DEBUG: iters_count: 20
Tacotron2 inference time short sentence: 0.73 sec
DEBUG: iters_count: 223
Tacotron2 inference time long sentence: 1.57 sec
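To make the comparison easier to read, the three long-sentence timings per setting can be averaged. A small sketch using only the numbers reported in the four blocks of results; the dictionary layout is just one way to organize them:

```python
from statistics import mean

# Long-sentence inference times (sec) from the three runs of each setting
long_sentence = {
    ("benchmark=False", "device 0"): [0.95, 0.75, 0.75],
    ("benchmark=False", "device 1"): [1.14, 1.1, 1.01],
    ("benchmark=True", "device 0"): [0.93, 1.02, 1.03],
    ("benchmark=True", "device 1"): [1.69, 1.42, 1.57],
}

averages = {key: mean(times) for key, times in long_sentence.items()}
for (setting, device), avg in averages.items():
    print(f"{setting}, {device}: {avg:.2f} sec")
```

The averages come out to about 0.82 vs 0.99 sec on device 0 and 1.08 vs 1.56 sec on device 1, which is consistent with disabling cudnn.benchmark here, although three runs per setting is a small sample.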
I have a solution for slow inference on CPU. Try setting the environment variable OMP_NUM_THREADS=1 before running the Python script. When PyTorch is allowed to set the thread count equal to the number of CPU cores, it takes 10x longer to synthesize text.
It's really a problem with PyTorch and the BLAS libraries, not TTS. However, it leads to the perception that TTS inference is slow. I would suggest documenting it in the README file.