myshell-ai / MeloTTS

High-quality multi-lingual text-to-speech library by MyShell.ai. Supports English, Spanish, French, Chinese, Japanese, and Korean.
MIT License

Calling melo CLI for "ZH" long coldstart times, even if cached #130

Open zihaolam opened 6 months ago

zihaolam commented 6 months ago

Running this command consistently produces output in approx. 7 seconds: `melo 我的名字叫小杨 dog.wav --language ZH`

/Users/zihaolam/Projects/tts-editor/MeloTTS/melo/main.py:71: UserWarning: You specified a speaker but the language is English.
  warnings.warn("You specified a speaker but the language is English.")
loading pickled model from cache
loaded pickled model from cache, took 8.529947996139526
 > Text split to sentences.
我的名字叫小杨
 > ===========================
  0%|                                                                  | 0/1 [00:00<?, ?it/s]Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/j4/zkddp3ms6493qzbf3qf7rfwr0000gn/T/jieba.cache
Loading model cost 0.406 seconds.
Prefix dict has been built successfully.
Some weights of the model checkpoint at bert-base-multilingual-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
/Users/zihaolam/Projects/tts-editor/MeloTTS/.venv/lib/python3.9/site-packages/torch/nn/functional.py:4522: UserWarning: MPS: The constant padding of more than 3 dimensions is not currently supported natively. It uses View Ops default implementation to run. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Pad.mm:472.)
  return torch._C._nn.pad(input, pad, mode, value)
/Users/zihaolam/Projects/tts-editor/MeloTTS/melo/commons.py:123: UserWarning: MPS: no support for int64 for min_max, downcasting to a smaller data type (int32/float32). Native support for int64 has been added in macOS 13.3. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/ReduceOps.mm:612.)
  max_length = length.max()
100%|██████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.51s/it]
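
A quick way to see where the ~7 seconds go is to time the import, the model construction, and the synthesis call separately. This is only a sketch, assuming the melo.api.TTS interface shown in the project README; the numbers will vary by machine:

```python
# Rough breakdown of the cold start: module imports (torch, transformers),
# model construction, and the synthesis call itself.
# Sketch only; assumes the melo.api.TTS interface from the README.
import time

t0 = time.time()
from melo.api import TTS  # heavy import: pulls in torch, transformers, etc.
t1 = time.time()

model = TTS(language="ZH", device="auto")
speaker_id = model.hps.data.spk2id["ZH"]
t2 = time.time()

model.tts_to_file("我的名字叫小杨", speaker_id, "dog.wav")
t3 = time.time()

print(f"import: {t1 - t0:.1f}s  init: {t2 - t1:.1f}s  synth: {t3 - t2:.1f}s")
```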
The pickled-model lines in the log come from a caching helper I added around model construction:

import os
import pickle
import time


def get_model_pkl_path(language: str) -> str:
    # Cache file next to this script, e.g. model_ZH.pkl.
    return os.path.join(os.path.dirname(__file__), f"model_{language}.pkl")


def get_model(language: str, device: str):
    model_pkl_path = get_model_pkl_path(language)
    if not os.path.exists(model_pkl_path):
        # First run: build the model normally and pickle it for next time.
        from melo.api import TTS

        model = TTS(language=language, device=device)
        with open(model_pkl_path, "wb") as f:
            pickle.dump(model, f)
    else:
        # Subsequent runs: restore the model from the pickle on disk.
        start = time.time()
        print("loading pickled model from cache")
        with open(model_pkl_path, "rb") as f:
            model = pickle.load(f)
        print("loaded pickled model from cache, took", time.time() - start)
    return model
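
For reference, a hypothetical call site for this helper, mirroring the CLI invocation above; the tts_to_file usage follows the README, and "mps" is the Apple Silicon device seen in the warnings:

```python
# Hypothetical call site for get_model; mirrors the CLI invocation above.
model = get_model("ZH", "mps")
speaker_id = model.hps.data.spk2id["ZH"]
model.tts_to_file("我的名字叫小杨", speaker_id, "dog.wav")
```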

Even with this pickle cache, synthesizing a short sentence still takes approx. 7 seconds: unpickling has to reconstruct every weight tensor anyway (8.5 s in the log above), and the heavy torch/transformers imports run on every invocation.

Is there a way to improve the speed, or anything else I can cache, to reduce this cold start?
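
One pattern that sidesteps the cold start entirely is paying it once and synthesizing many utterances in the same process. A minimal sketch, using the melo.api.TTS interface from the README (the file names and speed value are illustrative):

```python
# Load once, synthesize many: the ~7 s startup is paid a single time.
# Sketch based on the README's melo.api.TTS usage; paths are illustrative.
from melo.api import TTS

model = TTS(language="ZH", device="auto")  # one-time cold start
speaker_id = model.hps.data.spk2id["ZH"]

with open("lines.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        text = line.strip()
        if text:
            model.tts_to_file(text, speaker_id, f"out_{i:03d}.wav", speed=1.0)
```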

The Gradio web UI generates the same text in approx. 1 second. However, I would like to use the CLI instead of running a Python server. Is there a way to optimise anything so that the CLI takes the same time as the web UI/server?
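
The web UI is fast because the server loads the model once and keeps it resident; each request only pays inference cost, while a fresh CLI process re-imports torch and rebuilds the model every time. One way to get web-UI latency without Gradio is a small long-lived worker that reads requests from stdin. A sketch under those assumptions; the tab-separated line protocol here is invented for illustration and is not part of MeloTTS:

```python
# Long-lived worker: the model loads once; each stdin line of the form
# "text<TAB>output.wav" is then synthesized at roughly web-UI speed.
# Sketch using the melo.api.TTS interface from the README; the line
# protocol is illustrative only.
import sys

from melo.api import TTS

model = TTS(language="ZH", device="auto")  # slow part, happens once
speaker_id = model.hps.data.spk2id["ZH"]
print("ready", flush=True)

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    text, _, out_path = line.partition("\t")
    model.tts_to_file(text, speaker_id, out_path or "out.wav")
    print(f"wrote {out_path or 'out.wav'}", flush=True)
```

A thin CLI could then write lines to this worker over a FIFO or socket instead of starting a new Python process per sentence.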