open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License

[Help]: It takes too long to load the model during inference #322

Open zziC7 opened 3 weeks ago

zziC7 commented 3 weeks ago

Problem Overview

When I run maskgct_inference.py, I find that loading the models before inference takes too long.

Steps Taken

  1. I recorded how long each step takes.

a. build stage

    start_time = time.time()

    # 1. build semantic model (w2v-bert-2.0)
    semantic_model, semantic_mean, semantic_std = build_semantic_model(device)
    # 2. build semantic codec
    semantic_codec = build_semantic_codec(cfg.model.semantic_codec, device)
    # 3. build acoustic codec
    codec_encoder, codec_decoder = build_acoustic_codec(
        cfg.model.acoustic_codec, device
    )
    # 4. build t2s model
    t2s_model = build_t2s_model(cfg.model.t2s_model, device)
    # 5. build s2a model
    s2a_model_1layer = build_s2a_model(cfg.model.s2a_model.s2a_1layer, device)
    s2a_model_full = build_s2a_model(cfg.model.s2a_model.s2a_full, device)

    end_time = time.time()
    build_time = end_time - start_time
    print(f"build_time: {build_time} seconds")
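
The per-stage timing boilerplate above (repeated `start_time` / `end_time` pairs) can be factored into a small helper. This is a hedged sketch, not part of Amphion's code:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Prints the wall-clock time spent inside the `with` block.
    start = time.time()
    try:
        yield
    finally:
        print(f"{label}: {time.time() - start:.2f} seconds")
```

Usage: `with timed("build_time"): ...` around each stage replaces the manual subtraction and print.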

b. download stage

    start_time = time.time()
    # download checkpoint
    # download semantic codec ckpt
    semantic_code_ckpt = hf_hub_download(
        "amphion/MaskGCT", filename="semantic_codec/model.safetensors"
    )
    # download acoustic codec ckpt
    codec_encoder_ckpt = hf_hub_download(
        "amphion/MaskGCT", filename="acoustic_codec/model.safetensors"
    )
    codec_decoder_ckpt = hf_hub_download(
        "amphion/MaskGCT", filename="acoustic_codec/model_1.safetensors"
    )
    # download t2s model ckpt
    t2s_model_ckpt = hf_hub_download(
        "amphion/MaskGCT", filename="t2s_model/model.safetensors"
    )
    # download s2a model ckpt
    s2a_1layer_ckpt = hf_hub_download(
        "amphion/MaskGCT", filename="s2a_model/s2a_model_1layer/model.safetensors"
    )
    s2a_full_ckpt = hf_hub_download(
        "amphion/MaskGCT", filename="s2a_model/s2a_model_full/model.safetensors"
    )
    end_time = time.time()
    download_time = end_time - start_time
    print(f"download_time: {download_time} seconds")
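
Even when the files are already cached, each `hf_hub_download` call may contact the Hub to check for updates. One way to pay the network cost only once is to pre-fetch the whole checkpoint tree up front; a hedged sketch (the pattern list is an assumption derived from the filenames above):

```python
# Hypothetical one-time pre-fetch of the MaskGCT checkpoints, so that
# later runs resolve every file from the local Hugging Face cache.
CKPT_PATTERNS = [
    "semantic_codec/*",
    "acoustic_codec/*",
    "t2s_model/*",
    "s2a_model/*",
]

def prefetch(repo_id="amphion/MaskGCT"):
    # Imported lazily so the module can be inspected without the package.
    from huggingface_hub import snapshot_download
    # Returns the local directory that now contains the checkpoints.
    return snapshot_download(repo_id, allow_patterns=CKPT_PATTERNS)
```

After a successful pre-fetch, setting `HF_HUB_OFFLINE=1` makes subsequent `hf_hub_download` calls resolve from the cache without touching the network.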

c. load stage

    start_time = time.time()
    # load semantic codec
    safetensors.torch.load_model(semantic_codec, semantic_code_ckpt)
    # load acoustic codec
    safetensors.torch.load_model(codec_encoder, codec_encoder_ckpt)
    safetensors.torch.load_model(codec_decoder, codec_decoder_ckpt)
    # load t2s model
    safetensors.torch.load_model(t2s_model, t2s_model_ckpt)
    # load s2a model
    safetensors.torch.load_model(s2a_model_1layer, s2a_1layer_ckpt)
    safetensors.torch.load_model(s2a_model_full, s2a_full_ckpt)
    end_time = time.time()
    load_time = end_time - start_time
    print(f"load_time: {load_time} seconds")

d. inference stage

    # (runs inside the loop over input lines, hence `line_num` below)
    start_time = time.time()
    recovered_audio = maskgct_inference_pipeline.maskgct_inference(
        prompt_wav_path, prompt_text, target_text, "zh", "zh", target_len=10
    )
    end_time = time.time()
    infer_time = end_time - start_time
    print(f"Inference time for line {line_num}: {infer_time} seconds")

Then I found:

build_time: 202.2886848449707 seconds
download_time: 60.34766387939453 seconds
load_time: 32.2975959777832 seconds
Inference time for line 2: 14.074496269226074 seconds
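
For context, the one-time setup cost (roughly 295 s of build + download + load) amortizes over the number of utterances generated in a single process; a quick back-of-the-envelope check using the timings above:

```python
one_time = 202.29 + 60.35 + 32.30  # build + download + load, seconds
per_utt = 14.07                    # inference time per utterance

for n in (1, 10, 100):
    avg = (one_time + n * per_utt) / n
    print(f"{n:>3} utterances -> {avg:.1f} s each on average")
```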

Expected Outcome

Does this mean that every time I want to run inference on a piece of audio, I have to wait a long time for the models to load? Or is there something wrong with my setup?

JohnHerry commented 3 weeks ago

> *(quoted the original post in full)*

Hi, what is the inference speed of this model? It is said to have a NAR model structure, but there are two big models in this architecture, so I guess it will be no quicker than earlier AR-based pipelines.

yuantuo666 commented 3 weeks ago

Hi, the model only needs to load the model once. You can use the Gradio demo or Jupyter Notebook to maintain the models in memory. Besides, since not all required dependencies are pre-downloaded, it still takes time to download them from the web. This only takes time when you first generate a specific language sentence.