open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License
7.75k stars 589 forks

The performance of MaskGCT is not meeting expectations. #348

Open MonolithFoundation opened 1 week ago

MonolithFoundation commented 1 week ago

output.wav.zip

Why does the output audio sound like this?


    target_text = """大家好,我是雷军。我是小米科技的创始人,一个有梦想的年轻人。我在小米科技的前几年,一直是一个做硬件的人,现在我想做一个软件的人。我想做一个让每个人都能用上的手机,这是我的梦想。
现在,我想让每个人都开上汽车。
"""
    # inference
    infer(
        prompt_wav_path="data/leijun-prompt.wav",
        prompt_text="我第四次办年度演讲,前三次呢,前三次呢因为疫情的原因,都在小米科技园内举办。现场的人很少,这是第四次,我们仔细想了想,我们还是想办一个比较大的,",
        target_text=target_text,
        source_lang="zh",
        target_lang="zh",
        save_path="output/output.wav",
    )

infer is the code from the ipynb notebook; I did not change anything.

yuantuo666 commented 1 week ago

Hi, please provide the prompt WAV file so we can check better.

BTW, we prefer English issues: https://github.com/open-mmlab/Amphion/issues/304#issuecomment-2446921654.

MonolithFoundation commented 1 week ago

Thank you!

I switched to a different prompt WAV, and the result sounds normal now. I still have two questions:

  1. I want to control the length of the output audio. Is that possible? If the requested length is too short, will generation fail?
  2. If the prompt audio contains background noise (such as background music), how does that affect the final result, and is there a way to fix it?

yuantuo666 commented 1 week ago

  1. Yes. You can specify the target_len parameter, which is in seconds. The range could be 0.8x-1.2x of the normal duration. If the duration is too short, the model might skip some words or fail to generate.
  2. See this comment: https://github.com/open-mmlab/Amphion/issues/305#issuecomment-2483521445
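
A minimal sketch of the 0.8x-1.2x guidance for target_len, not code from the Amphion repository. The helper name clamp_target_len and the 20-second "normal" duration are hypothetical; the normal duration is assumed to be something you measure yourself, for example from a run without target_len. The helper only keeps a requested length, in seconds, inside that range before you pass it on.

```python
def clamp_target_len(requested_s: float, normal_s: float,
                     low: float = 0.8, high: float = 1.2) -> float:
    """Clamp a requested duration (seconds) to [low * normal_s, high * normal_s].

    Hypothetical helper: the 0.8 and 1.2 defaults come from the range
    suggested in this thread, not from a documented hard limit.
    """
    if normal_s <= 0:
        raise ValueError("normal_s must be positive")
    return min(max(requested_s, low * normal_s), high * normal_s)


# Assumed value: duration (seconds) of the output generated without target_len.
normal_s = 20.0

for requested_s in (10.0, 18.0, 30.0):
    target_len = clamp_target_len(requested_s, normal_s)
    # target_len is the value you would then pass as the target_len parameter.
    print(f"requested {requested_s:.1f}s -> target_len {target_len:.1f}s")
```

Requests below the range are raised to the lower bound and requests above it are lowered to the upper bound, which avoids the too-short case where words may be skipped.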