open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License

Quick (Singing) Voice Conversion #200

Closed CuiLvYing closed 2 months ago

CuiLvYing commented 5 months ago

✨ Description

This PR implements a simple WebUI that provides quick, text-free, one-shot voice conversion for newcomers. Theoretically, the user only needs two short audio clips (source and target) and a few minutes to obtain the VC result. The design uses a base model (checkpoint) trained on the VCTK and M4Singer datasets (or other supported datasets) as a foundation, then fine-tunes the base model using the input source audio and runs voice conversion to produce the output. It currently supports MultipleContentSVC and VITS.
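To illustrate the intended flow, a minimal Gradio sketch of such a WebUI could look like the following. This is not the PR's actual code; `run_voice_conversion` is a hypothetical placeholder for the fine-tune-then-convert pipeline described above.

```python
# Minimal sketch of a one-shot VC WebUI (hypothetical, not the PR's implementation).
import gradio as gr

def run_voice_conversion(source_path: str, target_path: str) -> str:
    # 1. Load the pre-trained base checkpoint (e.g., trained on VCTK / M4Singer).
    # 2. Briefly fine-tune it on the uploaded audio.
    # 3. Run conversion and return the path of the generated waveform.
    raise NotImplementedError("hypothetical pipeline; see the PR's commits for the real one")

demo = gr.Interface(
    fn=run_voice_conversion,
    inputs=[
        gr.Audio(type="filepath", label="Source audio"),
        gr.Audio(type="filepath", label="Target audio"),
    ],
    outputs=gr.Audio(type="filepath", label="Converted audio"),
    title="Quick (Singing) Voice Conversion",
)

if __name__ == "__main__":
    demo.launch(share=True)  # share=True creates a temporary *.gradio.live link
```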

🚧 Related Issues

None

👨‍💻 Changes Proposed

If applicable, please refer to the commits.

🧑‍🤝‍🧑 Who Can Review?

[Please use the '@' symbol to mention any community member who is free to review the PR once the tests have passed. Feel free to tag members or contributors who might be interested in your PR.] @zhizhengwu @RMSnow @Adorable-Qin

🛠 TODO

✅ Checklist

RMSnow commented 5 months ago

Hi @CuiLvYing, thanks for your efforts! Would you please attach some demos (such as the generated voices or your WebUI's video) like PR https://github.com/open-mmlab/Amphion/pull/56?

CuiLvYing commented 5 months ago

Of course! Here are some test demo videos and audio samples.

https://github.com/open-mmlab/Amphion/assets/166400963/f6ace087-bb14-4ae0-b5d5-1b82673af053

https://github.com/open-mmlab/Amphion/assets/166400963/423532c6-e190-4ec1-b73d-6217d614c483

https://github.com/open-mmlab/Amphion/assets/166400963/695eeb3e-da8d-4525-8f28-870c185f5bfe

https://github.com/open-mmlab/Amphion/assets/166400963/f752ea9d-a950-4831-bd30-ffd9fb6fd6f5

https://github.com/open-mmlab/Amphion/assets/166400963/ec5dd431-0e54-4466-85b9-d4833fd18684

You can also have a look at our running demo WebUI now: https://24a8ca30d15dff216c.gradio.live This test uses MultipleContentSVC and takes at least 200 seconds to produce output. However, I think our pre-trained model checkpoint has some flaws (it is not trained enough) and may not perform well; sorry about that.

CuiLvYing commented 5 months ago

Sorry, I found that the target audio used in the tests was not uploaded. Here it is:

https://github.com/open-mmlab/Amphion/assets/166400963/48b44ad1-7769-4c63-a0a9-c75e349db972

RMSnow commented 5 months ago

Hi @CuiLvYing, I'm confused about your samples. For VC, the converted audio should speak the source's content with the target's timbre. Please use your model to convert the samples of PR https://github.com/open-mmlab/Amphion/pull/201 so that we can compare yours :)
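To spell out that convention, here is a toy sketch with a hypothetical `convert` stub and made-up file names (not Amphion's actual API):

```python
# The output should carry the *content* of the source and the *timbre* of the target.
def convert(source_wav: str, target_wav: str) -> str:
    """Hypothetical one-shot VC call: returns the path of the converted audio."""
    ...

converted = convert(
    source_wav="source_utterance.wav",  # what is said comes from here
    target_wav="target_speaker.wav",    # whose voice it sounds like comes from here
)
```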

CuiLvYing commented 5 months ago

I think we were attempting to make the speaker from "Infsource" say the content of the "target", which is exactly the opposite of your definition; we will amend this soon. Here are some audios after correcting the WebUI:

https://github.com/open-mmlab/Amphion/assets/166400963/5c0aa9fe-d8c2-4dc9-adb1-2aac2851e750

https://github.com/open-mmlab/Amphion/assets/166400963/3dc3d0c8-f6fa-4c51-9402-fe88102c7551

https://github.com/open-mmlab/Amphion/assets/166400963/351f0386-4d5a-49cc-8a6d-b64d13337cab

https://github.com/open-mmlab/Amphion/assets/166400963/19d06e7b-f84b-43cc-a6a2-a07a45d02b1e

https://github.com/open-mmlab/Amphion/assets/166400963/ac1d8e19-52a9-4bc1-8cfd-8e785d374042

https://github.com/open-mmlab/Amphion/assets/166400963/a0921064-a49a-403e-8c6c-8eb9c8edaf1e

https://github.com/open-mmlab/Amphion/assets/166400963/9925fa65-f38c-443b-8ac4-cfe88c8acb99

https://github.com/open-mmlab/Amphion/assets/166400963/7901ee53-7bca-44a8-b6d2-131515319537

https://github.com/open-mmlab/Amphion/assets/166400963/28f91872-0d56-498d-8f5a-209d9f770e5c

RMSnow commented 5 months ago

The naturalness, especially the intelligibility, sounds poor to me. So I recommend not merging this PR unless there is a substantial improvement. @Adorable-Qin Please review the code and documentation carefully.