Closed CuiLvYing closed 2 months ago
Hi @CuiLvYing, thanks for your efforts! Would you please attach some demos (such as the generated voices or your WebUI's video) like PR https://github.com/open-mmlab/Amphion/pull/56?
Of course! Here are some test demo videos and audios.
https://github.com/open-mmlab/Amphion/assets/166400963/f6ace087-bb14-4ae0-b5d5-1b82673af053
https://github.com/open-mmlab/Amphion/assets/166400963/423532c6-e190-4ec1-b73d-6217d614c483
https://github.com/open-mmlab/Amphion/assets/166400963/695eeb3e-da8d-4525-8f28-870c185f5bfe
https://github.com/open-mmlab/Amphion/assets/166400963/ec5dd431-0e54-4466-85b9-d4833fd18684
https://github.com/open-mmlab/Amphion/assets/166400963/f752ea9d-a950-4831-bd30-ffd9fb6fd6f5
You can even have a look at our running demo WebUI now: https://24a8ca30d15dff216c.gradio.live This test uses MultipleContentSVC and takes at least 200 seconds to produce output. However, I think our pre-trained model checkpoint has some flaws (it is not trained enough) and may not perform well; sorry for that.
Sorry, I found that the target audio used was not uploaded. Here it is:
https://github.com/open-mmlab/Amphion/assets/166400963/48b44ad1-7769-4c63-a0a9-c75e349db972
Hi @CuiLvYing, I'm confused about your samples. For VC, the converted audio should speak the source's content with the target's timbre. Please use your model to convert the samples of PR: https://github.com/open-mmlab/Amphion/pull/201. Then we can compare with yours :)
I think we were attempting to make the speaker from "Infsource" speak the content of the "target", which is exactly the opposite of your definition; we'll amend this soon. Here are some audios after correcting the WebUI:
https://github.com/open-mmlab/Amphion/assets/166400963/5c0aa9fe-d8c2-4dc9-adb1-2aac2851e750
https://github.com/open-mmlab/Amphion/assets/166400963/3dc3d0c8-f6fa-4c51-9402-fe88102c7551
https://github.com/open-mmlab/Amphion/assets/166400963/351f0386-4d5a-49cc-8a6d-b64d13337cab
https://github.com/open-mmlab/Amphion/assets/166400963/19d06e7b-f84b-43cc-a6a2-a07a45d02b1e
https://github.com/open-mmlab/Amphion/assets/166400963/ac1d8e19-52a9-4bc1-8cfd-8e785d374042
https://github.com/open-mmlab/Amphion/assets/166400963/a0921064-a49a-403e-8c6c-8eb9c8edaf1e
https://github.com/open-mmlab/Amphion/assets/166400963/9925fa65-f38c-443b-8ac4-cfe88c8acb99
https://github.com/open-mmlab/Amphion/assets/166400963/7901ee53-7bca-44a8-b6d2-131515319537
https://github.com/open-mmlab/Amphion/assets/166400963/28f91872-0d56-498d-8f5a-209d9f770e5c
The naturalness, especially the intelligibility, sounds poor to me. So I recommend not merging this PR unless there is a substantial improvement. @Adorable-Qin Please review the code and documentation carefully.
✨ Description
This is an implementation of a simple WebUI that provides quick, text-free, one-shot voice conversion for the uninitiated. Theoretically, the user only needs two short audios (source and target) and a few minutes to receive the VC result. It is designed to use a base model (checkpoint) trained on the VCTK and M4Singer datasets (or other supported datasets) as a foundation, and then fine-tune the base model using the input source audio for voice conversion and output. It currently supports MultipleContentSVC and VITS.
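The one-shot flow described above could be sketched roughly as follows. This is only an illustrative outline: the function, class, and step names here are placeholders, not Amphion's actual API.

```python
from dataclasses import dataclass


@dataclass
class VCRequest:
    """Inputs the WebUI collects from the user (illustrative sketch)."""
    source_path: str                   # audio supplying the content
    target_path: str                   # audio supplying the timbre
    model: str = "MultipleContentSVC"  # or "VITS"


def plan_one_shot_vc(req: VCRequest) -> list:
    """Return the ordered steps the WebUI would run; names are hypothetical."""
    return [
        "load base %s checkpoint pre-trained on VCTK / M4Singer" % req.model,
        "fine-tune the base model on the user's audio (%s)" % req.target_path,
        "extract content features from the source (%s)" % req.source_path,
        "synthesize: source content rendered with the target speaker's timbre",
    ]


if __name__ == "__main__":
    for step in plan_one_shot_vc(VCRequest("source.wav", "target.wav")):
        print("-", step)
```

The fine-tuning step is what makes each request take a few minutes (the thread above mentions at least 200 seconds with MultipleContentSVC), since it adapts the pre-trained checkpoint before synthesis rather than running pure zero-shot inference.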
🚧 Related Issues
None
👨💻 Changes Proposed
Please refer to the commits for details.
🧑🤝🧑 Who Can Review?
[Please use the '@' symbol to mention any community member who is free to review the PR once the tests have passed. Feel free to tag members or contributors who might be interested in your PR.] @zhizhengwu @RMSnow @Adorable-Qin
🛠 TODO
✅ Checklist