open-mmlab / Amphion

Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation research and development.
https://openhlt.github.io/amphion/
MIT License
4.28k stars 363 forks

[Help]: Some questions about the SVC model #187

Closed Yuki-zik closed 2 months ago

Yuki-zik commented 3 months ago

Problem Overview

I have two questions about your model:

  1. Why is the CoMo model larger than the teacher model in diffcomoSVC?
  2. If my dataset contains multiple singers, will the data from singer A improve the quality of timbre conversion for singer B?
Lokshaw-Chau commented 2 months ago

Hi, @Yuki-zik.

  1. A saved diffcomoSVC checkpoint actually contains two models: a target model (EMA updated) and a student model (online updated). Both have the same architecture as the teacher model, so the checkpoint is roughly twice the teacher's size. We designed it this way to support resuming training, which needs both models in order to smoothly restore the latest training state. During inference, only the target model is used.
  2. It's hard to give a definitive answer. In my experience, though, a large multi-singer dataset often leads to better conversion results for each singer. If you are only targeting one singer, it's always better to collect as much data from that singer as possible. If you can't collect enough data, a possible solution is to pre-train a model on a multi-singer dataset and then fine-tune the pre-trained model on your target singer's dataset.
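
To illustrate point 1, here is a minimal sketch in plain Python of why a checkpoint holding both a student and an EMA-updated target is larger than the teacher alone. It is not Amphion's actual code: the function names (`ema_update`, `build_checkpoint`, `load_for_inference`), the dictionary keys `"student"` and `"target"`, and the decay value are all assumptions made up for illustration, and the "weights" are just toy floats.

```python
import copy
import pickle


def ema_update(target, student, decay=0.95):
    """Move each target parameter a small step toward the student's value."""
    for name, value in student.items():
        target[name] = decay * target[name] + (1.0 - decay) * value


def build_checkpoint(student, target):
    """Bundle both copies so training could resume from the same state.

    The key names are hypothetical, not Amphion's real checkpoint layout.
    """
    return {"student": copy.deepcopy(student), "target": copy.deepcopy(target)}


def load_for_inference(checkpoint):
    """Inference only needs the EMA-updated target weights."""
    return checkpoint["target"]


# Toy "teacher" weights; student and target start as copies of the teacher.
teacher = {f"w{i}": float(i) for i in range(1000)}
student = copy.deepcopy(teacher)
target = copy.deepcopy(teacher)

# One pretend training step: the student changes, then the target follows via EMA.
for name in student:
    student[name] += 1.0
ema_update(target, student)

# Compare the serialized size of the teacher alone with the two-model checkpoint.
teacher_bytes = len(pickle.dumps(teacher))
checkpoint_bytes = len(pickle.dumps(build_checkpoint(student, target)))
print(teacher_bytes, checkpoint_bytes)
```

The saved checkpoint carries two full sets of weights, while `load_for_inference` returns only one of them, which matches the description above that only the target model is used at inference time.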
Yuki-zik commented 2 months ago

I understand, thank you for your answer.