ChatTTS emb finetune with `DVAE_full.pt`

lenML / Speech-AI-Forge

🍦 Speech-AI-Forge is a project built around TTS generation models; it implements an API server and a Gradio-based WebUI.

https://huggingface.co/spaces/lenML/ChatTTS-Forge

GNU Affero General Public License v3.0

710 stars, 87 forks

ChatTTS emb finetune with `DVAE_full.pt` #121

Open zhzLuke96 opened 2 months ago

zhzLuke96 commented 2 months ago

TODOs

[ ] Fix the speaker embedding finetuning code: https://github.com/lenML/ChatTTS-Forge/blob/318d33f8d0b1451a39b3cbc94debca7f4f21dfca/modules/finetune/train_speaker.py#L15-L26
[ ] Use the encoder in DVAE_full.pt to compute loss
[ ] Add some simple tests

refs: #45

liyuanqi123 commented 2 months ago

Do you plan to try finetuning other modules, such as the decoder? Thanks!

zhzLuke96 commented 2 months ago

> Do you plan to try finetuning other modules, such as the decoder? Thanks!

There may also be LoRA finetuning for the LLM module; nothing else is planned for now.

cpken commented 2 months ago

Running webui.py fails with an error:

FileNotFoundError: [Errno 2] No such file or directory: './models/ChatTTS/asset/DVAE_full.pt'

These two files mention DVAE_full.pt in their configuration:

modules/repos_static/ChatTTS/ChatTTS/config/config.py
scripts/dl_chattts.py

However, it is not present at https://huggingface.co/spaces/lenML/ChatTTS-Forge/tree/main/models/ChatTTS/asset. Please add the DVAE_full.pt file.

cpken commented 2 months ago

'./models/ChatTTS/asset/DVAE_full.pt' can be obtained from https://huggingface.co/2Noise/ChatTTS.

zhzLuke96 commented 2 months ago

> Running webui.py fails with an error:
> FileNotFoundError: [Errno 2] No such file or directory: './models/ChatTTS/asset/DVAE_full.pt'
> These two files mention DVAE_full.pt in their configuration:
>
> modules/repos_static/ChatTTS/ChatTTS/config/config.py
>
> scripts/dl_chattts.py
>
> However, it is not present at https://huggingface.co/spaces/lenML/ChatTTS-Forge/tree/main/models/ChatTTS/asset. Please add the DVAE_full.pt file.

To download or update the models, please use this script:

python -m scripts.dl_chattts --source huggingface
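A quick pre-flight check can confirm the checkpoint is in place before launching webui.py. The sketch below is not part of the repository: `missing_assets` is a hypothetical helper, and the only path it checks is the one quoted in the `FileNotFoundError` above.

```python
from pathlib import Path

# Path copied from the FileNotFoundError in this thread; adjust it if your
# checkout keeps models elsewhere.
DVAE_FULL = Path("./models/ChatTTS/asset/DVAE_full.pt")


def missing_assets(paths):
    """Return the paths (as Path objects) that do not exist as regular files."""
    return [p for p in map(Path, paths) if not p.is_file()]


if __name__ == "__main__":
    missing = missing_assets([DVAE_FULL])
    if missing:
        print("Missing:", ", ".join(str(p) for p in missing))
        # Download command quoted from the maintainer's reply.
        print("Try: python -m scripts.dl_chattts --source huggingface")
    else:
        print("All assets present.")
```

Run it from the repository root so that the relative path resolves the same way it does for webui.py.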

Please take questions unrelated to this issue to Discussions. Thanks.

bokesyo commented 1 month ago

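A question for the expert: when finetuning only the embedding, with or without the VAE encoder weights, roughly what audio CE loss is needed for the output to sound very much like the actual speaker? Thanks~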

zhzLuke96 commented 1 month ago

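> A question for the expert: when finetuning only the embedding, with or without the VAE encoder weights, roughly what audio CE loss is needed for the output to sound very much like the actual speaker? Thanks~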

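There is no exact loss value that signals training is done. One issue is overfitting; the other is frequency coverage: some frequency ranges are a "blind spot" for the model, and finetuning the embed alone cannot get you to "very similar". So the final loss also depends on your dataset and its frequency content. In general, you can take a few random embeds and measure roughly what the base loss is for your dataset with the current model. If training ends up substantially below that base loss, it is basically effective training. As for what loss makes it sound very similar, that takes gradual tuning.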
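The baseline procedure described above can be sketched roughly as follows. Everything here is an assumption made for illustration: `loss_fn` stands in for whatever evaluates the current model on your dataset for a given embed, the standard-normal sampling is a placeholder for however random speaker embeds are actually drawn, and the `ratio` threshold of 0.5 is an arbitrary reading of "substantially below", not a value from this thread.

```python
import random
import statistics


def baseline_loss(loss_fn, dim, n_trials=5, seed=0):
    """Average loss over a few random embeds (hypothetical helper).

    loss_fn: callable taking an embed (list of floats) and returning a float.
    dim:     embed dimensionality (assumed known to the caller).
    """
    rng = random.Random(seed)
    losses = []
    for _ in range(n_trials):
        # Assumption: standard-normal sampling; replace with the model's
        # own random-speaker sampling when evaluating for real.
        embed = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        losses.append(loss_fn(embed))
    return statistics.mean(losses)


def is_effective(trained_loss, baseline, ratio=0.5):
    """True if the trained loss is below ratio * baseline (ratio is arbitrary)."""
    return trained_loss < ratio * baseline


if __name__ == "__main__":
    # Toy stand-in loss: mean squared distance to a fixed target embed.
    target = [0.5] * 8

    def toy_loss(embed):
        return sum((a - b) ** 2 for a, b in zip(embed, target)) / len(embed)

    base = baseline_loss(toy_loss, dim=8)
    print("baseline:", base)
    print("effective at 10% of baseline:", is_effective(0.1 * base, base))
```

A loss well below the baseline only indicates that training moved the embed in a useful direction; as noted above, it does not guarantee the voice sounds like the target speaker.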