openvpi / DiffSinger

An advanced singing voice synthesis system with high fidelity, expressiveness, controllability, and flexibility, based on "DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism"
Apache License 2.0

running inference with acoustic model [question] #137

Closed nestyme closed 1 year ago

nestyme commented 1 year ago

Hello! Thank you for this repository. I have a question regarding the inference input parameters. I want to run a text+f0-to-song engine. I trained the acoustic model, but it looks like it still requires phoneme durations. So, is it correct that I cannot run inference with this version of DiffSinger without ground-truth phoneme durations for the input, as in the original repo (which would make DiffSinger more like voice conversion)? Thank you!

yqzhishen commented 1 year ago

You will need a G2P module/dictionary to convert text to phonemes and a duration predictor to get their durations, no matter which version of DiffSinger you are using. The original DiffSinger has an integrated Chinese G2P module and a duration predictor bound to the acoustic model. In this repository, you need to train a duration predictor as part of the variance model.
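
To illustrate the full chain this describes, here is a minimal sketch in Python. Every name in it (`dictionary`, `variance_model`, `acoustic_model`, `vocoder`) is a hypothetical placeholder, not this repository's actual API:

```python
# Minimal sketch of the inference chain described above.
# All names here are hypothetical placeholders, not this repo's API.

def synthesize(text, f0_curve, dictionary, variance_model, acoustic_model, vocoder):
    # 1. G2P: map each word/syllable to phonemes via the dictionary.
    phonemes = [p for word in text.split() for p in dictionary[word]]

    # 2. Duration prediction: the variance model predicts per-phoneme
    #    durations (otherwise ground-truth durations must be supplied,
    #    which is why the acoustic model alone "still wants" them).
    durations = variance_model.predict_durations(phonemes)

    # 3. Acoustic model: phonemes + durations + f0 -> mel-spectrogram.
    mel = acoustic_model(phonemes, durations, f0_curve)

    # 4. Vocoder: mel-spectrogram -> waveform.
    return vocoder(mel)
```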

nestyme commented 1 year ago

@yqzhishen thank you so much for the quick and informative response! Sorry to ask, but is there any documentation on how to prepare/train the variance model from scratch? I only found migration documentation in the variance folder.

yqzhishen commented 1 year ago

Prepare an acoustic dataset and extend it to a variance dataset - this is the standard workflow.

See the MakeDiffSinger repository for useful scripts and instructions.
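
As a rough illustration of what "extending" means here: an acoustic dataset's transcriptions.csv carries name, ph_seq, and ph_dur columns, and a variance dataset adds alignment columns such as ph_num, note_seq, and note_dur. The row below is made up for illustration; treat the MakeDiffSinger documentation as the authoritative reference for the exact format:

```csv
name,ph_seq,ph_dur,ph_num,note_seq,note_dur
sample_001,SP g ao SP,0.12 0.08 0.35 0.10,1 2 1,rest A#3 rest,0.12 0.43 0.10
```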

nestyme commented 1 year ago

thank you!