glory20h / VoiceLDM

VoiceLDM: Text-to-Speech with Environmental Context
Apache License 2.0
163 stars, 8 forks

About speechT5 is trainable? #4

Open SuperiorDtj opened 5 months ago

SuperiorDtj commented 5 months ago

I found that in the training code, SpeechT5 can be trained. However, in the inference code, SpeechT5 is loaded with Microsoft's public weights. Could you please clarify whether training SpeechT5 affects the results?

glory20h commented 5 months ago

Hi, in the inference code, SpeechT5 is first loaded with the public weights, but its parameters are then overwritten with the state_dict from the VoiceLDM checkpoint.
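A minimal sketch of the load-then-overwrite pattern described above, for readers who want to check it on a toy model. `TinyTextEncoder` and `TinyTTS` are hypothetical stand-ins, not classes from the VoiceLDM repository, and the "public" weights are simulated with a different random seed.

```python
import torch
from torch import nn


class TinyTextEncoder(nn.Module):
    """Hypothetical stand-in for the SpeechT5 text encoder."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(10, 4)


class TinyTTS(nn.Module):
    """Hypothetical stand-in for the full model that owns the text encoder."""

    def __init__(self, text_encoder):
        super().__init__()
        self.text_encoder = text_encoder
        self.head = nn.Linear(4, 2)


# "Training" side: the checkpoint stores the whole model, text encoder included.
torch.manual_seed(0)
trained = TinyTTS(TinyTextEncoder())
checkpoint = {k: v.clone() for k, v in trained.state_dict().items()}

# "Inference" side: the encoder starts from different ("public") weights.
torch.manual_seed(1)
public_encoder = TinyTextEncoder()
public_weights = public_encoder.embed.weight.detach().clone()
model = TinyTTS(public_encoder)

# Loading the full state_dict replaces the encoder's parameters as well.
model.load_state_dict(checkpoint)

current = model.text_encoder.embed.weight
overwritten = torch.equal(current, checkpoint["text_encoder.embed.weight"])
still_public = torch.equal(current, public_weights)
```

In this toy setup, `overwritten` checks that the encoder now holds the checkpoint values, and `still_public` checks whether any of the initial weights survived the load.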

SuperiorDtj commented 5 months ago

Thanks for your quick reply!

SuperiorDtj commented 5 months ago

I have another question, if you don't mind answering. Could a regular phoneme sequence embedding network be used instead of SpeechT5 to achieve the same effect? In other words, is SpeechT5 necessary for modeling duration information, or can a regular nn.embedder + Durator achieve similar results?

glory20h commented 5 months ago

No, using SpeechT5 isn't strictly necessary; any form of 'text encoder' would likely do the job. As for using a single nn.embedder before Durator, I believe it's possible, but the linguistic modeling performance would likely be quite poor.
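One possible reading of the point above, offered as an illustration rather than as the author's explanation: an embedding lookup alone gives each token the same vector regardless of its neighbours, while an encoder with self-attention mixes in context. The sketch below compares the two on toy inputs. Both classes, the vocabulary size, and the dimensions are hypothetical and unrelated to the actual VoiceLDM or SpeechT5 code.

```python
import torch
from torch import nn


class EmbeddingOnlyEncoder(nn.Module):
    """Hypothetical 'single embedding layer' text encoder."""

    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, token_ids):
        return self.embed(token_ids)  # (batch, time, dim)


class ContextualEncoder(nn.Module):
    """Hypothetical small Transformer text encoder with the same output shape."""

    def __init__(self, vocab_size, dim, num_layers=2, num_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):
        return self.encoder(self.embed(token_ids))  # (batch, time, dim)


torch.manual_seed(0)
vocab_size, dim = 20, 16
lookup = EmbeddingOnlyEncoder(vocab_size, dim).eval()
contextual = ContextualEncoder(vocab_size, dim).eval()

# Two sequences that share their first token but differ afterwards.
a = torch.tensor([[1, 2, 3]])
b = torch.tensor([[1, 5, 6]])

with torch.no_grad():
    # Compare the representation of the shared first token in both contexts.
    lookup_same = torch.allclose(lookup(a)[:, 0], lookup(b)[:, 0])
    contextual_same = torch.allclose(contextual(a)[:, 0], contextual(b)[:, 0])
    out_shapes_match = lookup(a).shape == contextual(a).shape
```

Because both encoders return tensors of the same shape, either could in principle feed a downstream duration module. The difference lies only in whether the per-token features depend on the surrounding text.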

SuperiorDtj commented 5 months ago

Thanks for your advice! It's very helpful for my research!

SuperiorDtj commented 4 months ago

Have you tried freezing the parameters of SpeechT5? Or, is it necessary to update the text encoder parameters in this TTS modeling approach?

glory20h commented 4 months ago

I have tried both, and found that updating the text encoder's parameters led to better performance.
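For readers who want to reproduce the two settings compared above, here is a minimal PyTorch sketch of freezing versus updating a text encoder. The model is a toy `nn.ModuleDict` with hypothetical `text_encoder` and `decoder` entries. It is not the VoiceLDM training code, and it says nothing about which setting performs better.

```python
import torch
from torch import nn


def encoder_changes_after_step(train_text_encoder: bool) -> bool:
    """Run one optimizer step and report whether the encoder weights moved."""
    torch.manual_seed(0)
    # Hypothetical toy model: an embedding "text encoder" plus a linear "decoder".
    model = nn.ModuleDict(
        {"text_encoder": nn.Embedding(10, 4), "decoder": nn.Linear(4, 2)}
    )

    # Freeze or unfreeze the text encoder.
    for p in model["text_encoder"].parameters():
        p.requires_grad_(train_text_encoder)

    before = model["text_encoder"].weight.detach().clone()

    # Hand only the trainable parameters to the optimizer.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=0.1)

    tokens = torch.tensor([[1, 2, 3]])
    loss = model["decoder"](model["text_encoder"](tokens)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return not torch.equal(before, model["text_encoder"].weight)


frozen_changed = encoder_changes_after_step(train_text_encoder=False)
finetuned_changed = encoder_changes_after_step(train_text_encoder=True)
```

`frozen_changed` reports whether the encoder weights moved while frozen, and `finetuned_changed` reports the same for the trainable setting.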

SuperiorDtj commented 4 months ago

Thanks for your reply! It's very helpful for my research!