reppy4620 / x-vits

MIT License

Zeroshot-Feature? #1

Open kdrkdrkdr opened 5 days ago

kdrkdrkdr commented 5 days ago

Hello, thank you for your hard work. Does this project also support zero-shot inference from a reference voice?

Thank you.

reppy4620 commented 5 days ago

Hi kdrkdrkdr,

Currently, I haven't implemented zero-shot functionality, but it can be achieved by utilizing a SpeakerEncoder in https://github.com/reppy4620/x-vits/blob/8bced31ea963245083df92e79442727daa787c74/src/x_vits/modules/encoder.py#L148-L176

For example, in src/x_vits/models/xvits.py, the style vector is generated by StyleDiffusion during inference: https://github.com/reppy4620/x-vits/blob/8bced31ea963245083df92e79442727daa787c74/src/x_vits/models/xvits.py#L137-L145 Instead, you can use a SpeakerEncoder to generate the style vector from a reference voice at inference time, the same way it is computed during training: https://github.com/reppy4620/x-vits/blob/8bced31ea963245083df92e79442727daa787c74/src/x_vits/models/xvits.py#L62-L64 With that change, zero-shot inference should be possible.
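To illustrate the idea, here is a minimal sketch of the swap described above. The class and function names (`ToySpeakerEncoder`, `infer_style_vector`) are illustrative stand-ins, not the actual x-vits API: at inference, the style vector is derived from a reference mel spectrogram via a speaker encoder instead of being sampled by StyleDiffusion.

```python
import numpy as np

class ToySpeakerEncoder:
    """Illustrative stand-in for a SpeakerEncoder: mean-pool mel
    frames over time, then project to the style-vector dimension."""

    def __init__(self, n_mels: int, style_dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((n_mels, style_dim)) * 0.01

    def __call__(self, ref_mel: np.ndarray) -> np.ndarray:
        # ref_mel: (n_mels, frames) -> pooled: (n_mels,)
        pooled = ref_mel.mean(axis=1)
        return pooled @ self.proj  # (style_dim,)

def infer_style_vector(ref_mel: np.ndarray,
                       speaker_encoder: ToySpeakerEncoder) -> np.ndarray:
    # Zero-shot path: compute the style vector from the reference
    # utterance, mirroring the training-time computation, instead
    # of sampling it with StyleDiffusion.
    return speaker_encoder(ref_mel)

encoder = ToySpeakerEncoder(n_mels=80, style_dim=128)
ref_mel = np.zeros((80, 200))  # hypothetical reference spectrogram
style = infer_style_vector(ref_mel, encoder)
print(style.shape)  # (128,)
```

The rest of the inference path would then consume `style` exactly as it consumes the StyleDiffusion output today.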

However, the current implementation only supports single-speaker corpora such as LJSpeech. To make this work, you would need to prepare a Dataset class and a preprocessing function for multi-speaker data, so the model can be trained on a diverse set of speakers, which is a prerequisite for zero-shot inference.
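A rough sketch of what such a Dataset class might look like (this is not the actual x-vits Dataset; the field names and structure are assumptions): each item pairs an utterance with a reference clip from the same speaker, so the speaker encoder sees a reference voice during training just as it would at zero-shot inference time.

```python
import random

class MultiSpeakerDataset:
    """Illustrative multi-speaker dataset sketch: each item pairs an
    utterance with a same-speaker reference clip for the speaker
    encoder."""

    def __init__(self, utterances):
        # utterances: list of (speaker_id, wav_path) pairs
        self.items = utterances
        self.by_speaker = {}
        for spk, path in utterances:
            self.by_speaker.setdefault(spk, []).append(path)

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        spk, path = self.items[idx]
        # Pick a (possibly different) utterance from the same speaker
        # to serve as the reference input to the speaker encoder.
        ref_path = random.choice(self.by_speaker[spk])
        return {"speaker": spk, "wav": path, "ref_wav": ref_path}

data = MultiSpeakerDataset([("spk1", "a.wav"), ("spk1", "b.wav"),
                            ("spk2", "c.wav")])
item = data[2]
print(item["speaker"])  # spk2
```

The preprocessing step would additionally need to extract features (e.g. mel spectrograms) per speaker, but the key point is that every training example carries a same-speaker reference.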