kdrkdrkdr opened 5 days ago

Hello, thank you for your hard work. Does this project also have a zero-shot function for reference voice?
Hi kdrkdrkdr,
Currently, I haven't implemented zero-shot functionality, but it can be achieved by using the SpeakerEncoder in
https://github.com/reppy4620/x-vits/blob/8bced31ea963245083df92e79442727daa787c74/src/x_vits/modules/encoder.py#L148-L176
For example, in src/x_vits/models/xvits.py the style vector is generated by StyleDiffusion during inference:
https://github.com/reppy4620/x-vits/blob/8bced31ea963245083df92e79442727daa787c74/src/x_vits/models/xvits.py#L137-L145
Instead, you can use the SpeakerEncoder to generate the style vector from a reference voice at inference time, in the same way it is used during training:
https://github.com/reppy4620/x-vits/blob/8bced31ea963245083df92e79442727daa787c74/src/x_vits/models/xvits.py#L62-L64
With that change, it should be possible to implement zero-shot capability.
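For illustration, here is a minimal sketch of that swap. The names `model.speaker_encoder`, `model.synthesize`, and the `ref_mel` input are assumptions, not the actual XVits API; the real call signatures should be taken from xvits.py.

```python
import torch


@torch.no_grad()
def infer_zero_shot(model, phoneme_ids, ref_mel):
    # Derive the style vector from a reference mel-spectrogram via the
    # SpeakerEncoder, mirroring what the training step does, instead of
    # sampling one from StyleDiffusion.
    # NOTE: `speaker_encoder` and `synthesize` are assumed names; adapt
    # them to the actual attributes/methods in xvits.py.
    style = model.speaker_encoder(ref_mel)  # e.g. [batch, style_dim]

    # Run the usual inference path, but feed the reference-derived style
    # vector in place of the StyleDiffusion sample.
    return model.synthesize(phoneme_ids, style=style)
```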
However, the current implementation only supports a single-speaker corpus such as LJSpeech. To make this work, you would need to prepare a Dataset class for multi-speaker data and a preprocessing function for a corpus with diverse speakers; that is what makes zero-shot inference achievable.
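As a rough sketch of the data side (the `speaker|wav_path|text` metadata layout is an assumption, not the repo's actual format), a multi-speaker Dataset might look like this:

```python
from pathlib import Path

import torchaudio
from torch.utils.data import Dataset


class MultiSpeakerDataset(Dataset):
    """Hypothetical multi-speaker dataset.

    Assumes a metadata file with one `speaker|wav_path|text` entry per
    line; the real preprocessing should reuse the existing
    single-speaker pipeline and just add the speaker dimension.
    """

    def __init__(self, metadata_path: str):
        self.items = [
            line.split("|")
            for line in Path(metadata_path).read_text().splitlines()
            if line.strip()
        ]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        speaker, wav_path, text = self.items[idx]
        wav, sample_rate = torchaudio.load(wav_path)
        # Mel extraction and text-to-phoneme conversion would follow the
        # existing pipeline; the reference mel for the SpeakerEncoder can
        # be computed from `wav` here or in the collate function.
        return {"speaker": speaker, "wav": wav, "sample_rate": sample_rate, "text": text}
```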
Thank you.