p0p4k / vits2_pytorch

unofficial vits2-TTS implementation in pytorch
https://arxiv.org/abs/2307.16430
MIT License

Bad quality audio when infer with custom condition #74

Closed · huutuongtu closed this issue 9 months ago

huutuongtu commented 9 months ago

Hi, I tried your code for training under both multi-speaker and single-speaker conditions, and it worked well for both training and inference. However, I then made some minor changes to model.py, modifying the forward and inference functions to replace the speaker ID with speaker embeddings from a pre-trained speaker recognition model:

        if n_speakers > 1:
            self.emb_g = nn.Embedding(n_speakers, gin_channels)
            # project the 768-dim external speaker embedding to gin_channels
            self.linlin = nn.Linear(768, gin_channels)

    def forward(self, x, x_lengths, y, y_lengths, sid=None):
        if self.n_speakers > 0:
            # g = self.emb_g(sid).unsqueeze(-1)  # [b, h, 1]
            # sid now carries a pre-trained speaker embedding instead of an id
            g = sid
            g = self.linlin(g).unsqueeze(-1)  # [b, h, 1]
        else:
            g = None

....

    def infer(
        self,
        x,
        x_lengths,
        sid=None,
        noise_scale=1,
        length_scale=1,
        noise_scale_w=1.0,
        max_len=None,
    ):

        if self.n_speakers > 0:
            # g = self.emb_g(sid).unsqueeze(-1)  # [b, h, 1]
            # same change as in forward: sid is a pre-trained speaker embedding
            g = sid
            g = self.linlin(g).unsqueeze(-1)  # [b, h, 1]
        else:
            g = None
        x, m_p, logs_p, x_mask = self.enc_p(x, x_lengths, g=g)

....
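The modified conditioning path can be checked in isolation. Below is a minimal sketch (my own, not from the repo) where a random tensor stands in for the output of the pre-trained speaker recognition model, assumed to have shape [batch, 768] as implied by the nn.Linear(768, gin_channels) above:

```python
import torch
import torch.nn as nn

gin_channels = 256  # assumption: the usual VITS default

# mirrors the linlin projection from the snippet above
linlin = nn.Linear(768, gin_channels)

# dummy batch of speaker embeddings standing in for the
# pre-trained speaker recognition model's output: [b, 768]
sid = torch.randn(2, 768)

# project and add the trailing time axis expected by the decoder: [b, h, 1]
g = linlin(sid).unsqueeze(-1)
print(g.shape)  # torch.Size([2, 256, 1])
```

If the shapes here match what the rest of the model expects, the conditioning wiring itself is probably fine, which points the blame elsewhere (e.g., the input tokens).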

Training works well and the losses converge: [2.560758352279663, 2.2537946701049805, 3.8962457180023193, 20.862136840820312, 0.8815252184867859, 2.2975285053253174, 24100, 0.00019459892692329838]

But when I infer, the audio quality is very poor (the audio captures the speaker's style, but no words are intelligible). Do you have any idea what could cause this?

Sample:
Text: Scarcely had he uttered the name when Pierre's closing eyes shot open
Audio: https://drive.google.com/file/d/1OtWPVw82alLTV3n4i7e9kb1FaKPJ0OBW/view?usp=sharing

p0p4k commented 9 months ago

Maybe there is some issue with the tokens ("x") going in; perhaps blanks were added during training but not during inference?
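For context: in the VITS-style pipeline, when hps.data.add_blank is set, a blank token (id 0) is interspersed between every symbol before the sequence is fed to the model, and inference must do the same or the text encoder sees sequences unlike anything it was trained on. A minimal sketch of that step (intersperse mirrors the helper in commons.py; the token ids are made-up examples):

```python
def intersperse(lst, item):
    # insert `item` between and around every element, as the VITS
    # preprocessing does when hps.data.add_blank is True
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

# hypothetical token ids as produced by text_to_sequence
tokens = [12, 45, 7]
print(intersperse(tokens, 0))  # [0, 12, 0, 45, 0, 7, 0]
```

If training used this interspersed form but the custom inference script feeds raw token ids, the symptom would be exactly what is described: speaker identity preserved (it comes from g) but unintelligible speech.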