myshell-ai / OpenVoice

Instant voice cloning by MIT and MyShell.
https://research.myshell.ai/open-voice
MIT License

Question about speaker encoder input #145

Open · cameronfr opened this issue 7 months ago

cameronfr commented 7 months ago

The paper mentions that "The tone color extractor is a simple 2D convolutional neural network that operates on the mel-spectrogram of the input voice and outputs a single feature vector that encodes the tone color information." However, in api.py it looks like the extractor operates on the non-mel (linear) spectrogram:

        for fname in ref_wav_list:
            audio_ref, sr = librosa.load(fname, sr=hps.data.sampling_rate)
            y = torch.FloatTensor(audio_ref)
            y = y.to(device)
            y = y.unsqueeze(0)
            y = spectrogram_torch(y, hps.data.filter_length,
                                        hps.data.sampling_rate, hps.data.hop_length, hps.data.win_length,
                                        center=False).to(device)
            with torch.no_grad():
                g = self.model.ref_enc(y.transpose(1, 2)).unsqueeze(-1)
                gs.append(g.detach())
        gs = torch.stack(gs).mean(0)

I'm wondering if this is true, and if so, whether there was a reason for using the non-mel spectrogram (was quality better)?

Zengyi-Qin commented 7 months ago

Thanks for pointing this out. This is true. There is actually no performance difference between the two.

cameronfr commented 7 months ago

Ah, thank you. And to clarify, the mel input in question was ~128 channels?

AbdulbariSoylemez commented 5 months ago

How can I optimize the audio cloning process? What change could I make to the extract_se function?
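One common optimization, sketched below as a hypothetical wrapper rather than a change to OpenVoice's actual `extract_se`, is to cache the computed speaker embedding per reference file: the embedding depends only on the reference audio, so re-running the encoder on the same wav across calls is wasted work. The names `cached_se` and `compute_se` are illustrative placeholders.

```python
# Hypothetical sketch: cache speaker embeddings on disk so each reference
# wav only passes through the tone color extractor once.
import hashlib
from pathlib import Path
import torch

CACHE_DIR = Path("se_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_se(fname, compute_se):
    """compute_se(fname) -> Tensor is the expensive encoder call
    (e.g. the spectrogram + ref_enc loop shown earlier in this thread)."""
    # Key on file contents, so a renamed copy still hits the cache.
    key = hashlib.md5(Path(fname).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.pt"
    if cache_file.exists():
        return torch.load(cache_file)  # cache hit: skip the encoder entirely
    se = compute_se(fname)
    torch.save(se, cache_file)
    return se
```

On repeated cloning runs with the same reference speaker, this reduces the extraction step to a single disk read.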