revsic / torch-nansypp

NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis
MIT License

Question about pitch loss #1

Open LEECHOONGHO opened 1 year ago

LEECHOONGHO commented 1 year ago

Hello, thank you for sharing your work.

In the paper, the pitch loss is calculated by a formula like the one below:

pitch_loss = huber_norm(log2(pred_shifted_f0) + 0.5 * d, log2(pred_f0), delta=???)

But I can't understand why the scale of d is 0.5. I think that if it represents the log-scale F0 difference between the original and the shifted F0, then it should be:

import numpy as np

fmin = 32.7
n_bins = 191
bins_per_octave = 24
cqt_frequencies = fmin * 2.0 ** (np.arange(n_bins) / float(bins_per_octave))
cqt_freq_max = cqt_frequencies[-1]  # about 7901 Hz
log2_freq_per_d = (np.log2(cqt_freq_max) - np.log2(fmin)) / (n_bins - 1)
# log2_freq_per_d = 1 / 24 ≈ 0.0417

Do you know why the scale of d is 0.5?
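To make the question concrete, here is a minimal sketch of the loss as I read it, using huber_loss from torch.nn.functional; d is the CQT bin shift and d_scale is the 0.5 in question (these names are mine, not the paper's):

import torch
import torch.nn.functional as F

def pitch_consistency_loss(f0, f0_shifted, d, d_scale=0.5, delta=1.0):
    # f0, f0_shifted: pitch estimates from the original and the d-bin-shifted CQT.
    # d_scale is the 0.5 in question; 1 / 24 would be the plain log2 step per bin.
    return F.huber_loss(torch.log2(f0_shifted) + d_scale * d,
                        torch.log2(f0), delta=delta)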

And did you proceed with training NANSY++? Did it generate good output, and did its losses converge smoothly?

Thank you.

LEECHOONGHO commented 1 year ago

And I think delta should be less than 1.0, even lower than the d scale. In the reference paper of NANSY++, delta is 0.25 * d_scale (sigma in the paper). https://arxiv.org/pdf/1910.11664.pdf

revsic commented 1 year ago

Since each bin of the CQT corresponds to 1 / bins_per_semitone semitones, shifting the CQT by shift bins induces a pitch shift of -shift / bins_per_semitone semitones. In the paper, the authors assume 24 bins per octave, i.e. 2 bins per semitone, so we can write the pitch consistency as 12 x (log2(pitch(CQT[shift:shift + interval])) - log2(pitch(CQT[:interval]))) = -shift / 2. I also wonder why they do not multiply the difference of log pitch by the number of semitones per octave (=12). I'm testing it now, and I think I can share the result with you in the near future.
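A quick numeric check of that relation, assuming 24 bins per octave (2 bins per semitone); the shift value is just an example:

bins_per_octave = 24
bins_per_semitone = bins_per_octave // 12         # = 2
shift = 6                                         # shift the CQT up by 6 bins

log2_ratio = -shift / bins_per_octave             # -0.25 octave pitch difference
semitone_diff = 12 * log2_ratio                   # -3 semitones
print(semitone_diff, -shift / bins_per_semitone)  # both print -3.0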

[image attachment]

revsic commented 1 year ago

And thank you for the reference; I'll test 0.25 x d_scale as the delta value too. @LEECHOONGHO Sorry for my late reply.

LEECHOONGHO commented 1 year ago

Thank you for your reply @revsic. I also have a question. In the paper, F0 is calculated as a weighted sum over the 64 bins of the pitch encoder's output, which requires a preceding softmax layer.

# 64 log-spaced pitch bins between 50 and 1000 Hz
self.pitch_bin_weights = torch.linspace(np.log(50), np.log(1000), 64).exp()

f0, p_amp, ap_amp = torch.split(self.proj(x), [self.f0_bins, 1, 1], dim=-1)
f0 = torch.softmax(f0, dim=-1)
pitch = (f0 * self.pitch_bin_weights).sum(dim=-1)

But when I trained the model with this process, the estimated pitch was much higher than the ground-truth pitch (by about +300 Hz). The model's output samples have no pitch harmonics, and the periodic amplitude fades at the wrong pitch.
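A quick sanity check of that upward bias (my own numbers, not from the paper): a near-uniform softmax over these log-spaced bins already puts the weighted sum at roughly 320 Hz, well above typical speech F0:

import numpy as np
import torch

pitch_bin_weights = torch.linspace(np.log(50), np.log(1000), 64).exp()
flat = torch.full((64,), 1.0 / 64)           # near-uniform softmax output
print((flat * pitch_bin_weights).sum())      # ~320 Hz before any training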

Did the pitch encoder work well in your case?

[image attachment]

I presume that with a softmax it is hard to produce a spiky distribution around the target F0 value, so I changed the F0 estimation process as below. As a result, the pitch and periodic amplitude values are now reflected in the synthesized audio.

# np.log(2) for a gentle slope, threshold 50 Hz for convenience...
F0 = self.exp_sigmoid(F0, log_exponent=np.log(2), max_value=950, threshold=50)
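For reference, a sketch of the exp_sigmoid I assume here, modeled after the DDSP-style exponentiated sigmoid; the exact form in the code above may differ:

import numpy as np
import torch

def exp_sigmoid(x, log_exponent=np.log(2.0), max_value=950.0, threshold=50.0):
    # Strictly positive output bounded in (threshold, threshold + max_value);
    # a small log_exponent gives the gentle slope mentioned in the comment.
    return max_value * torch.sigmoid(x) ** log_exponent + threshold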

[image attachment]

Thank you.

pranavmalikk commented 1 year ago

@LEECHOONGHO were you able to make any further progress on this? I've tried changing the loss function and the activation function, but I could not get a solid reconstruction or get the pitch to converge nicely.

Edit: I started a run on 8x A100 GPUs with a batch size of 32; I can post an update at 700k steps if anyone is interested.

talipturkmen commented 1 year ago

Hello @pranavmalikk, I spent a lot of time and compute trying to train this model. I think it's very data dependent. Since the authors train on private data, it's impossible to replicate their results. I'd suggest you look for different papers to implement; this can be a complete waste of time, as it was for me.

LEECHOONGHO commented 1 year ago

Hello, @talipturkmen @pranavmalikk . I'm sorry for the late response.

In my case, training the pitch estimation system as claimed in the paper did not work by any means. The only way to properly train the synthesizer modules was to feed the ground-truth pitch value, measured with the WORLD vocoder, to the timbre encoder, frame-level synthesizer, and waveform synthesizer, and to train the pitch estimator with an MSE loss between its output and the GT pitch and GT pitch confidence, like FastSpeech 2.
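A minimal sketch of that kind of ground-truth pitch extraction, assuming the pyworld binding of the WORLD vocoder; the frame period and the crude voiced/unvoiced confidence are my own choices:

import numpy as np
import pyworld as pw

def extract_gt_pitch(wav: np.ndarray, sr: int, hop_ms: float = 10.0):
    # WORLD expects float64 audio; DIO gives a coarse F0 track and
    # StoneMask refines it. Unvoiced frames come back as 0 Hz.
    x = wav.astype(np.float64)
    f0, t = pw.dio(x, sr, frame_period=hop_ms)
    f0 = pw.stonemask(x, f0, t, sr)
    confidence = (f0 > 0).astype(np.float32)  # voiced/unvoiced mask as "confidence"
    return f0.astype(np.float32), confidence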

And masking is required for the timbre encoder's AttentiveStatisticsPooling and Attention modules.
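A minimal sketch of the kind of masking meant here, assuming attention logits of shape [B, 1, T] and per-utterance lengths (all names are mine, not from the repository):

import torch

def masked_attentive_stats_pool(x, attn_logits, lengths):
    # x: [B, C, T] features, attn_logits: [B, 1, T], lengths: [B] valid frame counts.
    # Padded frames must not contribute, so mask them out before the time softmax.
    B, C, T = x.shape
    mask = torch.arange(T, device=x.device)[None, :] < lengths[:, None]   # [B, T]
    attn_logits = attn_logits.masked_fill(~mask[:, None, :], float('-inf'))
    w = torch.softmax(attn_logits, dim=-1)                                # [B, 1, T]
    mean = (w * x).sum(dim=-1)                                            # [B, C]
    var = (w * (x - mean[..., None]) ** 2).sum(dim=-1)
    return torch.cat([mean, var.clamp_min(1e-9).sqrt()], dim=-1)          # [B, 2C]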

With the modifications I explained, I finally got a voice-conversion-like model, but the sound quality is a bit off. So I'm planning to change the waveform synthesizer or switch the loss functions to the ones from the Avocodo paper.

choiHkk commented 1 year ago

The formula on page 3 of the paper should not be applied as written. You must apply the semitone conversion to F_{0}^{2} to transform the scale, and then apply log2, for it to work.
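As I understand that suggestion (my own reading, not a quote from the paper), it means comparing pitches on the semitone scale rather than on raw log2 frequency; f_ref below is an arbitrary reference of my choosing:

import torch

def to_semitones(f0: torch.Tensor, f_ref: float = 55.0) -> torch.Tensor:
    # 12 semitones per octave, so a d-bin shift of a 24-bins-per-octave CQT
    # corresponds to d / 2 semitones on this scale.
    return 12.0 * torch.log2(f0 / f_ref)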

LEECHOONGHO commented 1 year ago

@choiHkk Hello. Did you by any chance succeed in training a NANSY++ model with the same architecture as the paper?

choiHkk commented 1 year ago

@LEECHOONGHO As far as I can tell, training does not work properly if you implement it exactly as described in the paper. The relationship between the pitch encoder's 1d convolution layer architecture and its output dimension is ambiguous, and the objective functions for the vocoder and training pipeline are not properly explained, so applying Parallel WaveGAN as-is makes training unstable.

pranavmalikk commented 1 year ago

@choiHkk @LEECHOONGHO have you attempted to re-train this? I'm thinking of getting back into it. Other than "applying semitone to F_{0}^{2} to transform the scale before applying log2," is there anything else I need to look at? Maybe I need to look at the 1D convolution layers of the pitch encoder, as @choiHkk stated?

choiHkk commented 1 year ago

@pranavmalikk If you apply the loss function to the module implemented by the repository owner, it will learn f0, but the f0 median will be high during initial training. If you use it in parallel with another estimator such as YAAPT or pysptk, you can make it work stably during initial training. However, this method was not intended by the authors of the paper, so it may not be the correct one.
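A sketch of that kind of auxiliary estimate, using pysptk's SWIPE as the external pitch tracker; the hop size and search range are my own assumptions:

import numpy as np
import pysptk

def external_f0(wav: np.ndarray, sr: int, hop: int = 256) -> np.ndarray:
    # SWIPE pitch track, usable as a guide for the pitch encoder early in training.
    return pysptk.swipe(wav.astype(np.float64), fs=sr, hopsize=hop,
                        min=50.0, max=500.0, otype="f0")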