Whether frame-level loss is used?

sky1456723 / Pytorch-MBNet

A pytorch implementation of MBNET: MOS PREDICTION FOR SYNTHESIZED SPEECH WITH MEAN-BIAS NETWORK


Whether frame-level loss is used? #2

Closed liushenme closed 3 years ago

liushenme commented 3 years ago

Hi,

In the paper MBNET: MOS PREDICTION FOR SYNTHESIZED SPEECH WITH MEAN-BIAS NETWORK, the authors say that the frame-level loss borrowed from MOSNet was used, but I can't find it in your code. I reimplemented the frame-level loss myself and found that it does not work well. So I would like to ask whether you have used this loss and how effective it is.

Liu

sky1456723 commented 3 years ago

Hi liushenme, in lines 112 to 122 of train.py, the model output has shape (batch, seq_len) (lines 116 and 117), and the label used to compute the MSE is repeated to shape (batch, seq_len). So the MSE is computed per frame, which means the frame-level loss is in fact used.

If you think that there are some mistakes in these lines, please let me know. Thank you.
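To make the shape argument concrete, here is a minimal dependency-free sketch of a frame-level MSE in which the utterance label is repeated for every frame. It is not the code from train.py: the function name `frame_level_mse`, the nested-list layout standing in for a (batch, seq_len) tensor, and the numbers are all assumptions made for illustration.

```python
def frame_level_mse(frame_scores, utterance_labels):
    """Mean squared error between each frame score and its utterance's label.

    frame_scores: list of per-utterance lists, i.e. shape (batch, seq_len)
    utterance_labels: list of floats, i.e. shape (batch,)
    """
    total, count = 0.0, 0
    for scores, label in zip(frame_scores, utterance_labels):
        # "Repeating" the label: every frame of the utterance shares one target.
        repeated = [label] * len(scores)
        for pred, target in zip(scores, repeated):
            total += (pred - target) ** 2
            count += 1
    return total / count


# Hypothetical values: 2 utterances, 3 frames each.
frame_scores = [[3.0, 4.0, 5.0], [2.0, 2.0, 2.0]]
labels = [4.0, 2.0]
loss = frame_level_mse(frame_scores, labels)
```

The point is that the prediction varies along the time axis while only the target is constant, so each frame contributes its own error term.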

liushenme commented 3 years ago

Hi, I think your frame-level loss code has a mistake. Your model only outputs the utterance-level score, not frame-level scores, but the frame-level loss in MOSNet needs the frame-level scores. So I think your code just repeats the utterance-level score many times.

sky1456723 commented 3 years ago

The model output "mean score" in line 110 is the output of the LSTM and DNN according to model.py, so I think that the model output is frame-level (or at least time-dependent). The "repeat" operation in line 116 is applied to the label. According to MOSNet (https://arxiv.org/pdf/1904.08352.pdf) eq. 1, I think the operation is correct. Am I misunderstanding the frame-level loss?

(I cleaned up the code a few hours ago, so the line numbers may have shifted a little.)

liushenme commented 3 years ago

Thanks for your reply. I re-read your code, and I think the frame-level loss is correct. But the first part of MOSNet (https://arxiv.org/pdf/1904.08352.pdf) eq. 1, the utterance-level loss, is not used in your code. Have you tried using the two losses together, as in MOSNet?
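For reference, this is a rough sketch of what I mean by using the two losses together. It reflects my reading of the MOSNet objective (an utterance-level squared error plus a weighted frame-level squared error), not code from this repository. The function name `mosnet_style_loss`, the default `alpha=1.0`, and the mean pooling used to get the utterance-level prediction are assumptions for illustration.

```python
def mosnet_style_loss(frame_scores, utterance_labels, alpha=1.0):
    """Utterance-level MSE plus alpha times frame-level MSE, averaged over utterances.

    frame_scores: list of per-utterance lists of frame predictions
    utterance_labels: list of floats, one MOS label per utterance
    alpha: weight of the frame-level term (hypothetical default)
    """
    per_utterance = []
    for scores, label in zip(frame_scores, utterance_labels):
        # Assumption: the utterance-level prediction is the mean of the frame scores.
        utt_pred = sum(scores) / len(scores)
        utt_term = (utt_pred - label) ** 2
        # Frame-level term: every frame is compared with the same utterance label.
        frame_term = sum((s - label) ** 2 for s in scores) / len(scores)
        per_utterance.append(utt_term + alpha * frame_term)
    return sum(per_utterance) / len(per_utterance)


# Hypothetical values: one utterance whose frames are all 1.0 below the label.
combined = mosnet_style_loss([[3.0, 3.0, 3.0]], [4.0], alpha=1.0)
frame_only_part = mosnet_style_loss([[3.0, 4.0, 5.0]], [4.0], alpha=1.0)
```

In the second call the frame scores average exactly to the label, so the utterance-level term is zero and only the frame-level term is left. That shows the two terms penalize different things.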

sky1456723 commented 3 years ago

I have not tried the utterance-level loss, since MBNet uses the frame-level loss in the paper. I think that modeling judge bias is the reason why MBNet outperforms MOSNet, so my guess is that adding the utterance-level loss would not improve MBNet much.

liushenme commented 3 years ago

I see, thank you.