lucidrains / naturalspeech2-pytorch

Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch
MIT License

Duration pitch loss not used. #34

Open jaykim9870 opened 7 months ago

jaykim9870 commented 7 months ago

Hello, I was looking through your code, and it seems that the code never takes duration_pitch_loss into account.

https://github.com/lucidrains/naturalspeech2-pytorch/blob/659bec7f7543e7747e809e950cc2f84242fbeec7/naturalspeech2_pytorch/naturalspeech2_pytorch.py#L1522

It might be related to the aux_loss you defined.

https://github.com/lucidrains/naturalspeech2-pytorch/blob/659bec7f7543e7747e809e950cc2f84242fbeec7/naturalspeech2_pytorch/naturalspeech2_pytorch.py#L1600

Thanks for the great work!

wonwooo commented 6 months ago

@jaykim9870 I have the same question. You're suggesting that the code should be changed as shown below, right?

Before: return loss + (self.rvq_cross_entropy_loss_weight * ce_loss) + duration_pitch_loss

Fixed: return loss + (self.rvq_cross_entropy_loss_weight * ce_loss) + aux_loss
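
To make the difference between the two return statements concrete, here is a minimal sketch in plain Python. It assumes that duration_pitch_loss is initialized to zero and never reassigned, while the weighted duration and pitch losses are accumulated into aux_loss. The function name, its arguments, and the weight names are hypothetical and are not taken from the repository.

```python
def combine_losses(diffusion_loss, ce_loss, duration_loss, pitch_loss,
                   ce_weight=1.0, duration_weight=1.0, pitch_weight=1.0,
                   use_aux_loss=True):
    """Hypothetical stand-in for the final loss combination discussed above."""
    # Assumption: this placeholder starts at zero and is never updated,
    # so adding it to the total contributes nothing.
    duration_pitch_loss = 0.0

    # Assumption: the weighted duration and pitch losses live in aux_loss.
    aux_loss = duration_loss * duration_weight + pitch_loss * pitch_weight

    # "Before" adds the untouched placeholder; "fixed" adds aux_loss.
    extra = aux_loss if use_aux_loss else duration_pitch_loss
    return diffusion_loss + ce_weight * ce_loss + extra


before = combine_losses(1.0, 0.5, 0.25, 0.25, use_aux_loss=False)
fixed = combine_losses(1.0, 0.5, 0.25, 0.25, use_aux_loss=True)
print(before, fixed)
```

Under these assumptions, the "before" total ignores the duration and pitch terms entirely, so the duration and pitch predictors would receive no gradient from this loss.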

jaykim9870 commented 6 months ago

@wonwooo Yes, that would do.

FYI, there are some other issues as well, such as the WaveNet-based diffusion model, whose size is very different from the one in the original paper. As far as I have investigated, the model architecture differs enough that it may affect model performance. If you are building on this project, you may want to check those out too!