Issue with vggish checkpoint

v-iashin / SpecVQGAN

Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)

https://v-iashin.github.io/SpecVQGAN

MIT License

Issue with vggish checkpoint #13

Open luc-leonard opened 2 years ago

luc-leonard commented 2 years ago

Hello.

the vggishish_lpaps checkpoint is used here:

Errors are ignored in the code, but neither lpaps nor vggishish manages to load it.

The checkpoint URL is here: https://github.com/v-iashin/SpecVQGAN/blob/eee222d8351df9b6314db69185d5ce8ca55b50c8/specvqgan/util.py#L8

The vggish weights can be found under the 'model' key, but I cannot find the lpaps weights anywhere in the checkpoint. Are they not required?
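One way to make this kind of mismatch visible is to compare the parameter names a module expects with the names stored in the checkpoint. The following is a minimal sketch in plain Python; the parameter names, the `net.` prefix, and the `compare_keys` helper are hypothetical and only illustrate the idea, they are not taken from the repository.

```python
def compare_keys(expected_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names, each sorted."""
    expected = set(expected_keys)
    found = set(checkpoint_keys)
    missing = sorted(expected - found)      # the module wants these, the file lacks them
    unexpected = sorted(found - expected)   # the file has these, the module does not
    return missing, unexpected


# Hypothetical names, for illustration only.
expected = ["net.features.0.weight", "lin0.model.1.weight", "lin1.model.1.weight"]
checkpoint = {"model": {"features.0.weight": [0.1], "features.0.bias": [0.0]}}

# Assumption: the wrapper stores the backbone under a "net." prefix.
prefixed = ["net." + name for name in checkpoint["model"]]
missing, unexpected = compare_keys(expected, prefixed)

print("missing:", missing)
print("unexpected:", unexpected)
```

Any name reported as missing keeps whatever value it had at initialization, so a loader that swallows errors leaves those parameters untouched without saying so.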

Best regards,

v-iashin commented 2 years ago

Hi, I checked the code and I think you are right! Thanks a lot for the catch! I will commit the fixes.

luc-leonard commented 2 years ago

Thank you very much for the very quick answer and fix :D

jwliu-cc commented 2 years ago

The loss goes to 'nan' when I load the correct ckpt. Do you have this problem? I trained on the VAS dataset.

yangdongchao commented 2 years ago

Hi, I want to ask about the parameters of lpaps. The vggishish16 model is trained on VGGSound. How did you obtain the parameters of the following layers? Did you directly use the pre-trained model from taming-transformers?

self.lin0 = NetLinLayer(self.chns[0], use_dropout=use_dropout)
self.lin1 = NetLinLayer(self.chns[1], use_dropout=use_dropout)
self.lin2 = NetLinLayer(self.chns[2], use_dropout=use_dropout)
self.lin3 = NetLinLayer(self.chns[3], use_dropout=use_dropout)
self.lin4 = NetLinLayer(self.chns[4], use_dropout=use_dropout)

v-iashin commented 2 years ago

You may train them by adapting the training script from https://github.com/richzhang/PerceptualSimilarity.

yangdongchao commented 2 years ago

Can you share the code you used to train lpaps on the VGGSound dataset?

v-iashin commented 2 years ago

OK, I managed to look into this issue a bit more.

Thanks to your questions I discovered that this problem is actually deeper than I originally anticipated. It seems that I completely missed that NetLinLayer layers have trainable parameters and only relied on training VGGishish. I think because the code did not complain about loading the checkpoint, as the topic starter noticed, I just moved on.

What happens is that these layers are actually randomly initialized and, luckily, the model could still train to such great quality thanks to the GAN loss. This means that you can just drop the perceptual loss from the model, and it will train much faster and reach the same performance. On the practical side, it seems that keeping this dorky loss may still give a small boost in quality.
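To illustrate why the weights of these layers matter, here is a rough sketch of an LPIPS-style distance in plain Python. It assumes each linear layer acts as a per-channel weighting of the squared feature difference, followed by a spatial average and a sum over layers. It omits the channel normalization used in LPIPS, and the function name and toy features are hypothetical.

```python
def weighted_feature_distance(feats_x, feats_y, lin_weights):
    """feats_*: list of layers; each layer is a list of channels;
    each channel is a list of activations over positions.
    lin_weights: one list of per-channel weights per layer."""
    total = 0.0
    for layer_x, layer_y, weights in zip(feats_x, feats_y, lin_weights):
        n_positions = len(layer_x[0])
        layer_sum = 0.0
        for w, chan_x, chan_y in zip(weights, layer_x, layer_y):
            # Weighted squared difference, summed over positions.
            layer_sum += w * sum((a - b) ** 2 for a, b in zip(chan_x, chan_y))
        total += layer_sum / n_positions  # spatial average for this layer
    return total


# Toy features: one layer, two channels, two positions.
fx = [[[1.0, 2.0], [0.0, 0.0]]]
fy = [[[0.0, 2.0], [0.0, 2.0]]]

d_uniform = weighted_feature_distance(fx, fy, [[1.0, 1.0]])
d_other = weighted_feature_distance(fx, fy, [[1.0, 0.0]])
print(d_uniform, d_other)
```

The same pair of inputs gets a different distance under different channel weights, so with untrained weights the loss still penalizes feature differences, but with an arbitrary emphasis across channels.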

yangdongchao commented 2 years ago

Thanks for your reply. I understand it.

v-iashin commented 2 years ago

Today I had a chance to inspect the issue a bit more thanks to @jhyau.

It seems that @jwliu-cc was right and these fixes cause codebook training to diverge to NaNs. For this reason, I am resetting the commits mentioned in this issue to the initial well-tested state, despite the nasty bug with vggish and lpaps checkpoint loading 🙁.

Current solution: perceptual_weight=0.0

This means that those who want to build upon SpecVQGAN could turn off the perceptual loss by setting the weight to zero and benefit from a significant speedup during training. This, however, would yield slightly different results which, according to our ablations, are still strong.
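As a rough illustration of where the speedup can come from, the sketch below combines a reconstruction term with an optional perceptual term and only evaluates the perceptual function when its weight is positive. Whether the repository skips the evaluation in exactly this way is an assumption; `reconstruction_objective` and `fake_perceptual` are hypothetical names used for the example.

```python
def reconstruction_objective(x, x_rec, perceptual_fn, perceptual_weight=1.0):
    """Mean absolute error plus an optional weighted perceptual term."""
    loss = sum(abs(a - b) for a, b in zip(x, x_rec)) / len(x)
    if perceptual_weight > 0.0:
        # The (expensive) perceptual network runs only when it contributes.
        loss += perceptual_weight * perceptual_fn(x, x_rec)
    return loss


calls = []

def fake_perceptual(x, x_rec):
    """Stand-in for a perceptual network; records each call."""
    calls.append(1)
    return 10.0


with_term = reconstruction_objective([1.0, 2.0], [1.5, 2.0], fake_perceptual, 1.0)
without_term = reconstruction_objective([1.0, 2.0], [1.5, 2.0], fake_perceptual, 0.0)
print(with_term, without_term, len(calls))
```

With the weight set to zero, the stand-in network is never called, which is the kind of saving a forward pass through a VGG-like model would give on every training step.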

I also added a notice about it to the README for other people to see.