eatsleepraverepeat opened 3 years ago
First, I would like to echo the kudos for publishing this port of VGGish. I am implementing a Fréchet Audio Distance (FAD) library and will definitely make use of it.
For anyone else who arrives here looking for a workaround, the final ReLU can be removed from the pretrained VGGish model with the following snippet:
import torch as pt

vggish = pt.hub.load("harritaylor/torchvggish", "vggish")
vggish.embeddings = pt.nn.Sequential(*list(vggish.embeddings.children())[:-1])
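The slicing trick above can be sanity-checked without downloading the model by applying it to a toy Sequential that mimics the structure of the embedding head (the layer sizes here are hypothetical, not VGGish's actual ones):

```python
import torch

# Toy stand-in for vggish.embeddings: Linear layers interleaved with ReLUs.
# The sizes (8 -> 8 -> 4) are made up for illustration only.
embeddings = torch.nn.Sequential(
    torch.nn.Linear(8, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 4),
    torch.nn.ReLU(),  # trailing ReLU clamps embeddings to be non-negative
)

# Same workaround as above: rebuild the Sequential without its last child.
pre_activation = torch.nn.Sequential(*list(embeddings.children())[:-1])

x = torch.randn(3, 8)
post = embeddings(x)      # non-negative, because of the trailing ReLU
pre = pre_activation(x)   # pre-activation values, may be negative

assert (post >= 0).all()
assert torch.equal(post, torch.relu(pre))
```

Because both Sequentials share the same underlying modules and weights, applying ReLU to the pre-activation output exactly reproduces the original output, which confirms that only the final activation was removed.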
Hello there,
While comparing this code with the implementation in tensorflow/models, I found that the two use different layers as the output of the VGGish model (if the activation is counted as a separate layer):
yours: https://github.com/harritaylor/torchvggish/blob/46701162fd6b3684b6f6cf3b1afda100073850ae/torchvggish/vggish.py#L19
google's: https://github.com/tensorflow/models/blob/f32dea32e3e9d3de7ed13c9b16dc7a8fea3bd73d/research/audioset/vggish/vggish_slim.py#L104-L106 (activation_fn=None)
This is also mentioned in the README.
Changing the output layer of VGGish in your implementation to the pre-activation one (i.e. without the ReLU) makes the embeddings (almost) equal in both cases, raw as well as PCA'd.
Thanks for the port, great work!