harritaylor / torchvggish

Pytorch port of Google Research's VGGish model used for extracting audio features.
Apache License 2.0

Pre-activation as output of VGGish #24

Open eatsleepraverepeat opened 3 years ago

eatsleepraverepeat commented 3 years ago

Hello there,

When comparing this code to the one in tensorflow/models, I found that the two implementations use different layers as the output of the VGGish model (if the activation is counted as a separate layer):

yours: https://github.com/harritaylor/torchvggish/blob/46701162fd6b3684b6f6cf3b1afda100073850ae/torchvggish/vggish.py#L19

google's: https://github.com/tensorflow/models/blob/f32dea32e3e9d3de7ed13c9b16dc7a8fea3bd73d/research/audioset/vggish/vggish_slim.py#L104-L106 (activation_fn=None)

This is also mentioned in the upstream README:

Note that the embedding layer does not include a final non-linear activation, so the embedding value is pre-activation

Changing the output layer of VGGish in your implementation to the pre-activation one (i.e. dropping the final ReLU) makes the embeddings from the two implementations (almost) equal, both raw and PCA'ed.
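For reference, here is a minimal sketch of an alternative that reads the pre-activation values without editing the model: register a forward hook on the last Linear layer of the embeddings head. The index below assumes the layer ordering in this repo's vggish.py; treat it as illustrative, not an official API:

import torch

model = torch.hub.load("harritaylor/torchvggish", "vggish")
model.eval()

captured = {}

def save_pre_activation(module, inputs, output):
    # output of the final Linear(4096, 128), i.e. the embedding before the trailing ReLU
    captured["embedding"] = output.detach()

# model.embeddings ends in [..., Linear(4096, 128), ReLU()], so index -2 is that Linear
list(model.embeddings.children())[-2].register_forward_hook(save_pre_activation)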

Thanks for porting it though, great work!

brentspell commented 2 years ago

First, I would like to echo the kudos for publishing this port of VGGish. I am implementing a Fréchet Audio Distance (FAD) library and will definitely make use of it.
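For context, FAD fits a Gaussian to the VGGish embeddings of a reference set and of an evaluation set, then computes the Fréchet distance between the two Gaussians. A minimal numpy/scipy sketch of that computation (my own illustration, not code from any particular FAD library):

import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_a, emb_b):
    # emb_a, emb_b: [n_examples, 128] arrays of VGGish embeddings
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Frechet distance between two Gaussians:
    # ||mu_a - mu_b||^2 + tr(cov_a + cov_b - 2 * sqrtm(cov_a @ cov_b))
    covmean = linalg.sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard numerical imaginary noise
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))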

For anyone else who arrives here looking for a workaround, the final ReLU can be removed from the pretrained VGGish model with the following snippet:

import torch as pt

# drop the trailing ReLU so the embedding head returns pre-activation values
vggish = pt.hub.load("harritaylor/torchvggish", "vggish")
vggish.embeddings = pt.nn.Sequential(*list(vggish.embeddings.children())[:-1])
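After this change, running inference as in the README (e.g. embeddings = vggish.forward("example.wav")) should return the 128-D pre-activation embeddings, matching the output Google's code produces with activation_fn=None.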