Is there an implementation of this anywhere that can be used to ouput embeddings of audio using any of th epretrained models, rather than classifications, so we could use these to train our own classifiers (e.g random forests) using these embeddings? Similar to how you can easily get a 128 embedding using VGGish.
Is there an implementation of this anywhere that can be used to ouput embeddings of audio using any of th epretrained models, rather than classifications, so we could use these to train our own classifiers (e.g random forests) using these embeddings? Similar to how you can easily get a 128 embedding using VGGish.