resemble-ai / Resemblyzer

A python package to analyze and compare voices with deep learning
Apache License 2.0
2.66k stars, 419 forks

Cosine similarity is inconsistent with the cluster #42

Closed: tranctan closed this issue 3 years ago

tranctan commented 3 years ago

Hi, when I visualized the voices, one sample (a female voice) landed far away from the male speaker's utterances, which is expected.

However, when I compute the cosine similarity between the female utterance and the male ones, the value is quite high (0.88). I am not sure whether I am computing the cosine similarity correctly here:

embed_1 = encoder.embed_utterance(y1)
embed_2 = encoder.embed_utterance(y2)
cosine_sim = embed_1 @ embed_2

Any help is very much appreciated!
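For reference, a plain dot product equals the cosine similarity only when both vectors have unit length. As far as I know, Resemblyzer's utterance embeddings are L2-normalized, so the snippet above should be valid, but a version that normalizes explicitly rules this out as the cause. This is a minimal sketch; the helper name `cosine_similarity` is hypothetical and not part of Resemblyzer.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors, normalizing explicitly.

    Hypothetical helper: it does not assume the inputs are unit length.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

If `cosine_similarity(embed_1, embed_2)` matches `embed_1 @ embed_2`, the similarity computation is fine and the problem lies upstream, in how the audio was prepared.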

tranctan commented 3 years ago

I just figured out by chance that if you load the audio into a NumPy array yourself (with librosa or scipy) before passing it to the preprocess_wav() function in the resemblyzer.audio module, you need to make sure the data is resampled to 16,000 Hz first. Alternatively, you can pass the path of the wav file to preprocess_wav() and let it load the audio itself.

The fix is trivial, but the mistake was really hard to find.
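The fix described above could be sketched roughly as follows. The helpers `ensure_16k` and `embed_array` are hypothetical names, not part of Resemblyzer. The sketch assumes a 1-D mono waveform and that librosa is installed for the resampling step.

```python
import numpy as np

TARGET_SR = 16_000  # sampling rate this thread says the audio must have


def ensure_16k(wav, source_sr):
    """Return `wav` as float32 samples at TARGET_SR, resampling if needed.

    Hypothetical helper; assumes `wav` is a 1-D mono waveform.
    """
    wav = np.asarray(wav, dtype=np.float32)
    if source_sr == TARGET_SR:
        return wav
    import librosa  # imported lazily: only needed when resampling
    return librosa.resample(wav, orig_sr=source_sr, target_sr=TARGET_SR)


def embed_array(encoder, wav, source_sr):
    """Embed an in-memory waveform after bringing it to 16 kHz (hypothetical helper)."""
    from resemblyzer import preprocess_wav
    return encoder.embed_utterance(preprocess_wav(ensure_16k(wav, source_sr)))
```

Note that `librosa.load` resamples to 22,050 Hz by default, so an array loaded that way is not at 16 kHz either. Loading with `sr=None` keeps the file's native rate, which can then be passed as `source_sr`.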