philipperemy / deep-speaker

Deep Speaker: an End-to-End Neural Speaker Embedding System.
MIT License

Inquiry on embedding extractions for voice comparisons #105

Open PhilipAmadasun opened 10 months ago

PhilipAmadasun commented 10 months ago

If I want my program to recognize the voice of someone whose embedding I've already stored, is it better for the stored embedding to be extracted from a short or a long .wav of the person speaking, so that the model has an easier time identifying the voice correctly (at least a 0.75 to 0.8 probability match)? Or does the length not matter, to some extent (for instance, a 2-minute .wav file versus a 5-minute or 10-minute .wav file)? I want to compare the stored embedding against an embedding of the person speaking for, say, 5 seconds, 10 seconds, and longer.

I'm using the pre-trained model ResCNN_triplet_training_checkpoint_265.h5. Also, how does this model handle noise?

philipperemy commented 10 months ago

@PhilipAmadasun the model was trained on the LibriSpeech dataset: https://www.openslr.org/12.

The utterances were (mostly) recorded without noise and were short (up to 10 seconds, if I remember correctly).

I forget the details, but some wav files were cut to fit the model.

So what you can do is:

PhilipAmadasun commented 10 months ago

@philipperemy Just to make sure I understand this particular aspect of my inquiry: the length of the .wav files the embeddings are extracted from does not matter?

For example, from the same person, I compare the embedding of a 20-second recording of their voice with the embedding extracted from a 10-minute recording of their voice and get a 0.8 probability match. Then I compare embeddings from the 20-second recording with a 5-second recording, then a 2-minute recording. I should still get around 0.8 probability matches (ideally)?

If this is indeed the case, then the lengths of the audio don't matter; and if the length doesn't matter, then what properties of the audio files do matter (besides noise)?

If the same person talks with a higher pitch in one .wav file than in the other, would there still be a strong probability match between the two? Most likely not, right? I'm just trying to figure out which properties of the voice/.wav file actually matter for batch_cosine_similarity to give a strong cosine similarity match. I think this would help me figure out my other lines of questioning. I hope this question makes sense?

philipperemy commented 10 months ago

@PhilipAmadasun Yes it makes sense. I checked the code.

The model was trained on samples of 1.6 seconds (clear speech).

If you want the most robust result at inference time, for any speaker, you should sample many 1.6-second wav segments and average their embeddings. This average will be the speaker vector.

The way to test it would be:

https://github.com/philipperemy/deep-speaker/blob/master/deep_speaker/constants.py#L17C1-L17C88
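As a rough sketch of that sampling-and-averaging idea, using the calls shown in this repo's README (read_mfcc, sample_from_mfcc, DeepSpeakerModel); the average_embedding() helper, the file path and the crop count below are only illustrative:

    import numpy as np
    from deep_speaker.audio import read_mfcc
    from deep_speaker.batcher import sample_from_mfcc
    from deep_speaker.constants import SAMPLE_RATE, NUM_FRAMES
    from deep_speaker.conv_models import DeepSpeakerModel

    model = DeepSpeakerModel()
    model.m.load_weights('ResCNN_triplet_training_checkpoint_265.h5', by_name=True)

    def average_embedding(wav_path, num_crops=20):
        # sample_from_mfcc() picks a random window of NUM_FRAMES (~1.6 s),
        # so calling it repeatedly gives several 1.6 s views of the recording.
        mfcc = read_mfcc(wav_path, SAMPLE_RATE)
        crops = np.array([sample_from_mfcc(mfcc, NUM_FRAMES) for _ in range(num_crops)])
        embeddings = model.m.predict(crops)  # shape: (num_crops, embedding_dim)
        return np.mean(embeddings, axis=0, keepdims=True)  # the speaker vector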

To answer your questions:

The length of the .wav files the embeddings are extracted from does not matter?

It does. cf. my answers below.

For example, from the same person, I compare the embedding of a 20-second recording of their voice with the embedding extracted from a 10-minute recording of their voice and get a 0.8 probability match. Then I compare embeddings from the 20-second recording with a 5-second recording, then a 2-minute recording. I should still get around 0.8 probability matches (ideally)?

Yes, ideally, but you should make the comparisons on the same segment length, which is around 1-2 seconds. If the recording is 1 minute, you can sample 60 files of 1 second and average them. If the recording is 20 seconds, you can sample 20 times. Then you can compare the vector of the 1-minute recording with the vector of the 20-second recording (see the sketch below).
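Concretely, with a helper like the average_embedding() sketch above (a hypothetical helper, not part of the library; file names are made up), that comparison would look something like:

    from deep_speaker.test import batch_cosine_similarity

    vec_1min = average_embedding('speaker_A_1min.wav', num_crops=60)
    vec_20s = average_embedding('speaker_A_20s.wav', num_crops=20)
    print(batch_cosine_similarity(vec_1min, vec_20s))  # one cosine score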

If this is indeed the case, then the lengths of the audio don't matter; and if the length doesn't matter, then what properties of the audio files do matter (besides noise)?

Yes, it does matter, because the longer the recording is, the more stable the speaker vector should be. Indeed, the more files we average, the more consistent the vector estimate should be.

If the same person talks with a higher pitch in one .wav file than in the other, would there still be a strong probability match between the two? Most likely not, right? I'm just trying to figure out which properties of the voice/.wav file actually matter for batch_cosine_similarity to give a strong cosine similarity match. I think this would help me figure out my other lines of questioning. I hope this question makes sense?

Most likely not, I'd say. That's why averaging across multiple recordings might be the best way to really capture the voice properties of the speaker.

Also, make sure you use wav files with a sampling rate of 16,000 Hz.
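One way to check and enforce that, assuming librosa is installed (file name illustrative):

    import librosa

    # librosa resamples on load when sr is given; 16,000 Hz matches SAMPLE_RATE.
    audio, sr = librosa.load('my_recording.wav', sr=16000, mono=True)
    assert sr == 16000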

PhilipAmadasun commented 10 months ago

@philipperemy When you say "averaging", do you literally mean element-wise averaging? On a different note, how do I make sure deep-speaker has CUDA access? Is there a way of knowing?

philipperemy commented 10 months ago

@PhilipAmadasun yeah a simple np.sum(x, axis=0).
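For what it's worth, summing and averaging give the same cosine-similarity score, since the 1/N factor cancels in the normalization; a quick sanity check:

    import numpy as np

    x = np.random.rand(20, 512)  # 20 stacked embeddings (dimension illustrative)
    v_sum, v_mean = np.sum(x, axis=0), np.mean(x, axis=0)
    cos = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    y = np.random.rand(512)
    print(np.isclose(cos(v_sum, y), cos(v_mean, y)))  # True: the scale cancels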

It relies on keras/tensorflow. https://stackoverflow.com/questions/38009682/how-to-tell-if-tensorflow-is-using-gpu-acceleration-from-inside-python-shell.
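With TensorFlow 2.x, the check from that thread boils down to:

    import tensorflow as tf

    # A non-empty list means TensorFlow can see a CUDA-capable GPU.
    print(tf.config.list_physical_devices('GPU'))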

PhilipAmadasun commented 10 months ago

@philipperemy I have some issues with tensorflow, so I'm gonna create another issue for it. Please still keep this issue open as I do my tests.

philipperemy commented 10 months ago

okay cool.

PhilipAmadasun commented 10 months ago

@philipperemy I might place this inquiry in the new issue I raised, but I thought I would briefly ask here. Is it possible to replace tensorflow with straight pytorch for deep-speaker? Or was there some specific reason you used tensorflow?

philipperemy commented 10 months ago

It would require a lot of work to port it to pytorch. So I'd say it's not possible. At that time, pytorch did not exist I guess lol

PhilipAmadasun commented 9 months ago

@philipperemy Is there a chance that you are working on a way for deep-speaker to handle simultaneous cosine similarity calculations? As in, let's say the user wants to compare a voice embedding with several saved voice embeddings at once. Does my question make sense?

philipperemy commented 9 months ago

@PhilipAmadasun Oh I see. You just need to compute them one by one and average the result. If your user is x and you want to compare with y1, y2, ..., y3, you just do:

    np.mean([batch_cosine_similarity(x, y1), batch_cosine_similarity(x, y2), batch_cosine_similarity(x, y3)])

PhilipAmadasun commented 9 months ago

@philipperemy If I want to see whether embedding x matches any of the saved embeddings y1, y2, or y3, you're saying I should use np.mean([batch_cosine_similarity(x, y1), batch_cosine_similarity(x, y2), batch_cosine_similarity(x, y3)])? I'm not sure how that makes sense. Shouldn't I compare x to the saved embeddings individually, then choose which comparisons pass some probability threshold?

philipperemy commented 9 months ago

@PhilipAmadasun Oh yeah, there are a lot of ways to do that. What you're saying makes sense. But imagine you have something like 10,000 y_i: if you take the max() instead of the mean(), you will for sure find one y_i with a high cosine similarity, but that could be just an artifact. I don't have a strong idea of what the best way would be. You have to try multiple methods and see which one works best for your use case.
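A minimal sketch of the threshold-per-speaker variant discussed here (the identify() helper, the dictionary layout and the 0.6 threshold are illustrative, not part of the library):

    import numpy as np
    from deep_speaker.test import batch_cosine_similarity

    def identify(x, saved_embeddings, threshold=0.6):
        # saved_embeddings: {'alice': y1, 'bob': y2, ...}, each of shape (1, dim).
        scores = {name: float(batch_cosine_similarity(x, y)[0])
                  for name, y in saved_embeddings.items()}
        best_name, best_score = max(scores.items(), key=lambda kv: kv[1])
        # With many enrolled speakers, a single high max() can be an artifact,
        # hence the extra threshold check.
        return (best_name, best_score) if best_score >= threshold else (None, best_score)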

PhilipAmadasun commented 9 months ago

@philipperemy I'll look into this more. By the way:

    np.random.seed(123)
    random.seed(123)

Are these important for anything? Do they somehow help comparisons?

philipperemy commented 9 months ago

@PhilipAmadasun Not really, because tensorflow on a GPU does not ensure that the calculations will be exactly the same. So it's pretty useless, actually.

PhilipAmadasun commented 9 months ago

@philipperemy If I wanted deep-speaker to recognize my voice, I suppose it would be better to have saved (and averaged-out) embeddings of my voice in various situations? So one embedding of me yelling, one of me at a higher pitch, one of me a little farther from the microphone than usual, etc. It doesn't seem like just one saved embedding would cut it, but I don't know. Using your averaging technique has slightly improved things, but I am unable to get a cosine similarity higher than 0.6. For context, this is my setup.

NOTE: I have started to look at the code base, and it seems like it can actually be improved. For instance, in audio.py:

    # right_blank_duration_ms = (1000.0 * (len(audio) - offsets[-1])) // self.sample_rate
    # TODO: could use trim_silence() here or a better VAD.
    audio_voice_only = audio[offsets[0]:offsets[-1]]
    mfcc = mfcc_fbank(audio_voice_only, sample_rate)

It seems like it could be modified to use webrtcvad instead, no? (Rough sketch below.)
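A rough sketch of what that could look like, assuming 16 kHz float audio as read in audio.py (the keep_voiced() helper is hypothetical, not part of the code base):

    import numpy as np
    import webrtcvad

    def keep_voiced(audio_float, sample_rate=16000, frame_ms=30, aggressiveness=2):
        # webrtcvad expects 16-bit mono PCM frames of 10, 20 or 30 ms.
        vad = webrtcvad.Vad(aggressiveness)
        pcm16 = (audio_float * 32767).astype(np.int16)
        frame_len = int(sample_rate * frame_ms / 1000)
        voiced = []
        for start in range(0, len(pcm16) - frame_len + 1, frame_len):
            frame = pcm16[start:start + frame_len]
            if vad.is_speech(frame.tobytes(), sample_rate):
                voiced.append(audio_float[start:start + frame_len])
        # Fall back to the original audio if no voiced frames were found.
        return np.concatenate(voiced) if voiced else audio_float

    # The result could then be passed to mfcc_fbank() instead of
    # audio[offsets[0]:offsets[-1]].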

Also, a second note: I'm thinking about this averaging method and don't really understand the logic behind it, most likely because I don't understand vector operations well. I think this actually ties in to what I asked earlier as well. I don't think the averaged-out vector estimate of someone's voice, taken while they are in a neutral emotional state, would match well with the vector estimate of that same person's voice when they are in a heightened emotional state. This is just one scenario I thought of that wouldn't be favorable to this method. I don't even know what method would work for such scenarios, when I'm not getting favorable results in more controlled scenarios.