mvcisback / SSLVC

Sound Source Localization using Visual Cues

Voice activity detection #17

Closed mvcisback closed 9 years ago

mvcisback commented 9 years ago

from #16

(Detect when there is a voice, capture the index, and find the corresponding frame number. Save the voice for, say, 10 sec while the speaker is talking. Make sure there is no silence in the voice. Save voices for each frame, e.g. Frame 10 voice: 1_300 samples, Frame 11: 1_300, ...)

mvcisback commented 9 years ago

I'm considering prototyping this by next Sat.

This would include recording from the microphone with an interactive calibration phase and outputting a model.

Then a second phase would load the model and use it to classify.

My current approach is to split this into 3 scripts:

I think doing it this way opens the possibility of plugging in with lots of tooling.

ghost commented 9 years ago

So, you're going to train your data as well? I guess that makes sense, since you know more about your data.

mvcisback commented 9 years ago

yep. Also the proposal made it seem like there would be a calibration phase anyway, so I figured I'd explicitly do that from the start. I can always save the .mat files from the calibration phase for testing between classifiers.

ghost commented 9 years ago

sounds good to me!

mvcisback commented 9 years ago

This needs to return a frame batch number, probabilities, and a weight based on overall confidence that there was speech.

mvcisback commented 9 years ago

@ramili @ffaghri1 hey guys, sorry I've not been very active (I had a deadline moved up 2 weeks so I've been in a bit of a panic). I've prototyped this in python, and was wondering if I could meet with one of you to port it to matlab.

ghost commented 9 years ago

Same here, not even close to finished coding this face thingy! I can stay longer this Thursday since I'm done with my dept seminar. We can go over your code and convert it to Matlab as well. I can also help Faraz with the GUI if needed.

ffaghri1 commented 9 years ago

Not much progress on gui either. I could certainly use some help. I'm also able to meet on Thursday.

ghost commented 9 years ago

no worries, we got plenty of time. I'll see you guys tomorrow 2:30 same place

ffaghri1 commented 9 years ago

Will be there.

mvcisback commented 9 years ago

The output of this will be a vector classifying each frame of the audio as 0, 1, or 2, where:

  • 0 = silence / noise
  • 1 = speaker 1
  • 2 = speaker 2

mvcisback commented 9 years ago

Hey, just a status update.

I've got the classifier working for online learning. Just started on the pre-processed version.

Here's a snippet from the conversion:

from scipy.io.wavfile import read as wave_read
from tempfile import TemporaryDirectory  # Python 3 (use mkdtemp or a backport on Python 2)
from subprocess import check_output
from os import path

def to_wav(mp4_path):
    # Extract the audio track with ffmpeg; returns (sample_rate, samples).
    with TemporaryDirectory() as d:
        wav_path = path.join(d, "out.wav")
        check_output(["ffmpeg", "-i", mp4_path, wav_path])
        return wave_read(wav_path)

def to_buckets(wav, fps, sample_rate):
    # Reshape the sample array into one row per video frame,
    # dropping any leftover samples at the end.
    bucket_size = sample_rate // fps
    num_buckets = len(wav) // bucket_size
    return wav[:num_buckets*bucket_size].reshape(num_buckets, bucket_size)

I think this should be able to plug into my current classifier. I just need to find the energy. Still in IPython notebooks; I'll check in a self-contained script soon.
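
A minimal sketch of that energy step, assuming each row of the to_buckets output above is one analysis window (log_energy is a hypothetical helper, not something in the repo):

import numpy as np

def log_energy(buckets):
    # One log-energy value per window: sum of squared samples,
    # floored so silent windows don't hit log(0).
    energy = np.sum(buckets.astype(np.float64) ** 2, axis=1)
    return np.log(np.maximum(energy, 1e-10))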

mvcisback commented 9 years ago

So, I've just done the experiment with giving labels by frame. It's correct 77.6% of the time:

  • Audio detection: .96789696
  • Me vs Faraz: .799359658485

It's using a linear SVM for classification.

That said, I was getting much better results with my online tests (~100%). I think it's mainly that the setup was better (talking directly into the mic + no compression), and that the 2nd speaker was female.

I'd like to use a probabilistic method next to bias towards the status quo for speakers.

ghost commented 9 years ago

Are you using MFCC for features? I'll probably get back to you in the next few days about that .mat file I need for spatial sound reconstruction.

mvcisback commented 9 years ago

Just log spectrogram, no feature changes.

I'll try out MFCC

mvcisback commented 9 years ago

Ok, found a good library for MFCC features:

https://github.com/jameslyons/python_speech_features

Should be an easy change
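
A minimal sketch of what the change might look like, assuming the package's current layout (python_speech_features) and its mfcc helper with the default 25 ms windows and 13 coefficients:

from scipy.io.wavfile import read as wave_read
from python_speech_features import mfcc  # pip install python_speech_features

def mfcc_features(wav_path, numcep=13):
    # One row of cepstral coefficients per analysis window.
    rate, samples = wave_read(wav_path)
    return mfcc(samples, samplerate=rate, numcep=numcep)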

mvcisback commented 9 years ago

@ramili Just tried with MFCC features. Voice detection goes up to about 99% using a linear SVM! (It's bloody fast too.)

Speaker classification still hovers around 80% (with the occasional 40%; I think there is some stochastic issue at play...).

Any ideas for next steps? I think it may have to be non-linear.

mvcisback commented 9 years ago

Un-tuned Naive Bayes performs similarly for voice detection, but worse (~50%) for classifying the speaker.

mvcisback commented 9 years ago

SVC with an rbf kernel gives (0.98979298447383557, 0.76237623762376239) (voice detection, speaker classification) using MFCC features.

mvcisback commented 9 years ago

KNN gives similar results.

I'm suspecting that perhaps the input data isn't satisfactory. Tomorrow I'll try to convince my roommate to record some new input data with me and test the online performance again.
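
For reference, a sketch of the comparison being run here (hypothetical helper; assumes scikit-learn's model_selection module, which older releases exposed as cross_validation):

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC, SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

def compare_classifiers(X, y, test_size=0.1):
    # X: (n_windows, n_features) MFCC matrix; y: labels 0/1/2.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    models = {"linear svm": LinearSVC(),
              "rbf svm": SVC(kernel="rbf"),
              "naive bayes": GaussianNB(),
              "knn": KNeighborsClassifier()}
    return {name: clf.fit(X_train, y_train).score(X_test, y_test)
            for name, clf in models.items()}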

ghost commented 9 years ago

Eh, that's not too bad anyway, and it's good that you tried all these classifiers; something to put in the report. You can also apply PCA to your MFCC features, drop them down to 2D, plot it, and see if they are linearly separable; if not, you might have to apply a kernel or go into a higher dimension. Since you guys are saying the same things in the sound, I suspect there are a lot of similarities between the spectral coefficients, so we might need a much bigger training database to focus the model on recognizing the speaker, not the words. Anyway, it's no big deal.

How'd you do your VAD? Did you end up making a noise class as well?

I converted the video to 25 fps. Do you think you can give me a folder with a bunch of wav files at, say, 5 fps with the labels? So you'd have 1_1.wav, ..., 1_40.wav, ..., 0_69.wav, ..., 2_843.wav, where the first number is the label and the second is the frame number. Or you could put them into a matrix (first row labels) and give me a .mat file; that makes it easier for me to pick them up. It might also shed some light on your issue, knowing which sounds were misclassified. I haven't worked on this project in the past few days, so I don't remember the details, but I should have a little more time on my hands after Tuesday.

mvcisback commented 9 years ago

@ramili Even if they weren't linearly separable, wouldn't KNN (nearest neighbors) account for that?

For VAD I'm using a linear SVM, where I labeled silent frames and non-silent frames. (I'll upload the wavs used to Dropbox.) The classifier returns 0 or 1.

I'm a bit confused about what you want from the audio side. What do you mean by 5 fps audio? Do you mean the audio of 5 frames (since that would be 5 video frames if the video is at 25 fps)?

ghost commented 9 years ago

Yes, I meant to say 5 fps; your frame of analysis is 0.2 sec. 25 fps is probably too much computation. I just want to make sure we have a spatial video of some quality for a demo on Dec 15. I'm not sure how KNN helps with non-linearity. In any case, it's really not a big deal; I think the accuracies you have are good enough.

mvcisback commented 9 years ago

@ramili ok, I'll see what I can do.

Also, KNN should be able to deal with the non-linearity, since all it does is cluster: even if the region is some weird disjoint blob in our space, in theory KNN would be able to represent it. I think 2 things are causing problems: silent frames between pauses in speech, and the fact that the words really do sound similar when we speak.

This is where a probabilistic approach would really shine, since if the video gave a probability of speech and the audio gave a probability of speech, we could combine them for a better estimate.
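
A sketch of that fusion, assuming each modality gives a per-frame posterior P(speech | evidence) and the two are conditionally independent (fuse_speech_probs is a hypothetical helper):

def fuse_speech_probs(p_audio, p_video, prior=0.5):
    # Naive Bayes combination of the two posteriors:
    #   P(speech | a, v) is proportional to P(speech | a) * P(speech | v) / P(speech)
    speech = p_audio * p_video / prior
    silence = (1 - p_audio) * (1 - p_video) / (1 - prior)
    return speech / (speech + silence)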

ghost commented 9 years ago

I see, that makes sense. No rush on that; I probably won't get to it until later this week anyway.

mvcisback commented 9 years ago

@ramili just pushed the preliminary script for classifying audio (classify_audio.py). I'll add running instructions tomorrow.

The quick and dirty of it is:

  • it needs ffmpeg, python2, numpy, scipy, scikit-learn, funcy, click, and a speech features lib.
  • usage: ./classify_audio.py videos/1_stationary_single.mp4 out.mat --silence silence1.wav --speaker1 marcell1.wav --speaker2 faraz1.wav

If you have suggestions lmk

ghost commented 9 years ago

I think that makes sense to me! For the classification part, it would be nice to have a function where you can change the frame of analysis, the number of MFCCs, and the block size, and spit out the overall training cross-validation results for the report. I suggest taking something like 90% of the data for training and the rest for testing.
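
Something like the following could report that number, assuming scikit-learn (ShuffleSplit draws repeated 90/10 partitions; holdout_scores is a hypothetical helper, not part of the script yet):

from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import LinearSVC

def holdout_scores(X, y, n_splits=10, test_size=0.1):
    # Repeated random 90/10 train/test splits; returns the mean and spread
    # of the accuracy for the report.
    cv = ShuffleSplit(n_splits=n_splits, test_size=test_size)
    scores = cross_val_score(LinearSVC(), X, y, cv=cv)
    return scores.mean(), scores.std()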

mvcisback commented 9 years ago

Ok, I just updated the interface of the script to expose features and frames per window.

Usage: classify_audio.py [OPTIONS] INPUT_MP4 OUTPUT_MAT

Options:
  --silence PATH                [required]
  --speaker1 PATH               [required]
  --speaker2 PATH               [required]
  --num-features INTEGER
  --num-frames INTEGER
  --help                        Show this message and exit.

I'm thinking about the best way to support the 90%-of-the-data-for-training option. Not really blocked from an ability standpoint, just trying to make it pretty.

Also, right now it'll return 1 answer per num-frames frames.

So if you say num-frames=2 and there are 30 frames, you'll get back a mat file with a vector of length 15.

If it's not evenly divisible, I think it currently drops the last frames.

mvcisback commented 9 years ago

@ramili Added some random shuffling of the test and training data partitions. Seems to have moved both classifications to 99%.

Need to double-check that nothing is wrong.

mvcisback commented 9 years ago

For the record:

(audio)➜ project git:(master) ✗
./classify_audio.py --noise silence1.wav --speaker1 marcell1.wav --speaker2 faraz1.wav videos/1_stationary_single.mp4 out.mat --verbose
(0.99333333333333329, 0.98999999999999999)

mvcisback commented 9 years ago

@ramili @ffaghri1 Ok, I'm almost ready to close this issue.

  • The script is in the repo as classify_audio.py.
  • Here is the current help documentation, generated by ./classify_audio.py --help:

Usage: classify_audio.py [OPTIONS] INPUT_MP4 OUTPUT_MAT

Options:
  --noise PATH                  [required]
  --speaker1 PATH               [required]
  --speaker2 PATH               [required]
  --num-features INTEGER
  --num-frames INTEGER
  --verbose / --silent
  --fps INTEGER
  --help                        Show this message and exit.

Where noise, speaker1, and speaker2 point to training wavs (I've been using silence1.wav, marcell1.wav, and faraz1.wav, which I uploaded to the Dropbox).

To install:

The output .mat should hold one variable, x, which is a 1d array of length ([number of frames in video]/[num-frames parameter from cli]).

LMK if you need anything clarified or need help installing. (I'll be working on HW4)
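
For the .mat handoff, a sketch of the round trip, assuming scipy.io on the Python side (the labels array here is made up for illustration):

import numpy as np
from scipy.io import savemat, loadmat

labels = np.array([0, 1, 1, 2, 0])        # hypothetical per-window labels
savemat("out.mat", {"x": labels})         # the single variable x the script writes

x = loadmat("out.mat")["x"].ravel()       # reading it back; in MATLAB: load('out.mat')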

ghost commented 9 years ago

Can you attach the output vector in .mat format, for whatever fps you have, with the speaker number and the probability given each class, when you can? I don't need/want to run your code!

mvcisback commented 9 years ago

@ramili sure, but for which video? I don't know if the training data transfers to most of the videos unless they were recorded under similar conditions (hence why we have a calibration phase).

Now that I think about it, I should probably not have static speaker1, speaker2 flags; that way it's easy to use this independent of the number of speakers.

ghost commented 9 years ago

Well, my code for face and spatial audio also only works for one and two people, so no worries on that. Can you send those vectors for the training video? I'll use them to motivate the spatial audio. And if you have time, can you also train two GMM classes based on the same video for your and Faraz's voices, and then use the video of you and me talking in front of the wall to identify the frames where you're talking with label '1' and everything else with '0'? You might have to define a threshold so you can disregard my voice as noise (if p < delta => noise). I'm going to use your vector to blur my face for the frames you're talking in, to motivate the classification part. If you don't have time, that's okay! I didn't really think through that calibration phase! The idea was that if a new user decided to make classes for their own voice and face, they could do that using the calibration phase and then apply the resulting face and speech classes to any other videos.
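
A sketch of that GMM-plus-threshold idea, assuming scikit-learn (GaussianMixture in current releases; the 2014-era API called it GMM); the threshold delta and component count are made-up starting points:

import numpy as np
from sklearn.mixture import GaussianMixture

def label_marcell_frames(X_marcell, X_faraz, X_video, delta=-40.0, n_components=8):
    # Fit one GMM per speaker on calibration MFCCs, then mark a window 1
    # only if Marcell's model wins and clears the log-likelihood threshold;
    # everything else (Faraz, noise) stays 0.
    gm_m = GaussianMixture(n_components).fit(X_marcell)
    gm_f = GaussianMixture(n_components).fit(X_faraz)
    ll_m = gm_m.score_samples(X_video)
    ll_f = gm_f.score_samples(X_video)
    return ((ll_m > ll_f) & (ll_m > delta)).astype(int)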

mvcisback commented 9 years ago

Ugh, so I just tried to plot the results... and they end up being really, really bad...

I think it has to do with the training data sucking. I'm going to take another crack at it when I have time.

I won't have much free time till Tues or Weds though. If one of you wants to meet to try and figure something out, let me know.

mvcisback commented 9 years ago

I think what I might do is hand-train a clap classifier or template matcher and then automatically analyze the video. That should make it easier to create training data.
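
A rough sketch of the template-matching half, assuming scipy.signal and a short hand-labelled clap waveform (find_claps and its threshold are hypothetical):

import numpy as np
from scipy.signal import correlate

def find_claps(samples, clap_template, threshold=0.6):
    # Cross-correlate the track with the clap template and keep positions
    # whose peak-normalized correlation clears the threshold.
    corr = correlate(samples.astype(float), clap_template.astype(float), mode="valid")
    corr /= np.max(np.abs(corr)) + 1e-12
    return np.where(corr > threshold)[0]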

ghost commented 9 years ago

sounds good. Wednesday works for me.

mvcisback commented 9 years ago

Image of features mfcc_features

mvcisback commented 9 years ago

Using PCA to project down to 2 features... I suspect this doesn't work well since there are 3 relevant bands, so mixing is inevitable.

red = silence, blue = marcell, green = faraz

pca_2d
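
The plot comes from something like the following sketch, assuming scikit-learn and matplotlib (plot_pca is a throwaway helper):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_pca(X, y):
    # Project the feature windows onto the first two principal components
    # and colour by class to eyeball separability.
    X2 = PCA(n_components=2).fit_transform(X)
    for label, colour, name in [(0, "red", "silence"),
                                (1, "blue", "marcell"),
                                (2, "green", "faraz")]:
        pts = X2[y == label]
        plt.scatter(pts[:, 0], pts[:, 1], c=colour, label=name, s=5)
    plt.legend()
    plt.show()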

mvcisback commented 9 years ago

@ramili, @ffaghri1

Using log_spectrogram, I was able to produce the following results!

log_spec_plot_2d

I have a video conference scheduled soon, but after that I'll experiment some more

ghost commented 9 years ago

Huh! I just applied GMM; it gives 83% accuracy. And your voice and Faraz's voice are not linearly separable at all!

ffaghri1 commented 9 years ago

No kidding :D Long live linear SVM!

mvcisback commented 9 years ago

Ok, I was able to figure out the bug causing such perfect separability. I was shuffling one of the components rather than the time series. This led to random noise in the principal components in each. Mystery solved.
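
For the record, the fix is to permute whole time windows (rows) together with their labels rather than the feature components (columns); a minimal sketch:

import numpy as np

def shuffle_windows(X, y):
    # Shuffle rows (time windows) and labels with the same permutation;
    # shuffling columns scrambles the features and fakes separability.
    idx = np.random.permutation(len(X))
    return X[idx], y[idx]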

mvcisback commented 9 years ago

pca_2_mfcc pca_2_comp_log_spec

mvcisback commented 9 years ago

Use this for mfcc instead.

pca_2_comp_mfcc