marianne-m / brouhaha-vad

Predicts the level of noise and reverberation on your audiofiles
MIT License
138 stars, 24 forks

Fix last chunk padding #21

Closed. ylacombe closed this 5 months ago

ylacombe commented 6 months ago

Hey @MarvinLvn, as discussed offline, here is a proposal that fixes last chunk computation!

Do you have quick tests that I could implement to check if that works as expected and doesn't break the library?

Note that I also propose a fix for #20, which is actually just caused by a breaking change in the newest pyannote: https://github.com/pyannote/pyannote-audio/blob/48b68022f707fa376e2331dd9331fbdeadd0e2ab/pyannote/audio/core/model.py#L177-L192

ylacombe commented 6 months ago

I don't see any big difference in the C50 distribution when running it on the MLS English test split. Can you think of any way to make sure it does what's intended?

MarvinLvn commented 5 months ago

> Do you have quick tests that I could implement to check if that works as expected and doesn't break the library?

I took a 10-second audio file of clean speech and added reverberation using SoX. I then created several shorter versions by trimming the end, keeping only the first 2/4/6/8/10 seconds.
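For anyone who wants to reproduce the trimming step, here is a minimal sketch using only the Python standard library. The helper names `trim_wav` and `make_prefixes` and the output naming scheme are made up for illustration and are not part of Brouhaha. It assumes an uncompressed PCM WAV input, and the reverberation itself is still expected to be added beforehand with SoX.

```python
import wave
from pathlib import Path


def trim_wav(src_path, dst_path, seconds):
    """Copy the first `seconds` of a PCM WAV file to dst_path.

    Hypothetical helper, not part of Brouhaha. Returns the number of
    frames actually written.
    """
    with wave.open(str(src_path), "rb") as src:
        params = src.getparams()
        # Never request more frames than the file contains.
        n_frames = min(int(seconds * src.getframerate()), src.getnframes())
        frames = src.readframes(n_frames)
    with wave.open(str(dst_path), "wb") as dst:
        # Same channels, sample width and rate as the source; the frame
        # count in the header is corrected when the file is closed.
        dst.setparams(params)
        dst.writeframes(frames)
    return n_frames


def make_prefixes(src_path, durations=(2, 4, 6, 8, 10)):
    """Write one trimmed copy per duration next to the source file."""
    src = Path(src_path)
    outputs = []
    for d in durations:
        dst = src.with_name(f"{src.stem}_{d}s.wav")  # assumed naming scheme
        trim_wav(src, dst, d)
        outputs.append(dst)
    return outputs
```

Calling `make_prefixes` once on the clean file and once on the reverberated file would give the two sets of 2/4/6/8/10-second files to run through the model.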

Here's what we got on the original clean speech segment (no reverb):

1) Before your fix: [image: clean_before_fix]

2) After your fix: [image: clean_after_fix]

There is no difference in this case; Brouhaha returns a high value of C50, regardless of the file duration. This is the expected behavior.

Now, with the reverberated segment:

1) Before your fix: [image: dirty_before_fix]

Brouhaha predicts a high C50 value for the 2-second audio file, whereas it should predict a low C50 value, since we know the speech is highly reverberated.

2) After your fix: [image: dirty_after_fix]

Here, the behavior is consistent and no longer depends on the file duration. We get a low C50 value, as expected.

> Note that I also propose a fix for https://github.com/marianne-m/brouhaha-vad/issues/20, which is actually just caused by a breaking change in the newest pyannote: https://github.com/pyannote/pyannote-audio/blob/48b68022f707fa376e2331dd9331fbdeadd0e2ab/pyannote/audio/core/model.py#L177-L192

Awesome! Thank you so much :)

> I don't see any big difference in the C50 distribution when running it on the MLS English test split. Can you think of any way to make sure it does what's intended?

Hmm, how strange! You should at least have a (quite strong) difference for short audio files that contain reverberation. Can you check the distribution shift for these specific segments?
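One possible way to run that check, as a rough sketch: restrict the comparison to short files and look at how the mean C50 moves between the two versions. The function name `c50_shift_for_short_files`, the record layout, and the 4-second cutoff are all assumptions made for illustration. The per-file C50 values are expected to come from your own runs before and after the fix, and since the filter is on duration only, picking out the reverberant files is left to you.

```python
from statistics import mean


def c50_shift_for_short_files(records, max_duration=4.0):
    """Summarise the C50 change for files no longer than max_duration.

    `records` is an iterable of (duration_s, c50_before, c50_after)
    tuples, one per audio file. Hypothetical helper; the 4-second
    default cutoff is an arbitrary assumption.
    Returns None when no file is short enough.
    """
    short = [(before, after)
             for duration, before, after in records
             if duration <= max_duration]
    if not short:
        return None
    return {
        "n_files": len(short),
        "mean_before": mean(before for before, _ in short),
        "mean_after": mean(after for _, after in short),
        # Negative values mean C50 dropped after the fix.
        "mean_shift": mean(after - before for before, after in short),
    }
```

If the fix behaves as in the SoX experiment above, short reverberated files should show a clearly negative `mean_shift`, while the shift over the full test split may stay small because it is dominated by longer files.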

I had to add a few fixes for the apply function to work (main.py). Are you sure you're using this to predict the C50 values?