adonahue opened this issue 4 years ago
Jordan & I dug a bit deeper into what we are doing right now.
It looks like our code to construct a bandpass filter for human voices doesn't actually do anything. It was supposed to make this filter:
[plot: intended bandpass response]
Luckily, the default parameters for a bandpass filter make this filter:
[plot: default bandpass response]
These are almost the same. The default one is right-shifted by about 5 Hz. So we might be missing the very deepest voices because of this.
We should probably fix it anyway, but it looks like we lucked out.
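To make the comparison concrete, here is a minimal sketch of computing bandpass biquad coefficients with explicit parameters, using the standard Audio EQ Cookbook formulas (the same biquad design the Web Audio API uses). The 300 Hz center, Q of 1, and function names here are illustrative assumptions, not Sibilant's actual settings:

```javascript
// Sketch: bandpass biquad coefficients (Audio EQ Cookbook,
// "constant 0 dB peak gain" variant). The parameter values are
// illustrative, not Sibilant's actual configuration.
function bandpassCoeffs(centerHz, q, sampleRate) {
  const w0 = (2 * Math.PI * centerHz) / sampleRate;
  const alpha = Math.sin(w0) / (2 * q);
  const a0 = 1 + alpha;
  // Normalize all coefficients so a0 === 1.
  return {
    b0: alpha / a0,
    b1: 0,
    b2: -alpha / a0,
    a1: (-2 * Math.cos(w0)) / a0,
    a2: (1 - alpha) / a0,
  };
}

// Magnitude response |H(e^{jw})| at a given frequency, for checking
// which band the filter actually passes.
function magnitudeAt(c, freqHz, sampleRate) {
  const w = (2 * Math.PI * freqHz) / sampleRate;
  const numRe = c.b0 + c.b1 * Math.cos(w) + c.b2 * Math.cos(2 * w);
  const numIm = -(c.b1 * Math.sin(w) + c.b2 * Math.sin(2 * w));
  const denRe = 1 + c.a1 * Math.cos(w) + c.a2 * Math.cos(2 * w);
  const denIm = -(c.a1 * Math.sin(w) + c.a2 * Math.sin(2 * w));
  return Math.hypot(numRe, numIm) / Math.hypot(denRe, denIm);
}

// A voice-band filter centered at 300 Hz (illustrative values):
const coeffs = bandpassCoeffs(300, 1.0, 44100);
console.log(magnitudeAt(coeffs, 300, 44100));  // unity gain at center
console.log(magnitudeAt(coeffs, 5000, 44100)); // strongly attenuated
```

Plotting `magnitudeAt` over a range of frequencies is an easy way to verify whether the explicitly-configured filter and the default one really do differ by only a few Hz.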
Another issue is that we are logging some 0 ms-long utterances. These will probably pick up a lot of background noise. This is caused by lines 66/67 of https://github.com/rifflearning/sibilant/blob/master/sibilant.js#L68
The fix is either to log only events that contain at least two speaking-time samples, or to use the start of the quiet period as the end of the utterance, rather than the timestamp of the last high-volume sample.
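A sketch of what that fix could look like, assuming (hypothetically) we have arrays of timestamped speaking and quiet samples; the function and variable names are illustrative and do not match sibilant.js:

```javascript
// Sketch of the proposed fix. `speakingTimes` and `quietTimes` are
// hypothetical arrays of epoch-ms timestamps for high-volume and
// quiet samples respectively; names do not match sibilant.js.
function utteranceFromTimes(speakingTimes, quietTimes) {
  // Option 1: drop events with fewer than two speaking samples,
  // which is what produces the 0 ms utterances.
  if (speakingTimes.length < 2) {
    return null;
  }
  const start = speakingTimes[0];
  // Option 2: if we observed the onset of quiet, use it as the end
  // time; otherwise fall back to the last high-volume sample.
  const end = quietTimes.length > 0
    ? quietTimes[0]
    : speakingTimes[speakingTimes.length - 1];
  return { start, end, durationMs: end - start };
}
```

For example, `utteranceFromTimes([1000], [])` returns `null` instead of a zero-length utterance, while `utteranceFromTimes([1000, 1400], [1650])` ends the utterance at 1650 (the onset of quiet) rather than at 1400.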
Possible Node libraries we could use instead:
https://www.npmjs.com/package/node-vad
https://www.npmjs.com/package/voice-activity-detection
node-vad looks super simple. voice-activity-detection has more features we could customize, whereas node-vad picks sensible defaults and hides its parameters.
"Voice Activity Detection is based on the method used in the upcoming WebRTC HTML5 standard. Extracted from Chromium for stand-alone use as a library."
@adonahue
We are marking this as ready for review. Our findings are summarized in this document:
https://docs.google.com/document/d/1H17j_gpVpagIeVfVeWZ1XX4sTxlSDEDbQRDr7Rqb_cA/edit?usp=sharing
@jaedoucette - I am realizing that it's not clear to me which of the recommended work should be done, and when. My understanding was that what we had in place was good enough to move forward with new metrics, and wasn't urgent to fix. But from chatting with @jordanreedie, I gather he doesn't see it that way. I may have misunderstood which work is high priority, how it impacts our ability to build new metrics, and whether it's part of the spike or a follow-on effort.
@jordanreedie
I'm still not completely clear on what's outstanding for this card. I think we talked about it, but I've forgotten what we concluded. I believe the conclusion was that VAD may or may not be suitable as a replacement, but that the spike is done because the system works "well enough" for now, even if we want to replace it eventually?
Weigh in here, and then we'll have a note for next time.
I think our conclusion was that, yeah, it seems to work well enough for now, but we should fix the bug that causes us to record zero-length utterances. In the future it would be nice to move to a better-designed, cleaner library, but at the moment that would take too much effort.
@jaedoucette @jordanreedie - so it sounds like there is one actionable story right now, which is the zero-length utterance bug?
@adonahue Yes. I'll make a card for that, and then close out this spike.
Awesome, thank you @jaedoucette .
As a Riff Developer, I am not confident that our speech detection is working correctly, based on the code that I've seen. Specifically, I'm concerned that we are not properly detecting actual speech versus other ambient noise (such as doors slamming, dogs barking, etc), and this results in lower quality data that could impact the accuracy of our analytics.
I would like to look at other speech-detection services (for JavaScript) and determine, either by direct evidence (testing them and reviewing the results) or by evidence provided by others, whether there is a better speech-detection service out there for us to use.
**Story Acceptance Criteria**
A meeting with the rest of the team to report on the following:
- If changes are needed