Proposes a CNN approach to the Keyword Spotting (KWS) task that outperforms the DNN approach by 27~44% relative
Experiments under two constraints: limiting the number of multiplies and limiting the number of parameters
Details
KWS performed on mobile devices must be accurate and fast, with a small memory and compute footprint.
Existing DNN architecture
40-dimensional log-mel filterbank features, computed over 25ms windows with a 10ms frame shift, as input
outputs posteriors for a filler (non-keyword) class and the keyword words (e.g., "answer", "call")
posterior handling combines the frame-level posteriors into a single keyword score
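The posterior-handling step can be sketched as follows. This is a minimal numpy illustration, not the paper's exact method: the window sizes are illustrative, and the combination rule here is a geometric mean of per-word maxima within a sliding window (the function names are mine).

```python
import numpy as np

def smooth_posteriors(post, w_smooth=30):
    """Moving-average smoothing of frame-level posteriors.
    post: (T, K) array of per-frame posteriors over K output classes."""
    T, K = post.shape
    smoothed = np.zeros_like(post)
    for t in range(T):
        lo = max(0, t - w_smooth + 1)
        smoothed[t] = post[lo:t + 1].mean(axis=0)
    return smoothed

def keyword_confidence(smoothed, keyword_idx, w_max=100):
    """Combine smoothed posteriors into a single score per frame:
    geometric mean of each keyword word's maximum posterior
    within a sliding window (filler class is ignored)."""
    T = smoothed.shape[0]
    n = len(keyword_idx)
    scores = np.zeros(T)
    for t in range(T):
        lo = max(0, t - w_max + 1)
        maxima = [smoothed[lo:t + 1, k].max() for k in keyword_idx]
        scores[t] = np.prod(maxima) ** (1.0 / n)
    return scores
```

A detection is fired when the score exceeds a tuned threshold; smoothing first makes the per-frame posteriors far less noisy before they are combined.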
CNN Architecture
Good description of CNNs and typical CNN architectures
In-depth analysis of the effect of varying parameters such as convolution filter size, stride, and pooling in time/frequency
Limit Multiplies
budget of ~500K multiplies, compared to ~9M in a typical CNN
Limit Parameters
by pooling in time and frequency
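The multiply/parameter trade-off can be made concrete with a small cost-counting helper. This is a sketch: it assumes 'valid' convolution with a single input feature map by default, and the example sizes in the usage note are illustrative, not the paper's exact configurations.

```python
def conv_layer_cost(t, f, m, r, n, s=1, v=1, in_maps=1):
    """Parameter and multiply counts for one convolutional layer.
    t, f    : input size in time and frequency
    m, r    : filter size in time and frequency
    n       : number of feature maps (filters)
    s, v    : stride in time and frequency
    in_maps : number of input feature maps
    Assumes 'valid' convolution (no padding)."""
    out_t = (t - m) // s + 1
    out_f = (f - r) // v + 1
    params = n * in_maps * m * r          # weights are shared across positions
    multiplies = params * out_t * out_f   # each output position costs one filter application
    return params, multiplies, (out_t, out_f)
```

For example, `conv_layer_cost(32, 40, 20, 8, 64)` versus `conv_layer_cost(32, 40, 20, 8, 64, v=4)` shows why striding in frequency helps under a multiply budget: the parameter count is unchanged, but multiplies drop by roughly the stride factor.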
Result
Pooling in Frequency
CNN performance improves as the pooling size increases, and saturates at p = 3
Best CNN model outperforms the DNN by 41% relative
Limiting Multiplies
Best performance comes from striding the frequency filter with 50% overlap, with no pooling in frequency
Pooling in frequency is helpful, but the number of feature maps must be reduced drastically to stay within the multiply budget
Limiting Parameters
Striding in time leads to worse performance
Pooling in time improves performance: modeling the relationship between neighboring frames before sub-sampling is more effective than striding in time, which a-priori selects which neighboring frames to filter.
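The stride-vs-pool distinction above can be sketched in 1-D numpy (the function names and the 1-D setting are illustrative): striding filters only every s-th position, subsampling *before* filtering, while pooling filters every position and then takes a max over p neighbors, subsampling *after* filtering.

```python
import numpy as np

def conv1d_valid(x, w, stride=1):
    """1-D 'valid' cross-correlation with the given stride."""
    L = (len(x) - len(w)) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + len(w)], w)
                     for i in range(L)])

def stride_in_time(x, w, s):
    """Subsample a-priori: compute the filter response only at every s-th position."""
    return conv1d_valid(x, w, stride=s)

def pool_in_time(x, w, p):
    """Filter every position first, then max-pool over p neighboring responses."""
    y = conv1d_valid(x, w, stride=1)
    L = len(y) // p
    return np.array([y[i * p : (i + 1) * p].max() for i in range(L)])
```

Both produce outputs of the same length, but pooling sees every filter response before discarding any, at the cost of more multiplies in that layer; this is consistent with the note that pooling in time is used when limiting parameters rather than multiplies.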
Personal Thoughts
Good experimental setup: the effect of changing each variable is isolated and understood
Limiting parameters/multiplies to find the best model under a resource budget is good engineering practice
Link : https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43969.pdf
Authors : Sainath et al. 2015