jtkim-kaist / VAD

Voice activity detection (VAD) toolkit including DNN, bDNN, LSTM and ACAM based VAD. We also provide our directly recorded dataset.

Training the net with smaller batch sizes #29

Open lucashcarneiro opened 5 years ago

lucashcarneiro commented 5 years ago

Hi,

I am a DSP engineer and acoustician. I'm new to the field of machine learning and neural networks, and I am learning a lot working on your code :). I am trying to understand whether it is feasible to squeeze your VAD graph architecture down for real-time use. Right now I am able to retrain the neural net by modifying a few parameters (mainly the frame window and overlap) using the D2 dataset from your paper. However, it looks like the best ways to speed up the graph computation are:

1) To use a less complex feature extractor instead of the MRCG (I intend to investigate this hypothesis later).
2) To retrain the neural net with smaller batch sizes on D2. In your paper you use 4096, which corresponds to several seconds of audio (see the rough arithmetic sketched below).
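For context on point 2, this is the arithmetic I am using to relate batch size (in frames) to audio duration. The frame shifts are assumed values for illustration only, since the real one depends on the window and overlap I configure:

```python
# Rough relation between batch size (in frames) and covered audio:
#     duration ≈ batch_size * frame_shift
# The frame shifts below are assumptions for illustration; the actual value
# depends on the frame window and overlap configured for feature extraction.
for frame_shift_ms in (1.0, 5.0, 10.0):
    for batch_size in (4096, 2048, 1024, 512):
        duration_s = batch_size * frame_shift_ms / 1000.0
        print(f"shift={frame_shift_ms:4.1f} ms  batch={batch_size:4d}  ->  ~{duration_s:6.2f} s")
```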

I've tried this training multiple times with different batch sizes (2048, 1024, 512, ...) and have failed so far. Sometimes the training accuracy reaches a high value, but the model never generalizes to the test data.

I believe something is wrong with the training parameters, which is why I am contacting you for guidance. Have you ever tried training the neural net with smaller batch sizes? Should something be changed in the net architecture or in the training parameters?
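For example, one thing I am unsure about is whether the learning rate should be scaled down together with the batch size. This is the linear-scaling heuristic I have seen described elsewhere; the baseline values are illustrative and not taken from your training script:

```python
# Linear learning-rate scaling heuristic: if the batch size shrinks by a
# factor k, shrink the learning rate by the same factor.
# Baseline values are illustrative only, not taken from the toolkit.
base_batch, base_lr = 4096, 1e-3
for batch in (2048, 1024, 512):
    scaled_lr = base_lr * batch / base_batch
    print(f"batch={batch:4d} -> lr={scaled_lr:.2e}")
```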

Also, what is the recommended size of the test audio relative to the training batch size? More concretely: can I test a graph trained with a large batch size on a very small audio sample? My intuition is that this is not possible, which is why I would like to retrain the net with a smaller batch size, so that I could run sequential tests on an audio buffer of reasonable size (say, 500 ms).
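To make the 500 ms case concrete, this is the kind of bookkeeping I have in mind. The framing parameters (25 ms window, 10 ms shift) and the feature dimension are assumptions on my side, so the exact numbers are only illustrative:

```python
import numpy as np

# Frames produced by a 500 ms buffer under assumed framing parameters
# (25 ms window, 10 ms shift); the real numbers depend on the configured
# window/overlap of the feature extraction.
buffer_s, win_s, shift_s = 0.500, 0.025, 0.010
n_frames = 1 + int((buffer_s - win_s) / shift_s)   # ~48 frames

# If the trained graph expects a fixed batch dimension (e.g. 4096), one
# workaround I am considering is zero-padding the feature matrix and then
# discarding the padded predictions. feat_dim is a placeholder value.
train_batch, feat_dim = 4096, 768
feats = np.random.randn(n_frames, feat_dim).astype(np.float32)  # stand-in features
padded = np.zeros((train_batch, feat_dim), dtype=np.float32)
padded[:n_frames] = feats
# ...run the graph on `padded`, then keep only predictions[:n_frames]
print(n_frames, padded.shape)
```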

Finally, do you have any recommendations for reducing the network's computational complexity without sacrificing too much performance?

I appreciate your time in sharing your knowledge and experience.

Cheers,
Lucas