JunzheJosephZhu opened 4 years ago
Btw, as a side note, I found that for more difficult tasks (in my case, classifying the output of a mixed-speech separator), SincNet trains better when it is given non-overlapping splices that cover the whole signal, rather than one random splice per example per iteration. In my case the accuracy jumped from 83% to 91%.
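For concreteness, here is a minimal sketch of the splicing I mean. The `wav` tensor, the helper name, and the 200 ms / 3200-sample chunk length are illustrative assumptions, not the exact code from my setup:

```python
import torch

def non_overlapping_splices(wav, chunk_len=3200):
    """Split a 1-D waveform into consecutive non-overlapping chunks.

    Instead of drawing one random chunk per utterance per iteration,
    every chunk of the signal becomes a training example.
    """
    n_chunks = wav.shape[0] // chunk_len            # drop the ragged tail
    return wav[: n_chunks * chunk_len].view(n_chunks, chunk_len)

# Example: a 1-second signal at 16 kHz yields 5 chunks of 200 ms each.
wav = torch.randn(16000)
print(non_overlapping_splices(wav).shape)           # torch.Size([5, 3200])
```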
Hi Joseph, thank you for sharing your experience. For the SpeechBrain project (https://speechbrain.github.io/) and for the PASE one (https://github.com/santi-pdp/pase) we didn't perform signal chunking directly. We just use convolutions with stride factors to simulate the sliding windows. In practice, to get one feature vector every 10 ms (160 samples), we use stride factors such as 2 x 2 x 2 x 4 x 5 over the convolutional layers that follow the sinc_conv one. This gives the same performance with a simpler pipeline.
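A sketch of that stride trick, assuming 16 kHz audio. Only the stride factors come from the comment above; the channel count and kernel sizes are hypothetical, not the actual SpeechBrain/PASE code:

```python
import torch
import torch.nn as nn

# Stride factors from the comment above: 2 * 2 * 2 * 4 * 5 = 160,
# i.e. one output frame every 160 samples = 10 ms at 16 kHz.
strides = [2, 2, 2, 4, 5]

ch = 64  # hypothetical channel count (e.g. the number of sinc_conv filters)
layers = []
for s in strides:
    # kernel/padding chosen so each layer divides the length exactly by its stride
    layers += [nn.Conv1d(ch, ch, kernel_size=2 * s + 1, stride=s, padding=s),
               nn.ReLU()]
conv_stack = nn.Sequential(*layers)

x = torch.randn(1, ch, 16000)   # one second of sinc_conv output
print(conv_stack(x).shape)      # torch.Size([1, 64, 100]) -> 100 frames per second
```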
Just to clarify: the paper says the overlap is 10 ms, while the code says the shift is 10 ms. Does that mean that between two consecutive frames both the beginning and the end move by only 10 ms, so that there is 190 ms of overlap?
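In other words, a quick check of the arithmetic I have in mind, assuming 16 kHz audio and the `cw_len=200` / `cw_shift=10` windowing values (in ms) from the repo's cfg files:

```python
fs = 16000                       # sampling rate (Hz)
cw_len, cw_shift = 200, 10       # window length and shift from the cfg, in ms

win = fs * cw_len // 1000        # 3200 samples per frame
hop = fs * cw_shift // 1000      # 160 samples between frame starts

# Frame i spans [i*hop, i*hop + win): both edges move by `hop` samples,
# so two consecutive frames share win - hop samples.
overlap_ms = (win - hop) * 1000 / fs
print(overlap_ms)                # 190.0 -> 190 ms of overlap
```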