Right now there are some issues with the way Discord sends packets, especially the SpeakingStateUpdate events. The SpeakingStateUpdate event is never fired on time, so some audio data is inevitably lost.
According to the developer of Songbird, the first option would apparently work, with a few differences:
Audio packets will always be 3,840 bytes (240 samples), so checking for an empty packet would not work.
A portion of the previous packet's audio may be included in the current packet's audio, so every event dispatch should check (at least) the last two elements and confirm that both equal zero before taking any action.
However, we need to dig into the audio data format first to verify these ideas.
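
As a starting point for that verification, the "check the last two elements" idea could be sketched roughly as follows. This is only a sketch under assumptions: it reads "elements" as the two most recent packets, it assumes each packet has already been decoded to 16-bit PCM samples, and `SilenceTracker` is a hypothetical helper that is not part of Songbird's API. It also assumes silence shows up as exact zeros, which is one of the things that still needs to be confirmed.

```rust
/// Hypothetical helper (not part of Songbird): remembers whether the two
/// most recent decoded packets were silent.
pub struct SilenceTracker {
    /// Silence flags for the last two packets, oldest first.
    recent: [bool; 2],
}

impl SilenceTracker {
    /// Starts with no silent packets recorded.
    pub fn new() -> Self {
        Self { recent: [false; 2] }
    }

    /// Records one decoded packet and returns `true` only when this packet
    /// and the one before it both consist entirely of zero samples.
    ///
    /// Assumption: `samples` is the packet's decoded 16-bit PCM audio.
    /// An empty slice counts as silent, because `all` is true for no items.
    pub fn push(&mut self, samples: &[i16]) -> bool {
        let silent = samples.iter().all(|&s| s == 0);
        // Shift the history: drop the oldest flag, append the newest.
        self.recent = [self.recent[1], silent];
        self.recent[0] && self.recent[1]
    }
}
```

Calling `push` once per received packet and acting only when it returns `true` means a single zero packet that follows leftover audio from the previous packet does not count as the end of speech on its own.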
Theoretical workarounds: