For users to correctly handle packet reordering or jitter caused by network instability, they currently need to disable our decoding features and reimplement the decoding logic themselves -- this gets more complex when handling audio data from several users, which is interleaved and can arrive at different points within each 20ms window. This makes downmixing particularly nasty.
Ideally, we should take this complexity on ourselves. Doing so will require a few changes:
* Packet events must be separated from decode events. We likely want to write and store audio data into an `Arc<[u8]>` to minimise copying between these.
* Decode events must be generated every 20ms, and include audio from every active participant (accessible by SSRC). A rough sketch of how these two event types might look is given after this list.
* Each user needs their own resizable jitter/reorder buffer (likely a deque of `Arc`s). This should have user-configurable target and max depths. If the buffer is allowed to empty (e.g., the user disconnects or stops speaking), then it must build back up to its target occupancy before we decode their packets again. A sketch of such a buffer follows the list as well.
* Packet events should no longer include audio data -- this is to be moved to the decode event.
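
As a rough illustration of the event split, the sketch below uses hypothetical `PacketEvent` and `DecodeEvent` types: the packet event carries RTP metadata only, while the 20ms decode event exposes audio per SSRC behind an `Arc` so that multiple handlers (and a downmixer) can share the same data without copying. The field names and the decoded sample type (`i16` PCM) are assumptions for illustration, not a final API.

```rust
use std::{collections::HashMap, sync::Arc};

/// Fired per received RTP packet: metadata only, no audio payload.
/// The raw Opus payload is written once into shared storage (an
/// `Arc<[u8]>`) and handed to that user's jitter buffer instead.
struct PacketEvent {
    ssrc: u32,
    sequence: u16,
    timestamp: u32,
}

/// Fired once per 20ms tick: decoded audio for every active speaker,
/// keyed by SSRC. Keeping samples behind an `Arc` lets several event
/// handlers read (and downmix) the same buffers without copying.
struct DecodeEvent {
    audio: HashMap<u32, Arc<[i16]>>,
}
```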
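
And a minimal sketch of the per-user jitter/reorder buffer, assuming a `VecDeque` of `Arc`'d payloads with configurable target and max depths; reordering by RTP sequence number is omitted for brevity, and all names here are illustrative rather than a committed design.

```rust
use std::{collections::VecDeque, sync::Arc};

/// Per-user jitter/reorder buffer: a deque of shared packet payloads
/// with user-configurable target and maximum depths.
struct JitterBuffer {
    packets: VecDeque<Arc<[u8]>>,
    target_depth: usize, // refill to this many packets before decoding resumes
    max_depth: usize,    // oldest packets are dropped beyond this
    filling: bool,       // true while rebuilding after the buffer drained
}

impl JitterBuffer {
    fn new(target_depth: usize, max_depth: usize) -> Self {
        Self {
            packets: VecDeque::with_capacity(max_depth),
            target_depth,
            max_depth,
            filling: true,
        }
    }

    /// Store a payload; the oldest packet is discarded once `max_depth` is hit.
    fn push(&mut self, payload: Arc<[u8]>) {
        if self.packets.len() == self.max_depth {
            self.packets.pop_front();
        }
        self.packets.push_back(payload);
        if self.filling && self.packets.len() >= self.target_depth {
            self.filling = false;
        }
    }

    /// Called on each 20ms tick. Returns `None` while the buffer is
    /// rebuilding towards its target occupancy (e.g. after the user
    /// stopped speaking), so no decode happens for this user yet.
    fn pop(&mut self) -> Option<Arc<[u8]>> {
        if self.filling {
            return None;
        }
        let payload = self.packets.pop_front();
        if payload.is_none() {
            // Drained: pause decoding until we are back at target depth.
            self.filling = true;
        }
        payload
    }
}
```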