The duplicated silence problem is an artefact of some soft 'phones that exist in the wild.
The scenario is a soft 'phone being put on hold and then taken off hold by the other end. When they come off hold, these particular soft 'phones sometimes send two RTP packets as follows:
The first packet carries the next timestamp in sequence (e.g. 160 ticks after the last packet sent before the 'phone went on hold) and some audio that was cut off when the 'phone was set to receive-only. Its arrival is delayed by the entire hold period.
The second packet arrives with no further delay, but its timestamp incorporates the silence period, i.e. it has advanced by the length of the hold.
A real receiving UA drops the audio in the first, late-arriving packet, and the silence is implicit. extractaudio tries to approximate this with silence interpolation (as far as it can, given that it never discards audio). However, it has two triggers for interpolated silence, and this scenario fires both of them. One trigger is the delayed arrival of audio with an old timestamp. The other is the gap in the timestamps. The result is that the silence period is effectively doubled in the output audio file. (Heartbeats have a slight effect on this, and also cause timestamps to decrease if the hold period is longer than 800 seconds.)
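To make the doubling concrete, here is a small worked example. It is a simplified model of the two triggers, not the actual extractaudio code: the 8 kHz clock, the 10-second hold, the packet values and the function names are all assumptions made for illustration.

```python
CLOCK_HZ = 8000      # assumed RTP clock rate
PACKET_TICKS = 160   # assumed 20 ms of audio per packet

def arrival_silence(prev, cur):
    """Trigger 1 (modelled): the packet arrived later than its timestamp predicts."""
    arrival_gap = (cur["arrival_ms"] - prev["arrival_ms"]) * CLOCK_HZ // 1000
    ts_gap = cur["ts"] - prev["ts"]
    return max(0, arrival_gap - ts_gap)

def timestamp_silence(prev, cur):
    """Trigger 2 (modelled): the sender's timestamps skipped ahead."""
    return max(0, cur["ts"] - prev["ts"] - PACKET_TICKS)

def old_total_silence(packets):
    """Sum both triggers over consecutive packets, as in the old behaviour."""
    total = 0
    for prev, cur in zip(packets, packets[1:]):
        total += arrival_silence(prev, cur) + timestamp_silence(prev, cur)
    return total

packets = [
    {"ts": 16000, "arrival_ms": 2000},   # last packet before the hold
    {"ts": 16160, "arrival_ms": 12020},  # next timestamp, arrives 10 s late
    {"ts": 96320, "arrival_ms": 12040},  # on time, timestamp includes the hold
]

doubled = old_total_silence(packets)
print(doubled, doubled / CLOCK_HZ)  # 160000 20.0
```

In this model the delayed packet contributes 80000 samples through the arrival-time trigger and the following packet contributes another 80000 through the timestamp trigger, so a 10-second hold comes out as 20 seconds of silence.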
This rework adjusts the silence interpolation as follows:
Silence interpolated from sender timestamp gaps is always generated.
Silence interpolated from arrival time delay is saved, and is only detected if the delayed packet actually carries some audio data (thereby excluding heartbeats).
Arrival-time silence is added after the audio in the delayed packet, rather than before it; that is, it is prepended to the next packet of audio.
If sender-indicated silence follows in the next packet, it is deducted from the amount of arrival-time silence.
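The rules above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in extractaudio: the `Packet` type, the `interpolate` function, the 8 kHz clock, and the choice to measure gaps against the immediately preceding packet are all hypothetical.

```python
from dataclasses import dataclass

CLOCK_HZ = 8000  # assumed RTP clock rate

@dataclass
class Packet:
    ts: int          # RTP timestamp, in clock ticks
    arrival_ms: int  # capture time, in milliseconds
    samples: int     # audio samples carried (0 for a heartbeat)

def interpolate(packets):
    """Return (kind, sample_count) chunks, where kind is "audio" or "silence"."""
    out = []
    pending = 0  # saved arrival-time silence that has not been written yet
    prev = None
    for pkt in packets:
        ts_silence = 0
        arrival_silence = 0
        if prev is not None:
            ts_gap = pkt.ts - prev.ts
            # Silence from a sender timestamp gap is always generated.
            ts_silence = max(0, ts_gap - prev.samples)
            # Arrival-time delay is only detected on packets carrying audio.
            if pkt.samples > 0:
                arrival_gap = (pkt.arrival_ms - prev.arrival_ms) * CLOCK_HZ // 1000
                arrival_silence = max(0, arrival_gap - ts_gap)
        # Sender-indicated silence is deducted from the saved arrival-time silence.
        pending = max(0, pending - ts_silence)
        if ts_silence:
            out.append(("silence", ts_silence))
        if pkt.samples > 0:
            # Whatever saved silence remains is prepended to this audio.
            if pending:
                out.append(("silence", pending))
                pending = 0
            out.append(("audio", pkt.samples))
        # Silence detected on this packet is saved, so it lands after its audio.
        pending += arrival_silence
        prev = pkt
    return out

hold = [
    Packet(ts=16000, arrival_ms=2000, samples=160),   # last packet before the hold
    Packet(ts=16160, arrival_ms=12020, samples=160),  # delayed by a 10 s hold
    Packet(ts=96320, arrival_ms=12040, samples=160),  # timestamp includes the hold
]
chunks = interpolate(hold)
print(chunks)
```

With these assumed packets the delayed audio is written first, then a single 80000-sample (10-second) silence, then the next audio. The arrival-time silence saved from the second packet is cancelled entirely by the sender-indicated silence in the third, so the hold is counted once.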