G.711 20ms packetized TTS streams on outbound channels are chopped into 10ms on Asterisk 13

sfgeorge commented 4 years ago

Synopsis

Any currently available version of asterisk-unimrcp capable of being installed with Asterisk 13 produces RTP that violates the G.711 standard of 20ms packetization when TTS is sent on an outbound stream.

Asterisk/asterisk-unimrcp takes valid 20ms audio packets received from the TTS server and transcodes them into 10ms packets before sending them out.

Versions Tested

Asterisk	asterisk-unimrcp	SIP Channel Driver	Symptoms
asterisk-1.8.13.0	asterisk-unimrcp-1.3.1	chan_sip	Not affected
asterisk-13.19.1	asterisk-unimrcp-1.3.1	chan_sip	Affected
asterisk-13.19.1	asterisk-unimrcp-1.5.2	chan_sip	Affected
asterisk-13.19.1	asterisk-unimrcp-1.5.2	chan_pjsip	Not affected

Requirements to Reproduce

10ms packetization occurs with either MRCPSynth() or SynthAndRecog() only during TTS playback, not audio file. (If SynthAndRecog() plays an audio file immediately following the TTS the audio stream, that audio file is streamed with proper 20ms packetization.)
Occurs invoking one of the above 2 applications on an outbound leg of a call - either a pure outbound call (An Asterisk Originate action) or a bridged call using Dial(). Does not occur on an inbound into asterisk
Occurs with Asterisk 13, but only during asterisk-unimrcp involvement for TTS.
Occurs only with the deprecated chan_sip channel driver; switching to chan_pjsip resolves the issue.

TTS Server -> Asterisk/asterisk-unimrcp Audio Stream

TTS Server to Asterisk/asterisk-unimrcp Audio Stream: Graph of negligible jitter

TTS Server to Asterisk/asterisk-unimrcp Audio Stream: Packets

Above: TTS Server -> Asterisk/asterisk-unimrcp Audio Stream with proper 20ms packets and negligible sub-millisecond jitter.

Same Audio Stream transcoded and sent-out by Asterisk/asterisk-unimrcp

Asterisk/asterisk-unimrcp to Outbound Audio Stream: Graph of Extreme Jitter

Asterisk/asterisk-unimrcp to Outbound Audio Stream: Packets

Above: Asterisk/asterisk-unimrcp -> Outbound Audio Stream transitioning to improper 10ms packets and high jitter when TTS is sent out.

Configuration

mrcp.conf

[general]
default-tts-profile = tts-mrcp1
log-level = DEBUG
max-connection-count = 100
offer-new-connection = 1
; rx-buffer-size = 1024
; tx-buffer-size = 1024
request-timeout = 1000

[tts-mrcp1]
version = 1
server-port = 554
server-ip = tts.myserver.com
rtp-ip = 192.168.1.2;
rtp-port-min = 40000
rtp-port-max = 65535
speechsynth = synthesizer

Note: I've also tried with the following settings appended to the [tts-mrcp1] section, with no change in behavior

; === Jitter buffer settings ===
playout-delay = 0
max-playout-delay = 0
; === RTP settings ===
ptime = 20
codecs = PCMU telephone-event/101/8000

extensions.conf

Inbound calls are configured to route to [corrupted-audio-reproducible-case]

[corrupted-audio-reproducible-case]
exten => _X.,1,Answer
same => n,Wait(1)
; The following TTS media to the Caller is received and proxied in proper packets of ptime 20ms samples (size 160).
same => n,MRCPSynth(Hello Caller.  Please wait while we attempt to connect you to the Recipient.,p=default)
same => n,Set(outbound_proxy_ip=192.168.1.3) ; customize this
same => n,Set(recipient_number=15551231234) ; customize this
same => n,Set(dial_timeout=120)
same => n,Set(call_time_limit=28800000)
same => n,Set(recipient_welcome_macro=recipient-whisper)
same => n,Dial(SIP/${outbound_proxy_ip}/${recipient_number},${dial_timeout},L(${call_time_limit})M(${recipient_welcome_macro}))

[macro-recipient-whisper]
exten => s,1,NoOp(Note: Playing welcome whisper message to the recipient)
same => n,Wait(1)
; The following TTS media to the Recipient is received from the TTS server in ptime 20ms samples (size 160),
; but is unexpectedly transcoded into ptime 10ms samples (size 80) upon retransmission.
same => n,MRCPSynth(Hello Recipient.  We are connecting you to the Caller now.  Enjoy your conversation.,p=default)
same => n,NoOp(Note: Finished playing welcome whisper message to the recipient)

Notes

With verbose RTP debugging in Asterisk enabled (sudo asterisk -rx 'rtp set debug on'), it is clear to see that Asterisk transitions from ptime 20ms packets (size 160) to ptime 10ms packets (size 80) while proxying/transcoding TTS:

[...]

[2019-10-09 13:06:44.888] VERBOSE[25240][C-00000007] res_rtp_asterisk.c: Sent RTP packet to 192.168.1.3:39348 (type 00, seq 022490, ts 059520, len 000160) [2019-10-09 13:06:44.908] VERBOSE[25240][C-00000007] res_rtp_asterisk.c: Sent RTP packet to 192.168.1.3:39348 (type 00, seq 022491, ts 059680, len 000160) [2019-10-09 13:06:44.928] VERBOSE[25240][C-00000007] res_rtp_asterisk.c: Sent RTP packet to 192.168.1.3:39348 (type 00, seq 022492, ts 059840, len 000160) [2019-10-09 13:06:44.948] VERBOSE[25240][C-00000007] res_rtp_asterisk.c: Sent RTP packet to 192.168.1.3:39348 (type 00, seq 022493, ts 060000, len 000160) [2019-10-09 13:06:44.968] VERBOSE[25240][C-00000007] res_rtp_asterisk.c: Sent RTP packet to 192.168.1.3:39348 (type 00, seq 022494, ts 060160, len 000160) [2019-10-09 13:06:45.411] VERBOSE[25240][C-00000007] app_dial.c: SIP/192.168.1.3-0000000c answered SIP/foobox-0000000b [2019-10-09 13:06:45.411] VERBOSE[25240][C-00000007] pbx.c: Executing [s@macro-recipient-whisper:1] NoOp("SIP/192.168.1.3-0000000c", "Note: Playing welcome whisper message to the recipient" ) in new stack [2019-10-09 13:06:45.411] VERBOSE[25240][C-00000007] pbx.c: Executing [s@macro-recipient-whisper:2] Wait("SIP/192.168.1.3-0000000c", "1") in new stack [2019-10-09 13:06:46.411] VERBOSE[25240][C-00000007] pbx.c: Executing [s@macro-recipient-whisper:3] MRCPSynth("SIP/192.168.1.3-0000000c", "Hello Recipient. We are connecting you to the Caller now. Enjoy your conversation.,p=default") in new stack

[...]

[2019-10-09 13:06:46.433] NOTICE[22592] src/rtsp_client.c: Receive RTSP Data 192.168.1.2:44926 <-> 192.168.1.6:554 [131 bytes] RTSP/1.0 200 OK CSeq: 2 Session: 3264602811 Content-Type: application/mrcp Content-Length: 30

MRCP/1.0 1 200 IN-PROGRESS

[2019-10-09 13:06:46.433] NOTICE[22592] src/rtsp_client.c: Pop In-Progress RTSP Request 0x7fe4e400d6c8 CSeq:2 [2019-10-09 13:06:46.433] DEBUG[22592] src/apt_task.c: Signal Message to [MRCP Client] [0x7fe4d4003c60;1;2] [2019-10-09 13:06:46.433] DEBUG[22592] src/apt_poller_task.c: Wait for Messages [tts-mrcp1] [2019-10-09 13:06:46.433] DEBUG[22589] src/apt_task.c: Process Message [MRCP Client] [0x7fe4d4003c60;1;2] [2019-10-09 13:06:46.433] NOTICE[22589] src/mrcp_client_session.c: Raise App MRCP Response TTS-12 <3264602811> [2019-10-09 13:06:46.433] DEBUG[22589] app_mrcpsynth.c: (TTS-12) REQUEST IN PROGRESS [2019-10-09 13:06:46.433] DEBUG[22589] speech_channel.c: (TTS-12) READY ==> PROCESSING [2019-10-09 13:06:46.433] DEBUG[22589] src/apt_consumer_task.c: Wait for Messages [MRCP Client] [2019-10-09 13:06:46.468] VERBOSE[22591] res_rtp_asterisk.c: Sent RTP packet to 192.168.1.3:39348 (type 00, seq 022495, ts 184216, len 000080) [2019-10-09 13:06:46.478] VERBOSE[22591] res_rtp_asterisk.c: Sent RTP packet to 192.168.1.3:39348 (type 00, seq 022496, ts 184296, len 000080) [2019-10-09 13:06:46.488] VERBOSE[22591] res_rtp_asterisk.c: Sent RTP packet to 192.168.1.3:39348 (type 00, seq 022497, ts 184376, len 000080) [2019-10-09 13:06:46.498] VERBOSE[22591] res_rtp_asterisk.c: Sent RTP packet to 192.168.1.3:39348 (type 00, seq 022498, ts 184456, len 000080) [2019-10-09 13:06:46.508] VERBOSE[22591] res_rtp_asterisk.c: Sent RTP packet to 192.168.1.3:39348 (type 00, seq 022499, ts 184536, len 000080)

[...]

achaloyan commented 4 years ago

Thanks for such a detailed report. Being extremely busy over these days, I unfortunately could not find the time to investigate the problem, but here are some comments in case they are of any help.

Apparently, something has changed internally in Asterisk between 1.8 and 13 versions in the way audio data is processed.

MPF callbacks in UniMRCP always carry audio data in 10 ms frames, irrespective of RTP ptime. In other words, if ptime is 20ms, then two frames are required to send an RTP packet out, and vice versa, an RTP packet of 20ms results in two frames of 10ms.

Given the logic above, 10ms frames are provided to Asterisk. It is up to Asterisk to compose an RTP stream for the incoming SIP leg based on negotiated parameters, including codec, ptime, etc, which does not seem to be properly reflected, based on your observations.

If you change an internal definition of CODEC_FRAME_TIME_BASE in mpf_codec_descriptor.h from 10 to 20, that would make the difference, I guess. However, this would not be a proper solution in general.

sfgeorge commented 4 years ago

@achaloyan No need to apologize... that's excellent news -- my preliminary debugging was leaning me to the same conclusion and it's a great help to have some validation in that direction.

I noticed that mpf_codec_frame_samples_calculate() is calculated based on CODEC_FRAME_TIME_BASE. I will explore whether I can update mpf_codec_frame_samples_calculate() (and related functions) to be driven by the current ptime rather than by CODEC_FRAME_TIME_BASE.

sfgeorge commented 4 years ago

It's looking like there are 2 potential fixes here. I feel like both should eventually be implemented, but solving either one will likely resolve our immediate issue.

Ensure that Outbound channels are inheriting all of the appropriate settings (ptime and smoothing in particular) as Inbound channels do. This 10ms phenomenon for us only occurs on Outbound channels -- and a key difference I'm noticing is that Inbound channels are enabled to use Asterisk's smoother, whereas Outbound channels are following the branch of logic in which smoothing is not applied. I believe the smoother is applying the critical last-chance healing of placing those smaller frames into properly-sized ones.
Change UniMRCP's internal jitter-buffer globbling clock/interval from a hard set 10ms to the value of ptime on the receiving channel. There's no sense in doing work every 10ms when we only want to ship out a packet every 20ms (unless we are trying to correct some serious jitter).

I plan to propose these as 2 distinct PRs - so we can discuss whether or not 1 or both fixes are appropriate.

achaloyan commented 4 years ago

There are quite significant changes introduced in Asterisk 13 in the way media is handled internally and in the public API as well. It is not clear to me what is the key difference, though.
This is not as obvious as it may seem. Please note that RTP receiver should be capable of receiving packets even with different size (ptime) in the scope of the same RTP session, regardless what ptime is. Furthermore, according RFC 3264 ptime indicates a desired packetization interval that the offerer would like to receive.

Anyway, I am certainly open to discuss any suggestions you may have.

sfgeorge commented 4 years ago

I've updated the description of this defect to note that it only occurs if you are using Asterisk's deprecated chan_sip channel driver; switching to chan_pjsip is one way to resolve the issue.

achaloyan commented 4 years ago

Well, thanks for the note.

unispeech / asterisk-unimrcp