crash in stream_write callback

achaloyan commented 9 years ago

Originally reported on Google Code with ID 75

What steps will reproduce the problem?
1. Run latest FreeSWITCH build with mod_unimrcp
2. Using MRCPv1 with Nuance Speech Server 5.0
3. Run about 500-700 TTS requests over a few hours
4. stream_write() callback is made with corrupted stream->obj data

Please use labels and text to provide additional information.

I have FSMOD-39 (http://jira.freeswitch.org/browse/FSMOD-39) opened to
track this in FreeSWITCH.  Still not sure whether the module or UniMRCP is
causing the problem.  I suspect it is mod_unimrcp's problem since it is
stream->obj that is corrupted, but I opened this ticket in case anyone has
any ideas of what might cause the problem.

Here's a backtrace:
(gdb) bt
#0  0x0000003e99008950 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00002aaab55dfdcb in speech_channel_write (schannel=0x2aaab01a2ea0,
data=0x2aaaac1a5568, len=0x416b6010) at mod_unimrcp.c:1379
#2  0x00002aaab55dfede in synth_stream_write (stream=<value optimized out>,
frame=<value optimized out>) at mod_unimrcp.c:1882
#3  0x00002aaab5626fec in mpf_bridge_process (object=<value optimized out>)
at src/mpf_bridge.c:52
#4  0x00002aaab55f591e in mpf_context_process (context=0x2aaaac1a4330) at
src/mpf_context.c:432
#5  0x00002aaab55f5959 in mpf_context_factory_process (factory=0x101cf950)
at src/mpf_context.c:103
#6  0x00002aaab55f5610 in timer_thread_proc (thread=0x101cc1d0, data=<value
optimized out>) at src/mpf_scheduler.c:207
#7  0x0000003e990064a7 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003e984d3c2d in clone () from /lib64/libc.so.6

(gdb) f 1
#1  0x00002aaab55dfdcb in speech_channel_write (schannel=0x2aaab01a2ea0,
data=0x2aaaac1a5568, len=0x416b6010) at mod_unimrcp.c:1379
1379    mod_unimrcp.c: No such file or directory.
        in mod_unimrcp.c
(gdb) p *len
$1 = 160
(gdb) p *((char *)data)
$2 = 0 '\0'
(gdb) p *schannel
$3 = {name = 0x6f732e646f72702e <Address 0x6f732e646f72702e out of bounds>,
profile = 0x2e65746962646e75, type = 980250467,
  application = 0x736568746e79732f, unimrcp_session = 0x5354522072657a69,
unimrcp_channel = 0x430a0d302e312f50,
  memory_pool = 0xa0d31203a716553, mutex = 0x726f70736e617254, cond =
0x412f505452203a74, state = 1966821462,
  audio_queue = 0x6e65696c633b7473, channel_opened = 1869635444, rate =
29810, silence = 758395445, params = 0x6e65746e6f430a0d,
  data = 0x203a657079542d74}
(gdb) p *schannel->unimrcp_channel
Cannot access memory at address 0x430a0d302e312f50
(gdb) p *schannel->unimrcp_session
Cannot access memory at address 0x5354522072657a69

Reported by cmrienzo on 2010-03-04 16:04:10

achaloyan commented 9 years ago

FYI this is with UniMRCP r1401.

Reported by cmrienzo on 2010-03-04 16:11:32

achaloyan commented 9 years ago

I noticed the app is much more stable when I disabled RTCP.  It has processed 1600
calls and 986 TTS requests with no crashes.

This reminds me of the first crash.  I noticed some mpf_timer activity going on after
a session was terminated.  Perhaps something didn't get cleaned up completely?

Reported by cmrienzo on 2010-03-04 19:44:20

achaloyan commented 9 years ago

Chris, thanks for the detailed analyzes.

I've not got a chance to look into this closer yet, but just wanted to suggest the
same to disable RTCP. If it helps, then either the problem is in processing of RTCP
messages or in timers. May I suggest to enable RTCP back, but disable RTCP receiver
by setting rx-resolution to 0
    <param name="rtcp-rx-resolution" value="0"/>
If the problem comes back, then the issue is in timers. 
I'm about to finalize my changes in trunk soon. Then, I'm planning to perform several
stress tests and will try with FS either.

Reported by achaloyan on 2010-03-04 20:25:13

Labels added: Type-Defect, Priority-Medium, OpSys-All, Component-FreeSWITCH

achaloyan commented 9 years ago

So, it just crashed again without the RTCP.  I guess I try again with valgrind running.

Reported by cmrienzo on 2010-03-04 20:46:51

achaloyan commented 9 years ago

Then, the problem could be in the same timeouts discussed in Issue-72.

I've noticed your comment on Jira
> 1. The callback is not expected as the MRCP session was destroyed

If callback is being called, it means session isn't destroyed yet internally in
client stack.

Reported by achaloyan on 2010-03-04 20:54:56

achaloyan commented 9 years ago

I found something fishy on the FS side... looks like a call terminates without
cleaning up the session.  I believe this is the cause of the problem since my
stream->obj was allocated off of the call's pool.

Reported by cmrienzo on 2010-03-05 13:04:25

achaloyan commented 9 years ago

The same with me. I've managed to reproduce it 2 times today. I made about 4000 calls
with SIPp -> FreeSWITCH -> UniMRCPServer and got the same backtrace. Analyzing logs,
I found a few suspicious things. See below filtered out, TTS-9973 related part.

 4098010  2010-03-05 16:47:48.144169 [DEBUG] mod_unimrcp.c:622 (TTS-9973) audio queue
created
 4098011  2010-03-05 16:47:48.144169 [NOTICE] mrcp_client.c:549 Create MRCP Handle
0x84a82b0 [unimrcpserver-mrcp1]
 4098012  2010-03-05 16:47:48.144169 [INFO] mrcp_client_session.c:142 Create Channel
0x84a82b0 <new>
 4098015  2010-03-05 16:47:48.144169 [INFO] mrcp_client_session.c:398 Receive App
Request 0x84a82b0 <new> [2]
 4098017  2010-03-05 16:47:48.144169 [INFO] mrcp_client.c:901 Add MRCP Handle 0x84a82b0
 4098018  2010-03-05 16:47:48.144169 [DEBUG] mrcp_client_session.c:1203 Dispatch
Application Request 0x84a82b0 <new> [2]
 4098019  2010-03-05 16:47:48.144169 [NOTICE] mrcp_client_session.c:718 Add Control
Channel 0x84a82b0 <new@speechsynth>
 4098020  2010-03-05 16:47:48.144169 [DEBUG] mrcp_client_session.c:762 Add RTP
Termination 0x84a82b0 <new>
 4098221  2010-03-05 16:47:48.184411 [DEBUG] mrcp_client_session.c:1039 On
Termination Add 0x84a82b0 <new>
 4098222  2010-03-05 16:47:48.184411 [DEBUG] mrcp_client_session.c:1039 On
Termination Add 0x84a82b0 <new>
 4098223  2010-03-05 16:47:48.184411 [INFO] mrcp_client_session.c:420 Send Offer
0x84a82b0 <new> [c:0 a:1 v:0]
 4098364  2010-03-05 16:47:48.208263 [INFO] mrcp_client_session.c:158 Receive Answer
0x84a82b0 <0532c32ad6abe740> [c:0 a:1 v:0]
 4098365  2010-03-05 16:47:48.208263 [DEBUG] mrcp_client_session.c:1160 Modify
Termination 0x84a82b0 <0532c32ad6abe740>

Pay attention to the delay here. You may see the same timed out request, but the big
question is what was the reason. 

 4115033  2010-03-05 16:47:58.152993 [ERR] mod_unimrcp.c:963 (TTS-9973) Timed out
waiting for channel to be ready
 4116285  2010-03-05 16:47:58.328997 [DEBUG] mrcp_client_session.c:1043 On
Termination Modify 0x84a82b0 <0532c32ad6abe740>

The expected "On Termination Modify" was eventually received. I suspect the MPF
callback was blocked from mod_unimrcp context, at least I see that all the other
threads of client stack were busy processing other messages.

 4116286  2010-03-05 16:47:58.328997 [INFO] mrcp_client_session.c:459 Raise App
Response 0x84a82b0 <0532c32ad6abe740> [2] SUCCESS [0]
 4116287  2010-03-05 16:47:58.328997 [DEBUG] mod_unimrcp.c:1752 (TTS-9973)
SYNTHESIZER channel is ready, codec = LPCM, sample rate = 8000
 4116288  2010-03-05 16:47:58.328997 [DEBUG] mod_unimrcp.c:1469 (TTS-9973) CLOSED ==>
READY

BTW, I have been doing a few other tests in parallel now and haven't seen this issue
on other products yet (not tried Asterisk though). So I suppose it's FS related.
Hope this helps a bit.

Reported by achaloyan on 2010-03-05 13:39:51

achaloyan commented 9 years ago

I have a fix to the FS core that I'm going to try today.  Found some cases where a FS
call can terminate without cleaning up its mod_unimrcp session.

Reported by cmrienzo on 2010-03-05 14:05:56

achaloyan commented 9 years ago

I'm going to make a change to mod_unimrcp so that it does not timeout when destroying
the session.  It will be better to let it stay stuck.  At least I can attach to the
process then and see what is going on.

Plus, we are now seeing how bad things can get if I let freeswitch destroy its
session without cleaning up the MRCP session.

Reported by cmrienzo on 2010-03-05 15:52:15

achaloyan commented 9 years ago

Agreed, as it's indeed a significant error condition.

Reported by achaloyan on 2010-03-05 16:27:15

achaloyan commented 9 years ago

Hello Arsen,

Tony checked in a fix to the FS core to prevent my crash and I checked in a fix to
mod_unimrcp to no longer allow timeout when terminating the session.

Could you send me a copy of your test setup?  SIPp script + command line args, FS
dialplan with TTS request, and UniMRCP server config.  I'll try to reproduce in the
lab.

Reported by cmrienzo on 2010-03-09 19:00:59

achaloyan commented 9 years ago

Hello Chris,

I'm glad the issue has been fixed and will try to re-test it on my end during the days.
At the moment, I have no access to the testbed to provide you with the exact setup.
Perhaps, there was nothing extraordinary. As far as I remember, I tried 5 calls per
sec with 3 sec call duration and typical FS TTS dialplan.

Reported by achaloyan on 2010-03-09 20:36:03

achaloyan commented 9 years ago

Just to let you know that I have also implemented Chris' fix into app_unimrcp.

Reported by thirionjwf on 2010-03-10 08:11:04

achaloyan commented 9 years ago

FYI,
I have made 50000 calls with FreeSWITCH trunk (r16966) today ... no more crashes! It
crashed on about 4000 calls before.

./sipp -r 5 -rp 1000 -d 2500

<extension name="unispeak">
   <condition field="destination_number" expression="^7004$">
        <action application="answer"/>
        <action application="set" data="tts_engine=unimrcp"/>
        <action application="speak" data="Hello world from FreeSWITCH"/>
   </condition>
</extension>

The same is true for Asterisk. I have tried the latest app_unimrcp (r1579) and
reached the same 50000 calls without any noticeable issues using MRCPSynth.

I'll try to do more tests and document setup when possible.

Reported by achaloyan on 2010-03-11 14:17:09

achaloyan commented 9 years ago

Chris, I have added you as a committer to the project. 
http://code.google.com/p/unimrcp/people/list

So you have privileged access to svn, wiki, tracker, etc. This would allow us to
collaborate more efficiency, I guess.

Glad to work with you!
Arsen

Reported by achaloyan on 2010-03-11 14:26:36

achaloyan commented 9 years ago

Thanks for the vote of confidence, Arsen. :)

I am currently ramping up the load on my production server and I haven't seen any
issues since the last round of fixes, either.  Once FreeSWITCH finishes their 1.0.5
release, I'll resume work on getting mod_unimrcp up to date with the latest rev of
UniMRCP.

Reported by cmrienzo on 2010-03-12 02:01:39

achaloyan commented 9 years ago

Well and thanks for the update, Chris.

So I consider the issue has been fixed.

Reported by achaloyan on 2010-03-12 12:39:01

Status changed: Fixed

achaloyan commented 9 years ago

Reported by thirionjwf on 2010-12-13 12:29:10

Status changed: Verified

unispeech / unimrcp

crash in stream_write callback #75