unispeech / unimrcp

Open source cross-platform implementation of MRCP protocol
http://www.unimrcp.org
Apache License 2.0
364 stars 161 forks source link

crashes and garbage logging when running unimrcpclient tts demo #33

Closed achaloyan closed 9 years ago

achaloyan commented 9 years ago

Originally reported on Google Code with ID 33

What steps will reproduce the problem?
1. I modified unimrcp tts client to also test pause/resume/stop (see 
attached zip file)
2. Do (almost) like a monkey:
    in the client create multiple sessions fast:
    type: run synth and create multiple sessions (3 or 4 in my case will 
do) by hitting up arrow and enter (command line recall)
3. I experienced several crashes. I was running the client using the VS08 
debugger. 

What version of the product are you using? On what operating system?
Latest trunk release on win32 (r984)

Please provide any additional information below.

The attached unimrcp client log shows 
(1) garbage when logging the phrase of a 
    "Receive SIP Event [nua_i_state] Status 200 "
(2) The log ends when I experienced a crash in libsofia_sip_ua.dll

----

The above test was meant to just test my tts plugin (which I updated to 
the latest revisions) rather than testing the unimrcp client, but anyway, 
it's pretty well reproducable (on my 5 year old and slow laptop) so I 
thought I should report this. It might be related to issue with sofia sip 
as described in issue 20 !?

Additionally, I experienced some crashes before the one documented here 
above. These were probably not related to sofia-sip
I was able to identify some of them. I changed the code to prevent the 
crashes in the future as these were all dereferences of pointers to 0. I 
did not figure out what the exact scenario's were (it was monkey testing 
as described above) and I did not save any loggings.

The (modified) sources are in the attached zip file.

Reported by garmt.noname on 2009-06-13 21:09:25


achaloyan commented 9 years ago
Thanks for the issue report. If you experienced an issue, which is more or less
reproducibly, it makes sense to report it.
Running demo synth with your modifications, I've not got Sofia's status phrase
garbled, but I observed several issue, which should be separated.
1. Concurrency and delays caused by Sleep()
You should not use Sleep() in a thread serving multiple sessions, channels. This will
definitely not work the way you expected. However, if this helps identify problems
in
client stack, I never mind.
2. SPEAK-COMPLETE event and double mrcp_application_session_terminate() call.
Please note, mrcp_application_session_terminate() MUST be called only once,
afterward, as soon as session termination has been acknowledged by client stack,
session must be destroyed.
In fact, double mrcp_application_session_terminate() will be sent, if SPEAK-COMPLETE
event is received, and this definitely will cause memory related issues and crashes.
I believe you met this issue, as this is the only case, when sofia_session->mutex can
be NULL. Probably you should do nothing on SPEAK-COMPLETE in the modified scenario.
3. Active sessions and quit
Also you should not "quit" (destroy demo framework), while there are active sessions
in play. I guess this could be other problem you experienced with NULL pointer in
demo framework task.
4. Garbage in Sofia's status phrase string
I assume, this is the actual issue to reproduce. I wonder, if this will be still
reproducible to you, if you ensure point 2 and 3 never happens.
Also please indicate version of Sofia you use. I guess, it's 1.12.10

Reported by achaloyan on 2009-06-15 08:03:13

achaloyan commented 9 years ago
Thanks for your answer. I thought I managed to cut and paste the whole log from the

dos box, but apparently this was not the case. Yes, it is sophia sip 1.12.10 (with

the "improved" tport.c)

ad (1) I had some doubts about the sleeps. I was going to write about this, but then

since you suggested yourself to do this (in "http://unimrcp.googlegroups.com/web/
demo_synth_application.c") I continued. So I decided to make short sleeps, and since

I knew that streaming the generated TTS speech would take considerably longer than

those two sleeps, I supposed it would not be too bad. (I must say I initially 
thought that each session would have its own tasks/threads, and that multiple 
sessions would not have trouble with each other sessions' sleeps, although I did not

dive too deep into the code and logging to figure this out. (I should add an 
(optional) thread-id to the log function.)

The alternative solution would be to implement a "delayed send MRCP request, but 
conditionally to the current state", but for the purpose of my test (test my plugin)

that was a little too much.

2. SPEAK-COMPLETE event. I'm not sure I fully understand the scope of what you are

saying. In my test I would not expect to see a SPEAK-COMPLETE event at all.
And additionally, I was hoping they were mutually exclusive: a SPEAK-COMPLETE should

not arrive (or at least be processed) if a stop response was received (and vice 
versa).

In my case, the stop complete response should arrive well before the SPEAK-COMPLETE

event. Indeed, if I would start a lot of sessions, the sleeps might get the client

into trouble. In my log file I don't see a SPEAK-COMPLETE event ? 

Note that there is a kind of race condition here. You cannot be sure what is first:

the stop from the client or the speak-complete event from the server. 

I don't think I see this race condition in my log.

I don't understand the double mrcp_application_session_terminate() you see in the 
log. In my logging the session pointers for different sessions (one behind the 
finished/destroyed other) is the same, but the sessions are different.

I guess when I used your original pause/resume scenario I met this double 
mrcp_application_session_terminate() issue, and experienced the crashes with the 
null pointer (sofia->mutex). That's exactly why I commented out the 
mrcp_application_channel_remove(session,channel);

I'm not sure what my preference is - as a developer i would want my program to 
crash, so that I can improve it - but if i would run this in production i would 
probably prefer that a problem in one session would not have an adverse effect on 
the other sessions (or channels). It may turn out to be very difficult to implement

a proper state machine (client) in this multi-tasked, multi-session, multi-channel,

networked, time-constrained and real time environment. So i would prefer the "api"

to be kind of robust against some (hopefully most) cases - even if they are 
developers faults.

(3) as mentioned before, it may be hard to make a proper state machine: one may have

to do some timed garbage collection on sessions/channels as you can never be sure 
the server terminates a session. I would prefer to be able to stop and start the 
client without a crash taking place... I know this error processing is where it 
really becomes hard, and I realise you just supplied a client demo, not a finished

application.

I commented out the block of code after the resume state (so one delay less, no stop

message to be sent and only one mrcp_application_channel_remove(session,channel) -

after the speak complete event) and I still see the corruption. No crashed though.
It is occurs less frequently - though it's not so hard to be a succesful monkey.

Thanks for all your help. Perhaps more later this week...

Reported by garmt.noname on 2009-06-17 21:24:24

achaloyan commented 9 years ago
Garmt,
I assume you are on vocation, but when you'll be back, please retest the same with
the latest release. I suspect this issue relates to Issue-29 and APR pools thread safety.

Reported by achaloyan on 2009-07-23 05:02:41

achaloyan commented 9 years ago
Any updates? Is it reproducible with 0.7 release?

Reported by achaloyan on 2009-09-09 18:32:03

achaloyan commented 9 years ago
Last week I tried to reproduce the problem (without upgrading either trunk or 0.7 
release or APR). I failed to reproduce the problem (maybe I used release instead of

debug executables !?). I will give it another try soon.

Reported by garmt.noname on 2009-09-10 11:45:24

achaloyan commented 9 years ago
Thanks for the update, Garmt.

Reported by achaloyan on 2009-09-10 12:57:01

achaloyan commented 9 years ago
Summarizing the discussion below
http://groups.google.com/group/unimrcp/browse_thread/thread/8d756e7f15d37f98?hl=en

I admit, this issue indeed existed, but should be already fixed in the trunk.

Related changes are: r1155, r1170, r1135.

Reported by achaloyan on 2009-10-12 17:01:02

achaloyan commented 9 years ago

Reported by achaloyan on 2010-04-27 12:02:25