instedd / verboice

Open source toolkit for voice services, with a special focus on the needs of medium- and low-income countries, scalable services, and interacting with vulnerable populations
http://verboice.instedd.org/
GNU General Public License v3.0

Call stays in active state upon channel error #900

Status: Open. Opened by matiasgarciaisaia 2 years ago

matiasgarciaisaia commented 2 years ago

See #778 for a similar issue.

We've observed calls in Verboice that stay in the active state for weeks, even though they have actually finished, been cancelled, or errored. The one we've just seen (call in Verboice, broker logs below) was sent via Twilio (call SID CA381882b5e5f24f714d0ba28eee084f3c in the InSTEDD 4 NCD project). The error was internal to the broker (maybe a communication issue with Surveda?); no errors are visible on the Twilio side.

We should check what the error is, and how to properly handle it.

04:32:41.937 [info] [<0.4046.459>|8da0604c] session: Session ("190487cd-1fa9-4707-a971-b397630e4fbc") dial
                "verboice.surveda-mumbai.org/twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc",
                "/twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc",
                "POST /twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc HTTP/1.1",
04:32:57.756 [info] [<0.4046.459>|8da0604c] session: Session ("190487cd-1fa9-4707-a971-b397630e4fbc") answer
                  "https://verboice.surveda-mumbai.org/twilio?VerboiceSid=190487cd-1fa9-4707-a971-b397630e4fbc"},
04:38:33.104 [error] [<0.24823.458>|91b9025e] session: Error during session "190487cd-1fa9-4707-a971-b397630e4fbc": exit:{timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}
04:38:33.105 [error] [emulator|undefined] unknown_module: Error in process <0.4328.459> on node 'verboice_broker@localhost' with exit value: {function_clause,[{http_uri,'-encode/1-lc$^0/1-0-',[{timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}],[{file,"http_uri.erl"},{... 
04:38:33.106 [info] [<0.4046.459>|8da0604c] session: Call failed with reason {internal_error,{timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}}
04:38:33.107 [warning] [<0.4046.459>|8da0604c] session: Session ("190487cd-1fa9-4707-a971-b397630e4fbc") terminated with reason: {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}
04:38:33.107 [error] [<0.4046.459>|undefined] unknown_module: gen_fsm {session,[49,57,48,52,56,55,99,100,45,49,102,97,57,45,52,55,48,55,45,97,57,55,49,45,98,51,57,55,54,51,48,101,52,102,98,99]} in state in_progress terminated with reason: {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}}
04:38:33.107 [error] [<0.4046.459>|undefined] unknown_module: CRASH REPORT Process <0.4046.459> with 0 neighbours exited with reason: {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}} in gen_fsm:terminate/7 line 611
04:38:33.107 [error] [<0.145.0>|undefined] unknown_module: Supervisor session_sup had child "190487cd-1fa9-4707-a971-b397630e4fbc" started with {session,start_link,undefined} at <0.4046.459> exit with reason {timeout,{gen_server,call,[<0.29452.458>,{capture,{url,"https://surveda-mumbai.org/audio/2f7c7f6c-c76a-47d0-9652-2d6cdff40fa0.mp3"},5,[],1,infinity},300000]}} in context child_terminated

CC: @ggiraldez

ysbaddaden commented 2 years ago

It looks like a communication issue between Verboice and Surveda: the former failed to capture the MP3 file and hit a timeout. If I understand the code correctly, the Twilio PBX crashed, the session handler noticed it and logged an internal error, but the call's state never got changed to a closed one?
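
A minimal sketch of that missing step, assuming a hypothetical call_log:update/2 persistence call (the real Verboice modules may differ): whatever notices the PBX crash should not only log the internal error but also move the call out of the active state.

-module(close_on_error_sketch).
-export([handle_session_error/2]).

%% Hypothetical error handler: log the failure and persist a terminal state
%% so the call cannot stay "active" for weeks after a broker-side error.
handle_session_error(CallLogId, Reason) ->
    error_logger:error_msg("Call failed with reason ~p~n", [Reason]),
    %% call_log:update/2 is assumed for illustration; it stands for whatever
    %% updates the call_logs row in the database.
    call_log:update(CallLogId, [{state, <<"failed">>},
                                {finished_at, calendar:universal_time()}]),
    {failed, Reason}.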

ysbaddaden commented 2 years ago

I forgot to say that I reviewed the issues logged in Sentry and found nothing remotely related, hence my suspicion of a network issue.

ysbaddaden commented 2 years ago

Erratum: the capture function has nothing to do with downloading the MP3 file from Surveda! It's instead trying to capture the respondent's reply through Twilio (for a given question), but it times out.

  1. I believe the failure starts in this specific call (which in turn calls into the actor):

    https://github.com/instedd/verboice/blob/f6229be6303390351c5461df466447f36b78667a/broker/src/twilio/twilio_pbx.erl#L49-L53

  2. Either Twilio never responds, or there was a network error, and/or the twilio_pbx actor crashed. In the end, the broker eventually hits the 5-minute timeout while waiting for a reply or a hangup. That crashes the actor, and the crash bubbles up to the session (see the sketch after this comment).

  3. It ends up calling session:finalize (with {failed, Reason}), where we properly report the error (Call failed with reason ~p), reschedule the call, and update the call log (though I'm not sure what the state value ends up being):

    https://github.com/instedd/verboice/blob/f6229be6303390351c5461df466447f36b78667a/broker/src/session/session.erl#L499

  4. It eventually calls session:terminate (I don't know how), where we report the error again (session (~p) terminated with reason ~p), but it may also enqueue a delayed_job (CallFlow::FusionTablesPush::Pusher)? The logic seems identical to CallLog#finish in the Ruby model.

    https://github.com/instedd/verboice/blob/f6229be6303390351c5461df466447f36b78667a/broker/src/session/session.erl#L447-L453


Now, it seems Verboice still considers the call to be open, even though the broker properly failed and reported the error. Maybe the CallLog state is incorrectly updated in session:finalize? What about the delayed job?
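
To make steps 1 and 2 concrete, here is a small self-contained Erlang demo (not Verboice code; module and message names are made up) of how gen_server:call/3 produces exactly the exit reason shape seen in the broker logs when the callee never replies within the timeout:

-module(timeout_demo).
-behaviour(gen_server).
-export([start/0, init/1, handle_call/3, handle_cast/2]).

init([]) -> {ok, #{}}.

%% Never reply, so the caller hits its own timeout, just like the capture
%% call that waits for the respondent's reply.
handle_call({capture, _Url}, _From, State) -> {noreply, State}.
handle_cast(_Msg, State) -> {noreply, State}.

start() ->
    {ok, Pid} = gen_server:start(?MODULE, [], []),
    try
        %% The broker uses a 300000 ms (5 minute) timeout; 1000 ms keeps the
        %% demo quick.
        gen_server:call(Pid, {capture, "https://example.org/audio.mp3"}, 1000)
    catch
        %% Same {timeout, {gen_server, call, ...}} exit reason as in the logs.
        exit:{timeout, _} = Reason -> {failed, {internal_error, Reason}}
    end.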

aeatencio commented 2 years ago

I tried hard to reproduce this bug, with no success. I did reproduce the timeout error and got the same logs as in this issue.

My strongest hypothesis is very simple: this update fails.

As a consequence of this failure, about 0.01% of the started calls in the Mumbai instance remained "active" when they had actually failed.

SELECT sum(case when finished_at is null then 1 else 0 end) `unfinished`,
    count(1) `started`,
    sum(case when finished_at is null then 1 else 0 end) / count(1) * 100 as `percentage`
FROM `call_logs`
WHERE started_at is not null

|------------|---------|------------|
| unfinished | started | percentage |
|------------|---------|------------|
|        237 | 1725747 |     0.0137 |
|------------|---------|------------|

#912 doesn't solve the underlying failure, which will keep happening, but its consequences (the stuck active calls) won't be there anymore.

ysbaddaden commented 2 years ago

Huh... is it possible that we retain a MySQL connection that gets closed during the X-minute timeout, and that this is only noticed when we try to update (EPIPE)? Could the connection be missing an auto-reconnect or something?
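
A sketch of that auto-reconnect idea, under stated assumptions (db:update/1 and db:reconnect/0 are hypothetical wrappers, not Verboice's actual MySQL layer): if the connection was silently closed during the long timeout, retry the update once on a fresh connection instead of silently losing the state change.

-module(db_retry_sketch).
-export([update_call_log/1]).

update_call_log(Query) ->
    case db:update(Query) of
        ok ->
            ok;
        {error, _ConnectionClosed} ->
            %% The EPIPE would surface here: reconnect and retry once.
            ok = db:reconnect(),
            db:update(Query)
    end.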

aeatencio commented 1 year ago

Today we seem to have had a couple of (consistent) occurrences of this issue in STG, triggered by hanging up calls placed through Callcentric. Sharing the logs: verboice_error_2.log verboice_error.log

ysbaddaden commented 1 year ago

As outlined by @ggiraldez, as soon as we get the hang-up during the Gather operation, we try to update a row in call_logs, but the update fails and the js_context column goes from null to an Erlang error term:

Logged at 9/22/2022 9:53:16 AM:

{error,
 {unrecognized_value,
  {#Ref<0.0.931.17542>,
   {dict,1,16,16,8,80,48,
    {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
     []},
    {{[],[],[],[],[],[],[],[],[],[],
      [[#Ref<0.0.931.17542>|
        {dict,5,16,16,8,80,48,
         {[],[],[],[],[],[],[],[],[],[],[],[],[],
          [],[],[]},
         {{[[<<"hub_url">>]],
           [],[],
           [[<<"_get_var">>|
             #Fun<session.17.60946945>],
            [<<"split_digits">>|
             #Fun<session.18.60946945>]],
           [],[],
           [[<<"phone_number">>,49,48,49]],
           [[<<"record_url">>|
             #Fun<session.16.60946945>]],
           [],[],[],[],[],[],[],[]}}}]],
      [],[],[],[],[]}}}}}},
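
One possible reading of that term: whatever serializes js_context ran into values it does not recognize (a reference and a dict full of funs) and the resulting error tuple ended up in the column. A heavily hedged sketch of a guard for that case (module and function names are assumptions, not the actual Verboice serializer):

-module(js_context_sketch).
-export([safe_js_context/1]).

%% Keep only values we know how to persist; anything else (refs, funs, ...)
%% is dropped instead of turning the whole column into an error term.
serialize_js_value(V) when is_binary(V); is_number(V); is_list(V) -> {ok, V};
serialize_js_value(Other) -> {error, {unrecognized_value, Other}}.

safe_js_context(Dict) ->
    [{K, V} || {K, Raw} <- dict:to_list(Dict),
               {ok, V} <- [serialize_js_value(Raw)]].
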
ysbaddaden commented 1 year ago

I reproduced with the Twilio simulator:

The simulator decided that a bunch of respondents wouldn't pick up the phone (i.e. the no-reply timeout kicked in) and reported it by sending the no-answer status to Verboice. But maybe Twilio doesn't do that, because Verboice didn't care and sent the first question anyway, expecting an answer.

Maybe this is how the "capture timeout" above gets triggered: Twilio reports an error status to the Url or <Redirect> callback, which Verboice overlooks; Verboice sends the next question in response (which Twilio discards, since it got a 200 status when delivering the status and is happy), and then Verboice waits for an answer that will never come -> timeout during capture.
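
A sketch of that hypothesis with assumed names (status_sketch, session:stop/1, session:finalize/2 and session:next_command/1 are illustrative, not the real Verboice handler): when the status Twilio posts back is terminal, the session should be closed instead of replying with the next question and then waiting for a capture that can never happen.

-module(status_sketch).
-export([handle_twilio_status/2]).

%% Twilio's CallStatus values: "completed" ends a normal call; "no-answer",
%% "busy", "failed" and "canceled" are terminal failures; anything else means
%% the call is still in progress.
handle_twilio_status(SessionPid, <<"completed">>) ->
    session:stop(SessionPid);
handle_twilio_status(SessionPid, Status)
  when Status =:= <<"no-answer">>;
       Status =:= <<"busy">>;
       Status =:= <<"failed">>;
       Status =:= <<"canceled">> ->
    session:finalize(SessionPid, {failed, {twilio_status, Status}});
handle_twilio_status(SessionPid, _InProgressStatus) ->
    %% Only keep driving the call flow while the call is actually in progress.
    session:next_command(SessionPid).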